hsahovic / poke-env

A python interface for training Reinforcement Learning bots to battle on pokemon showdown
https://poke-env.readthedocs.io/
MIT License

Feature: Different exposing of gym environment #207

Open Benjamin-Etheredge opened 2 years ago

Benjamin-Etheredge commented 2 years ago

Have the codebase register a gym environment. I recently saw a codebase that registers its environment with gym, and it appears simple to do here. That way, anyone who installs/imports poke-env would be able to create a battler with gym.make(...).

This would require a few things.

  1. A showdown server already running somewhere
  2. A rewrite of the PlayerEnv or a new PlayerEnv.
  3. Registering with gym through somewhere in the codebase.
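The registration flow the three steps above describe could look roughly like this. This is a minimal, self-contained sketch of the mechanism: `EnvRegistry`, `ShowdownEnv`, and the `"PokeEnv-v0"` id are all hypothetical stand-ins, and a real implementation would call gym's own `register()`/`make()` instead.

```python
class ShowdownEnv:
    """Placeholder for a PlayerEnv-style gym environment."""

    def __init__(self, server_url="localhost:8000"):
        # Assumption: a Showdown server is already running at server_url.
        self.server_url = server_url


class EnvRegistry:
    """Hand-rolled stand-in for gym's env-id -> entry-point registry."""

    def __init__(self):
        self._entry_points = {}

    def register(self, env_id, entry_point):
        self._entry_points[env_id] = entry_point

    def make(self, env_id, **kwargs):
        return self._entry_points[env_id](**kwargs)


# At import time, poke-env would register its environments once:
registry = EnvRegistry()
registry.register("PokeEnv-v0", ShowdownEnv)

# Users could then create a battler the gym way:
env = registry.make("PokeEnv-v0", server_url="localhost:8000")
```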

I have some ideas for a PlayerEnv rewrite. The basic idea is to have PlayerEnv keep track of opponents and opponent IDs. When given an opponent, that opponent will be set to just accept all challenges. This allows the player to just challenge whatever opponent ID whenever it wants. Default opponents can also be set such as base PlayerEnvs for random or the baseline models that exist in the codebase.

This is to facilitate easier access to a gym wrapper as well as add support for self-play. By simply pushing more opponents onto the list, different opponents can be cycled every game. A sample method can also be added to allow flexibility in how opponents are challenged and in what order.
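The opponent-cycling idea could be sketched like this. `OpponentPool`, `add_opponent`, and `sample` are hypothetical names for illustration, not anything in poke-env today.

```python
from collections import deque
from typing import Deque


class OpponentPool:
    """Tracks a bank of opponent IDs and decides who gets challenged next."""

    def __init__(self):
        self._opponents: Deque[str] = deque()

    def add_opponent(self, opponent_id: str) -> None:
        # Pushing more opponents onto the pool enables self-play rotation.
        self._opponents.append(opponent_id)

    def sample(self) -> str:
        # Default strategy: round-robin. Users could override this method
        # to pick opponents randomly, by rating, etc.
        opponent_id = self._opponents[0]
        self._opponents.rotate(-1)
        return opponent_id
```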

I know this description likely isn't very clear. I'm working on implementing it now. I'll try to get some preliminary code in a pull request soon for feedback.

hsahovic commented 2 years ago

@Benjamin-Etheredge yeah, both this and deep PlayerEnv rewrites are planned, but the current plan depends on other deep backend changes (making things more synchronous in general). I was thinking of maybe having something like a PlayerEnv.set_opponent method or even PlayerEnv.opponent attribute, that would default to a RandomPlayer - do you think that would be a good api?

Benjamin-Etheredge commented 2 years ago

I think going more synchronous is probably the way to go. Maybe just throw things onto other processes? I'm not super familiar with Python multi-threading/processing.

I sorta did something similar. I had an opponent with a set_policy method. Then had the learner call that to update it. Something like having a set_opponent was something I considered. It would probably work just fine.

For self-play stuff, I'm debating whether the env should keep track of which opponents are played and which are in the bank of opponents. By letting the environment keep track of it, the learner still has to call add_opponent. That's not much different than just letting the learner track opponents and call set_opponent though.

So I'm leaning more towards just having a set_opponent or set_opponent_policy type thing. I'm not totally sure what the best way is yet though.

Also, I say don't worry about defaulting to a random opponent. Have it register a few things like RandomOpponent-v0 and MaxDamageOpponent-v0. Then when calling gym.make(...), an opponent can be selected. As long as those environments implement a set_opponent or set_policy, they can also be used for self-play.
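One way to read the suggestion above: bind one env id per baseline opponent, and require each registered env to expose a `set_policy` hook so the same ids can be reused for self-play. Everything in this sketch (`RandomOpponent`, `MaxDamageOpponent`, `DEFAULT_ENVS`, `make`) is hypothetical; the real ids would be registered with gym.

```python
class RandomOpponent:
    """Stand-in for an env whose default opponent plays randomly."""

    def set_policy(self, policy):
        # Swapping in a learned policy turns this into a self-play opponent.
        self._policy = policy


class MaxDamageOpponent:
    """Stand-in for an env whose default opponent maximizes damage."""

    def set_policy(self, policy):
        self._policy = policy


# One env id per default opponent, gym-style:
DEFAULT_ENVS = {
    "RandomOpponent-v0": RandomOpponent,
    "MaxDamageOpponent-v0": MaxDamageOpponent,
}


def make(env_id):
    # gym.make(...) analogue: the opponent is selected by choosing the id.
    return DEFAULT_ENVS[env_id]()
```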

I don't know. I'm just brainstorming at this point.

Benjamin-Etheredge commented 2 years ago

@hsahovic I kind of want to implement it in a way to work for RandomPlayers and ones that are handed a policy to follow. The messy part is bridging the embeddings over.

Would it be better to just create a DummyPlayer that takes policies in? The user would need to create callables that take in a Battle and output an int. Something like this:

class DummyPlayer(Player):
    def __init__(
        self, 
        *args, 
        policy: Optional[Callable[[Battle], int]] = None, 
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self._policy = policy

    def choose_move(self, battle) -> BattleOrder:
        if self.policy is None:
            return self.choose_random_move(battle)

        action = self._policy(battle)
        return self.create_order(action)

Or would it be better to create the DummyPlayer within PlayerEnv so that it can access embed_battle?

Benjamin-Etheredge commented 2 years ago

Or, the PlayerEnv could do the wrapping of the policy itself. It could provide a factory method that uses its own embed_battle to produce a policy.

hsahovic commented 2 years ago

I think it's easier to create a custom class instead of having a DummyPlayer system, e.g.:

class CustomPlayer(Player):
    def choose_move(self, battle):
        # Custom policy here
        ...

You can also add policy parameters if need be:

class CustomPlayer(Player):
    def __init__(
        self, 
        *args, 
        policy_parameters,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.policy_parameters = policy_parameters

    def choose_move(self, battle):
        # Custom policy that uses self.policy_parameters

# You can update parameters easily too
custom_player = CustomPlayer(..., policy_parameters)
custom_player.policy_parameters = new_policy_parameters

Regarding the opponent, I'm thinking of removing the opponent arg from play_against and instead have a _opponent property with a setter, that can be changed mid-play_against call or anywhere in the code.
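A rough sketch of that proposal, assuming a property with a setter that can be reassigned at any point (including mid-play_against). The names here are illustrative, not poke-env's actual API; `RandomPlayerStub` stands in for the RandomPlayer default.

```python
class RandomPlayerStub:
    """Stand-in for poke_env's RandomPlayer default opponent."""


class PlayerEnvSketch:
    def __init__(self):
        # Defaults to a random opponent, per the discussion above.
        self._opponent = RandomPlayerStub()

    @property
    def opponent(self):
        return self._opponent

    @opponent.setter
    def opponent(self, new_opponent):
        # Reassigning here would take effect for the next challenge,
        # even while a play_against loop is running.
        self._opponent = new_opponent
```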

Benjamin-Etheredge commented 2 years ago

I like the idea of making _opponent a property. That could work nicely.

I would lean away from passing in parameters for it to use a policy with. I like the idea of making users give it a Callable that takes in a Battle and returns a BattleOrder. It makes it more flexible.

There was a bug in my DummyPlayer. Below is the one I have working now.

class DummyPlayer(Player):
    def __init__(
        self, 
        *args, 
        policy: Optional[Callable[[Battle], BattleOrder]] = None, 
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self._policy = policy if policy is not None else self.choose_random_move

    def set_policy(self, policy: Callable[[Battle], BattleOrder]):
        self._policy = policy

    def choose_move(self, battle: Battle) -> BattleOrder:
        return self._policy(battle)

By forcing the user to create the policy Callable, the user-implemented embedder can be injected like so:

class PlayerEnv(Player, Env, ABC):
    ...
    def set_opponent_policy(self, policy: Callable[[Any], int]) -> None:
        def policy_wrapper(battle: AbstractBattle) -> BattleOrder:
            battle_encoding = self.embed_battle(battle)
            action = policy(battle_encoding)
            return self._action_to_move(action, battle)

        self._opponent.set_policy(policy_wrapper)

Now anyone who extends PlayerEnv can just call set_opponent_policy with their policy logic that takes in whatever they want (i.e., what comes out of their embed_battle) and have it create the appropriate move order.
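Here is a self-contained illustration of that injection, with stub `embed_battle` / `_action_to_move` implementations (the real ones live on PlayerEnv subclasses and poke_env's Battle/BattleOrder types). The point is that the user's policy only ever sees the user's own embedding.

```python
from typing import Any, Callable


class StubEnv:
    """Toy stand-in for a PlayerEnv subclass."""

    def embed_battle(self, battle: Any) -> list:
        # Hypothetical embedding: a one-element feature vector.
        return [len(str(battle))]

    def _action_to_move(self, action: int, battle: Any) -> str:
        # Stand-in for converting an action index into a BattleOrder.
        return f"order-{action}"

    def make_policy_wrapper(self, policy: Callable[[Any], int]):
        # Mirrors set_opponent_policy above: embed, run the policy,
        # translate the action back into an order.
        def policy_wrapper(battle):
            encoding = self.embed_battle(battle)
            return self._action_to_move(policy(encoding), battle)

        return policy_wrapper


env = StubEnv()
wrapped = env.make_policy_wrapper(lambda encoding: encoding[0] % 4)
```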

Benjamin-Etheredge commented 2 years ago

After thinking on it some more, I like your approach of having an _opponent attribute and setting it. With my current approach, I can't see which policy from my queue of policies is currently being battled. The _opponent approach would allow better tracing when watching the Showdown battles. I'll try switching to that kind of approach soon.

Benjamin-Etheredge commented 2 years ago

Another path to explore would be to simplify PlayerEnv so that step just returns battles. Having users extend the class to generate embeddings can get messy and complicated. It might be better for them to generate embeddings with a gym wrapper rather than through class inheritance. That also seems more gym-like.
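That wrapper-based alternative could look like this. `BattleEnv` and `EmbeddingWrapper` are hypothetical stand-ins (the latter hand-rolls what gym's ObservationWrapper provides), so users compose embeddings instead of subclassing PlayerEnv.

```python
class BattleEnv:
    """Toy env whose step returns the raw battle, not an embedding."""

    def step(self, action):
        battle = {"action": action}  # stand-in for a Battle object
        reward, done, info = 0.0, False, {}
        return battle, reward, done, info


class EmbeddingWrapper:
    """Hand-rolled ObservationWrapper analogue: embeds via composition."""

    def __init__(self, env, embed):
        self.env = env
        self.embed = embed  # user-supplied battle -> feature-vector function

    def step(self, action):
        battle, reward, done, info = self.env.step(action)
        return self.embed(battle), reward, done, info


env = EmbeddingWrapper(BattleEnv(), embed=lambda battle: [battle["action"]])
```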