hsahovic / poke-env

A python interface for training Reinforcement Learning bots to battle on pokemon showdown
https://poke-env.readthedocs.io/
MIT License

Self Play #160

Closed · mancho2000 closed this 3 years ago

mancho2000 commented 3 years ago

Hey,

I have a bit of a selfish request this time :) I would like to make the agent play against a saved version of itself, but I am having a really tough time making it work. I saw someone else posted a similar question, but I am using RLlib and don't know how to adapt it. I sort of got it working, but for some reason both my agent and the opponent always return 0 as the chosen action during self-play (against a random player, actions are chosen normally). Could you please give me a hand with this?

hsahovic commented 3 years ago

Hey @mancho2000,

I can take a look at your code. Let me know if you don't want to share it publicly - we can arrange something if necessary.

mancho2000 commented 3 years ago

Thank you! Here is the code. I changed the battle embedding part to match the one you provided in the example, since I don't think that part matters for this issue.

This is the main file:

```python
import asyncio
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo
import ray.rllib.agents.sac as sac
import ray.rllib.agents.dqn as dqn
import random
import tensorflow as tf

from poke_env.player.bot_trained2 import BotTrained
from asyncio import ensure_future, new_event_loop, set_event_loop
from gym.spaces import Box, Discrete
from poke_env.player.env_player import Gen8EnvSinglePlayer
from poke_env.player.random_player import RandomPlayer
from poke_env.environment.side_condition import SideCondition

STEP_COUNT = 0


class SimpleRLPlayer(Gen8EnvSinglePlayer):
    def __init__(self, *args, **kwargs):
        Gen8EnvSinglePlayer.__init__(self)
        self.observation_space = Box(low=-10, high=10, shape=(10,))

    @property
    def action_space(self):
        return Discrete(22)

    # We define our RL player
    # It needs a state embedder and a reward computer, hence these two methods

    def embed_battle(self, battle):
        moves_base_power = -np.ones(4)
        moves_dmg_multiplier = np.ones(4)
        for i, move in enumerate(battle.available_moves):
            moves_base_power[i] = (
                move.base_power / 100
            )  # Simple rescaling to facilitate learning
            if move.type:
                moves_dmg_multiplier[i] = move.type.damage_multiplier(
                    battle.opponent_active_pokemon.type_1,
                    battle.opponent_active_pokemon.type_2,
                )

        # We count how many pokemons have fainted in each team
        remaining_mon_team = (
            len([mon for mon in battle.team.values() if mon.fainted]) / 6
        )
        remaining_mon_opponent = (
            len([mon for mon in battle.opponent_team.values() if mon.fainted]) / 6
        )

        # Final vector with 10 components
        return np.concatenate(
            [
                moves_base_power,
                moves_dmg_multiplier,
                [remaining_mon_team, remaining_mon_opponent],
            ]
        )

    def compute_reward(self, battle) -> float:
        return self.reward_computing_helper(
            battle, fainted_value=2, hp_value=1, victory_value=30,
        )

    def observation_space(self):
        return np.array


class MaxDamagePlayer(RandomPlayer):
    def choose_move(self, battle):
        # If the player can attack, it will
        if battle.available_moves:
            # Finds the best move among available ones
            best_move = max(battle.available_moves, key=lambda move: move.base_power)
            return self.create_order(best_move)

        # If no attack is available, a random switch will be made
        else:
            return self.choose_random_move(battle)


ray.init()

# Loading the opponent
TRAINING_OPPONENT = 'TrainedPlayer'
MODEL_NAME = 'PPO'
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 0  # Training will not work with poke-env if this value != 0
config["framework"] = "tfe"
config["model"]["fcnet_hiddens"] = [64, 32]
config["model"]["fcnet_activation"] = "relu"
trained_opponent = ppo.PPOTrainer(config=config, env=SimpleRLPlayer)
trained_opponent.restore(SAVED MODEL PATH)

# My model
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 0  # Training will not work with poke-env if this value != 0
config["framework"] = "tfe"
config["gamma"] = 0.5
config["lr"] = 0.001
config["sgd_minibatch_size"] = 4000
config["train_batch_size"] = 4000
config["entropy_coeff"] = 0.001
config["timesteps_per_iteration"] = 128
config["num_sgd_iter"] = 30
config["model"]["fcnet_hiddens"] = [64, 32]
config["model"]["fcnet_activation"] = "relu"
trainer = ppo.PPOTrainer(config=config, env=SimpleRLPlayer)
trainer.restore(SAVED MODEL PATH)


def ray_training_function(player):
    player.reset_battles()

    for i in range(1000):
        result = trainer.train()
        print(result)
        if i % 4 == 0 and i > 2:
            checkpoint = trainer.save()
            print("checkpoint saved at", checkpoint)

    player.complete_current_battle()
    print("FINISHED TRAINING")
    checkpoint = trainer.save()
    print("checkpoint saved at", checkpoint)


def ray_evaluating_function(player):
    player.reset_battles()
    for _ in range(100):
        done = False
        obs = player.reset()
        while not done:
            action = trainer.compute_action(obs)
            obs, _, done, _ = player.step(action)
    player.complete_current_battle()

    print(
        "PPO Evaluation: %d victories out of %d episodes"
        % (player.n_won_battles, 100)
    )


env_player = trainer.workers.local_worker().env
first_opponent = RandomPlayer()
third_opponent = MaxDamagePlayer(battle_format="gen8randombattle")
ppo_opponent = BotTrained(
    battle_format="gen8randombattle",
    trained_rl_model=trained_opponent,
    model_name=MODEL_NAME,
)

# Training
print("\nTRAINING against random player:")
env_player.play_against(
    env_algorithm=ray_training_function,
    opponent=ppo_opponent,
)

# Evaluating
print("\nResults against random player:")
env_player.play_against(
    env_algorithm=ray_evaluating_function,
    opponent=first_opponent,
)

print("\nResults against max player:")
env_player.play_against(
    env_algorithm=ray_evaluating_function,
    opponent=third_opponent,
)
```
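
(For reference, the self-play evaluation I'm ultimately after would presumably just reuse the same play_against pattern as the two evaluation calls above, pointed at the frozen opponent instead; the snippet below is only a sketch, it's not part of the script yet.)

```python
# Sketch: evaluate the trained agent against the frozen PPO opponent (self-play)
print("\nResults against the frozen PPO opponent:")
env_player.play_against(
    env_algorithm=ray_evaluating_function,
    opponent=ppo_opponent,
)
```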

This is bot_trained2, from which I import BotTrained:

```python
# -*- coding: utf-8 -*-
"""
This module defines a frozen RL player
"""
from poke_env.player.env_player import Gen8EnvSinglePlayer
from poke_env.player.player import Player
from poke_env.player.battle_order import BattleOrder

from poke_env.player.ppo_bot2 import SimpleRLPlayer

import tensorflow as tf
import numpy as np
from gym.spaces import Box, Discrete
import ray
import ray.rllib.agents.ppo as ppo
import ray.rllib.agents.sac as sac
import ray.rllib.agents.dqn as dqn
from asyncio import ensure_future, new_event_loop, set_event_loop

STEP_COUNT = 0


class BotTrained(Player):
    def __init__(self, trained_rl_model, model_name, *args, **kwargs):
        # create trained_rl_model attribute from input parameter
        self.trained_rl_model = trained_rl_model

        # specify model name - changes the way the best move is selected
        # since different models have different ways of choosing a best move
        self.model_name = model_name

        # inherit all attributes and methods from parent class
        Player.__init__(self, *args, **kwargs)

        # Gen8EnvSinglePlayer.__init__(self)
        self.observation_space = Box(low=-10, high=10, shape=(2688,))
        # self.observation_space = Box(low=-10, high=10, shape=(442,))

    @property
    def action_space(self):
        return Discrete(22)

    def embed_battle(self, battle):
        moves_base_power = -np.ones(4)
        moves_dmg_multiplier = np.ones(4)
        for i, move in enumerate(battle.available_moves):
            moves_base_power[i] = (
                move.base_power / 100
            )  # Simple rescaling to facilitate learning
            if move.type:
                moves_dmg_multiplier[i] = move.type.damage_multiplier(
                    battle.opponent_active_pokemon.type_1,
                    battle.opponent_active_pokemon.type_2,
                )

        # We count how many pokemons have fainted in each team
        remaining_mon_team = (
            len([mon for mon in battle.team.values() if mon.fainted]) / 6
        )
        remaining_mon_opponent = (
            len([mon for mon in battle.opponent_team.values() if mon.fainted]) / 6
        )

        # Final vector with 10 components
        return np.concatenate(
            [
                moves_base_power,
                moves_dmg_multiplier,
                [remaining_mon_team, remaining_mon_opponent],
            ]
        )

    def compute_reward(self, battle) -> float:
        return self.reward_computing_helper(
            battle, fainted_value=2, hp_value=1, victory_value=30,
        )

    def observation_space(self):
        return np.array

    def _action_to_move(self, action, battle) -> BattleOrder:
        """Converts actions to move orders.

        The conversion is done as follows:

        0 <= action < 4:
            The actionth available move in battle.available_moves is executed.
        4 <= action < 8:
            The action - 4th available move in battle.available_moves is executed,
            as a z-move.
        8 <= action < 12:
            The action - 8th available move in battle.available_moves is executed,
            with mega-evolution.
        12 <= action < 16:
            The action - 12th available move in battle.available_moves is executed,
            while dynamaxing.
        16 <= action < 22:
            The action - 16th available switch in battle.available_switches is
            executed.

        If the proposed action is illegal, a random legal move is performed.

        :param action: The action to convert.
        :type action: int
        :param battle: The battle in which to act.
        :type battle: Battle
        :return: the order to send to the server.
        :rtype: str
        """
        if (
            action < 4
            and action < len(battle.available_moves)
            and not battle.force_switch
        ):
            return self.create_order(battle.available_moves[action])
        elif (
            not battle.force_switch
            and battle.can_z_move
            and 0 <= action - 4 < len(battle.active_pokemon.available_z_moves)
        ):
            return self.create_order(
                battle.active_pokemon.available_z_moves[action - 4], z_move=True
            )
        elif (
            battle.can_mega_evolve
            and 0 <= action - 8 < len(battle.available_moves)
            and not battle.force_switch
        ):
            return self.create_order(battle.available_moves[action - 8], mega=True)
        elif (
            battle.can_dynamax
            and 0 <= action - 12 < len(battle.available_moves)
            and not battle.force_switch
        ):
            return self.create_order(battle.available_moves[action - 12], dynamax=True)
        elif 0 <= action - 16 < len(battle.available_switches):
            return self.create_order(battle.available_switches[action - 16])
        else:
            return self.choose_random_move(battle)

    def choose_move(self, battle):
        if battle.available_moves:
            action = self.trained_rl_model.compute_action(self.embed_battle(battle))
            return self._action_to_move(action, battle)
        else:
            return self.choose_random_move(battle)
```
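
(One sanity check I plan to try, in case it helps: querying the restored opponent model directly, outside of any battle, with a dummy observation. I'm assuming here that the frozen policy expects the 10-dimensional embedding used during training; dummy_obs is just a made-up input.)

```python
# Feed the restored PPOTrainer a random observation of the shape used in training.
# If compute_action already returns 0 for arbitrary inputs, the issue is likely in
# how the checkpoint / observation space is set up rather than in the battle loop.
dummy_obs = np.random.uniform(low=-1, high=1, size=(10,)).astype(np.float32)
print(trained_opponent.compute_action(dummy_obs))
```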

Since I modified the code a bit, there could be some mistake somewhere, but the core part where I create the trained opponent and call it is unchanged.

mancho2000 commented 3 years ago

sorry for the mess, I don't know how to format this properly in here :)

hsahovic commented 3 years ago

I ran the training part of the code above successfully, albeit with a smaller number of iterations.

I think your agents might just end up in infinite loops from time to time (e.g. infinite switch loops). Are you seeing something specific that makes you think something is wrong?

PS: to format code, you can create a code block by enclosing the text between two lines containing only ```, with an optional language after the opening one, e.g.

| ```python
| def say_hello():
|     print("Hello, world!")
| ```

yields

```python
def say_hello():
    print("Hello, world!")
```

mancho2000 commented 3 years ago

The problem I have is that my agent plays well when the opponent is RandomPlayer or MaxDamagePlayer (and trained_opponent is commented out). When I make it play against itself, both the agent and the opponent just choose the first move they have (I printed the action they return and it's always 0).

I had the idea that the problem could be related to also passing SimpleRLPlayer in:

```python
trained_opponent = ppo.PPOTrainer(config=config, env=SimpleRLPlayer)
```

so I tried setting up the opponent in the same way, but still playing against RandomPlayer instead of using ppo_opponent as the opponent:

```python
first_opponent = RandomPlayer()
third_opponent = MaxDamagePlayer(battle_format="gen8randombattle")
ppo_opponent = BotTrained(
    battle_format="gen8randombattle",
    trained_rl_model=trained_opponent,
    model_name=MODEL_NAME,
)
print("\nTRAINING against random player:")
env_player.play_against(
    env_algorithm=ray_training_function,
    opponent=first_opponent,
)
```

which again makes my agent just choose the first move it has. Do you think it's possible that the problem is env=SimpleRLPlayer for trained_opponent? Or maybe that both trainer and trained_opponent are ppo.PPOTrainer instances?

Edit: I just tried connecting it to Showdown, and the same thing is happening :( it only chooses the first move for each mon (compute_action returns 0).
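
One more thing I want to rule out is an observation-shape mismatch between what the frozen policy expects and what BotTrained actually feeds it (in the code I pasted, BotTrained declares a Box with shape (2688,), but embed_battle returns 10 values, since I swapped in the example embedding). Something like the check below should tell me, assuming the RLlib trainer's get_policy() gives back the default policy here:

```python
# Compare the observation shape the restored policy expects with what BotTrained uses.
policy = trained_opponent.get_policy()  # default policy of the restored PPOTrainer
print("policy expects observations of shape:", policy.observation_space.shape)
print("BotTrained declares:", ppo_opponent.observation_space.shape)
# embed_battle() itself returns a 10-component vector (4 + 4 + 2)
```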

hsahovic commented 3 years ago

@mancho2000 I saw you closed this - have you found a solution? If not, I can take a deeper look at it over the weekend.