hsahovic / poke-env

A python interface for training Reinforcement Learning bots to battle on pokemon showdown
https://poke-env.readthedocs.io/
MIT License

Live training feed, visualisation of agent performance data? #148

Closed Trailblazin closed 3 years ago

Trailblazin commented 3 years ago

Hi @hsahovic ,

I have been working with the Poke-Env environment for a couple of months as the experimental basis for my Bachelor's Thesis; I very much appreciate the work done here as it's a novel yet relatable application of Reinforcement Learning and I hope to do this work justice in some way.

In light of this, and since it is probably relevant to your own publication, I would like to ask for your assistance in developing a means to showcase an agent's performance metrics. I did not see any examples of how to do this in the documentation, but I think it would be a good topic to discuss.

I have developed a notebook that uses Stable Baselines3 to visualise the reward per timestep; however, I feel this is not particularly useful to a Pokemon player and does not showcase the skill of an agent. Is there any way to visualise other metrics, such as damage per turn/timestep, as well as obtain the replay of a training/testing session via the local/online Showdown server?

hsahovic commented 3 years ago

Hey @Trailblazin,

Thanks for your feedback. The reward is interesting as it relates to your training process, but I agree that it can be harder to interpret. Player objects implement a win rate property, which could be the first metric you plot. Going one step further, we could implement win rate over the last n battles, average number of opponent / own KOs per game, average number of turns per game, average number of switches / attacks, average final HP difference, etc. Which ones would you consider a priority (either among the ones I mentioned or something else)?
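
For example, something along these lines (untested; run_evaluation_battles is a stand-in for however you run your evaluation games) would give you a first win rate plot:

    import matplotlib.pyplot as plt

    win_rates = []
    for checkpoint in range(10):
        run_evaluation_battles(player, n_battles=100)  # stand-in: run a batch of eval battles
        win_rates.append(player.win_rate)              # wins / finished battles so far

    plt.plot(win_rates)
    plt.xlabel("evaluation checkpoint")
    plt.ylabel("win rate")
    plt.show()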

Icemole commented 3 years ago

@against_buenillo I also had the same problem as you. The graph above shows the reward of the agent with respect to the episodes (see the blue "line" in the image). What I did was implement a simple incremental mean, where slot i is the mean of all previous episodes. It might be better to consider only a limited number of episodes; for instance, a window of 1000 or 2000 episodes might show the evolution of the player more clearly.

I inserted the code in dqn_training(), after the call to player.complete_current_battle() (I guess you're training a DQN agent). If you are in a hurry, here is the code (@hsahovic feel free to use it or optimize it as you will). As I said, it incrementally computes the mean. It's very simple but if you see any typos or errors, please feel free to correct them:

    import matplotlib.pyplot as plt

    # training_history: the History object returned by the DQN fit call in dqn_training()
    x = training_history.history["nb_steps"]
    y = training_history.history["episode_reward"]

    avg_y = []
    avg_current = 0
    for val in y:
        # Incremental mean: running average of all episode rewards seen so far
        avg_current = (val + avg_current * len(avg_y)) / (len(avg_y) + 1)
        avg_y.append(avg_current)

    plt.plot(x, y, label="Reward")           # reward at episode i
    plt.plot(x, avg_y, label="Mean reward")  # mean reward up to episode i
    plt.legend()
    plt.show()
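
If you prefer the windowed version I mentioned, a minimal variant of the same idea could be the following (a sketch, reusing the x and y arrays from above; the window size of 1000 is just an example):

    from collections import deque

    import matplotlib.pyplot as plt

    window = deque(maxlen=1000)  # keep only the last 1000 episode rewards
    windowed_avg = []
    for val in y:
        window.append(val)
        windowed_avg.append(sum(window) / len(window))

    plt.plot(x, windowed_avg, label="Mean reward (last 1000 episodes)")
    plt.legend()
    plt.show()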
Trailblazin commented 3 years ago

Hi @hsahovic ,

Many thanks for the swift reply.

All of these are worthwhile metrics to explore; they are probably the most useful for overall performance. Since the Player object contains a win rate property, I'll see if I can work on this immediately, using the provided training example.

I would like to add that I was asking this question because, to be brief, working with RL/ML is still new to me, whereas I've played Pokemon competitively; I know what I want to do with the agent but not how to go about it. Whilst I would appreciate the effort, if you do not find it useful to implement such metric plotting yourself, advice alone is more than satisfactory.

Nonetheless, I would also like to consider, if the Player object or environment allows it, measuring status-relevant and hazard-relevant metrics. This would include metrics such as: number of status moves used; number of events per status type (turns lost to paralysis/freeze/sleep, damage taken from burn/poison/Toxic, health regained from Leech Seed/Ingrain); number of stat drops inflicted by hazards; and damage inflicted by hazards.
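
For concreteness, here is a rough sketch of the kind of status metric I have in mind, written against what I believe the battle objects expose (a Status enum and a per-Pokemon status attribute; I haven't verified these names, so treat it as pseudocode):

    from poke_env.environment.status import Status

    def opponent_status_rate(player, status=Status.PSN):
        # Fraction of battles in which at least one opposing pokemon ends the
        # game with the given status. Note: a poisoned pokemon that later
        # faints would not be counted, since its final status is fainted.
        games = list(player.battles.values())
        with_status = 0
        for g in games:
            if any(mon.status == status for mon in g.opponent_team.values()):
                with_status += 1
        return with_status / len(games) if games else 0.0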

Now, to answer your question on priority: firstly, the standard performance metrics you have given are the priority. Win rate showcases average performance; the agent-KO-to-opponent-KO ratio showcases performance per game; and average final HP difference showcases the degree of performance, since a win with a higher HP difference implies a more decisive victory.

Referring to my experience in previous generations of VGC play, the next most important factors to consider are entry hazards and status effects. These allow the agent to gain additional reward on a timestep; in layman's terms, free damage/debuffs against the enemy, especially from Stealth Rock. They accumulate over the course of a match and can guarantee KOs that would otherwise be unavailable on a particular turn.

A particular example is from way back in Gen 5: Garchomp vs Landorus-Therian. With all Smogon variants, this is a potential 2HKO 1v1 scenario for both sides (3HKO at worst, statistically favoring Landorus), assuming a neutral start. With the inclusion of Stealth Rock on either side, Landorus is guaranteed to win in two turns, whereas Garchomp requires a high damage roll. However, if Landorus is poisoned the turn prior, Garchomp will always win. There are several other examples I could pull from competitive data for this generation/Gen 7 to further support my notion of the potential statistical importance of hazards/status.

Nonetheless, average number of turns per game and average number of switches / attacks allude to something much more engaging: agent tactics. With these metrics, we may be able to deduce how aggressive/reactive/passive the agent is in a particular episode and thus which style of play works against particular team sets.

Apologies for the long reply; I had a lot of ideas after taking some time away. Which of these, if any, are available to examine in the present state of the program?

hsahovic commented 3 years ago

@Trailblazin Thanks for taking the time to write this up :) @Icemole's comment shows one way of keeping track of average reward; you can keep track of other metrics you'd like to define similarly.

In the short-term, I will not add dedicated support for metrics, as there are a couple of things I want to do first. Metrics will be added later, though. That being said, I put together a couple of functions that might correspond to some of what you are looking for. I have not thoroughly tested them.

    from poke_env.environment.side_condition import SideCondition

    def last_n_games(player, n=10):
        # Battles are stored in player.battles (a dict keyed by battle tag);
        # sort them and keep the n most recent ones.
        sorted_games = [game for _, game in sorted(player.battles.items())]

        if len(sorted_games) > n:
            sorted_games = sorted_games[-n:]

        return sorted_games

    def last_n_games_win_rate(player, n=10):
        games = last_n_games(player, n)
        return len([g for g in games if g.won]) / len(games)

    def last_n_games_hp_difference(player, n=10):
        # Average difference in total remaining HP fraction (out of 6 mons);
        # pokemon that were never revealed are counted as being at full HP.
        games = last_n_games(player, n)

        hp_difference = []

        for g in games:
            my_hp = 6
            their_hp = 6

            for mon in g.team.values():
                my_hp += mon.current_hp_fraction - 1

            for mon in g.opponent_team.values():
                their_hp += mon.current_hp_fraction - 1

            hp_difference.append(my_hp - their_hp)
        return sum(hp_difference) / len(hp_difference)

    def last_n_games_length(player, n=10):
        # Average number of turns per game.
        games = last_n_games(player, n)
        return sum([g.turn for g in games]) / len(games)

    def last_n_games_remaining_mons_difference(player, n=10):
        # Average difference in number of non-fainted pokemon at the end of the game.
        games = last_n_games(player, n)

        mon_differences = []

        for g in games:
            my_mons = 6 - len([m for m in g.team.values() if m.fainted])
            their_mons = 6 - len(
                [m for m in g.opponent_team.values() if m.fainted]
            )
            mon_differences.append(my_mons - their_mons)

        return sum(mon_differences) / len(mon_differences)

    def last_n_games_finished_with_entry_hazard(player, n=10):
        # Fraction of games that ended with at least one entry hazard
        # on the opponent's side of the field.
        games = last_n_games(player, n)
        with_entry_hazard = 0
        entry_hazard = {
            SideCondition.SPIKES,
            SideCondition.STEALTH_ROCK,
            SideCondition.STICKY_WEB,
            SideCondition.TOXIC_SPIKES,
        }

        for g in games:
            if entry_hazard.intersection(g.opponent_side_conditions):
                with_entry_hazard += 1

        return with_entry_hazard / len(games)
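
For example, after an evaluation run you could report them like this (again, untested):

    print("win rate (last 50):", last_n_games_win_rate(player, n=50))
    print("avg turns (last 50):", last_n_games_length(player, n=50))
    print("avg HP difference (last 50):", last_n_games_hp_difference(player, n=50))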
Trailblazin commented 3 years ago

@hsahovic Thanks again for getting back so quickly, I hope my write-up was not too wordy without being informative in some way :D

I totally understand such metrics not being a priority for the environment, and I appreciate the provided functions; they'll be of great assistance. From my understanding, the player object holds the data for each battle it has played, and we can use this to isolate relevant data from a timestep in the play session. (I'll take a proper read through the documentation to better understand this, in case I'm wrong.) You've applied this to iterate across a collection of games and calculate various metrics.

@Icemole Thank you for the solution for plotting the average reward per episode; it proves rather helpful and gives a more comprehensive representation than my adapted moving-average example from Stable Baselines (I'd forgotten to link this earlier). I am using this on a DQN network, but I'm also attempting to apply it to other algorithms and frameworks.
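
(For reference, the moving-average plot I adapted is along the lines of the Stable Baselines3 results-plotter utilities; roughly the sketch below, assuming training ran inside a Monitor wrapper logging to log_dir, though not necessarily exactly what I used.)

    import matplotlib.pyplot as plt
    import numpy as np
    from stable_baselines3.common.results_plotter import load_results, ts2xy

    def moving_average(values, window):
        # Simple rolling mean over a fixed window of episodes
        weights = np.repeat(1.0, window) / window
        return np.convolve(values, weights, "valid")

    x, y = ts2xy(load_results(log_dir), "timesteps")  # log_dir: Monitor output folder
    y_avg = moving_average(y, window=50)
    plt.plot(x[len(x) - len(y_avg):], y_avg)
    plt.xlabel("timesteps")
    plt.ylabel("episode reward (moving average)")
    plt.show()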

I'll try to use these functions in the coming days to experiment with metric plotting and see what I can come up with; I'll happily inform you of my results after some time.