nazaruka / gym-http-api

NSGA2-based Sonic agent + experimental code
MIT License

Novelty Search #16

Closed. schrum2 closed this issue 5 years ago.

schrum2 commented 5 years ago

Looking at how Sonic fails, and also reading up on some of the reports about the various attempts at the Sonic domain, I really think that Novelty Search ( https://www.cs.ucf.edu/eplex/noveltysearch/userspage/ , https://arxiv.org/abs/1712.06560 ) could be beneficial in this domain. In particular, the reward is very deceptive: Sonic is rewarded for moving to the right, but many interesting levels require significant backtracking.

However, typical Novelty Search as implemented on small neural networks is not scalable. What we need is Novelty Search with Deep Networks, which is the code that we've already had trouble running on Windows: https://github.com/uber-research/deep-neuroevolution

However, there are many implementations of Novelty Search out there, and we may not need to use this one. In particular, all Novelty Search is really doing is changing the fitness function, but for Deep Neuroevolution to be successful, people typically use CMA-ES. In fact, one of the contest entrants used CMA-ES combined with a large variety of complicated other stuff: https://github.com/dylandjian/retro-contest-sonic

If we remove the complicated World Models aspect from that code and just run the CMA-ES part, but then replace the fitness function with a Novelty based one, then we might create an agent that can move left sometimes.

Here are several bite-sized sub-tasks associated with this issue that I would like you to tackle:

  1. Create an agent that plays the game using a randomly initialized deep network architecture that DOESN'T learn. All of the algorithms we've been playing with use a network in this manner, but I think that most of the code for actually doing this is typically buried deeply in the code. You need to move this to the forefront. Basically, what you need is a function called "evaluate" that takes the weights of the network as a list, puts them together in an architecture, and then uses the network to play the game.
  2. Develop a way of tracking Sonic's behavior for the sake of creating a behavior characterization to use with Novelty Search. A typical example would be periodically logging Sonic's x/y coordinates throughout the duration of evaluation.
  3. After completing step 1, try using CMA-ES to evolve the weights of the arbitrary network. The repo above should have an example of this. Note that there is a Python package for CMA-ES (cma) that you can simply pip install. Use the standard rewards as fitness for this.
  4. Once 1, 2, and 3 are done, you can replace the "standard" fitness with a Novelty-based fitness. This would typically be the average Euclidean distance between different behavior characterizations. However, we will need a way to account for the fact that different agent evaluations will be of different lengths. (A rough sketch of how these pieces might fit together follows this list.)
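
To make the roadmap concrete, here is a rough sketch of how items 1 through 4 might fit together. Nothing in it is existing project code: build_policy_from_weights is a stand-in for whatever network we pull out of the current agents (here just a linear map over a coarsely downsampled frame), the behavior characterization is periodic x/y samples, evaluations of different lengths are handled by truncating to a common prefix, and the info dict is assumed to expose 'x' and 'y' (this depends on the game integration's data.json). It relies on the cma package (pip install cma) and gym-retro.

import numpy as np
import cma
import retro

def build_policy_from_weights(weights, n_buttons=12, grid=8):
    # Placeholder policy, NOT the project's network: a linear map from a
    # coarsely downsampled grayscale frame to the 12 Genesis buttons.
    w = np.asarray(weights).reshape(grid * grid, n_buttons)
    def policy(obs):
        gray = np.asarray(obs, dtype=float).mean(axis=-1)
        h, wpx = gray.shape[0] // grid, gray.shape[1] // grid
        coarse = gray[:h * grid, :wpx * grid].reshape(grid, h, grid, wpx).mean(axis=(1, 3))
        return (coarse.ravel() @ w > 0).astype(np.int8)   # binary button presses
    return policy

def evaluate(weights, env, max_steps=4500, sample_every=50):
    # Item 1: rebuild the policy from a flat weight vector and play one episode
    # with no learning. Item 2: record (x, y) periodically as the behavior.
    policy = build_policy_from_weights(weights)
    obs = env.reset()
    total_reward, behavior = 0.0, []
    for t in range(max_steps):
        obs, rew, done, info = env.step(policy(obs))
        total_reward += rew
        if t % sample_every == 0:
            behavior.append((info['x'], info['y']))   # assumes data.json exposes x/y
        if done:
            break
    return total_reward, np.array(behavior, dtype=float).ravel()

def novelty(behavior, archive, k=15):
    # Item 4: average Euclidean distance to the k nearest archived behaviors,
    # truncating to a common length as a crude fix for unequal episode lengths.
    if not archive:
        return 0.0
    n = min(len(behavior), min(len(b) for b in archive))
    dists = sorted(np.linalg.norm(behavior[:n] - b[:n]) for b in archive)
    return float(np.mean(dists[:k]))

# Item 3: CMA-ES over the flat weight vector (pip install cma), here shown
# maximizing novelty; swap in total_reward to use the "standard" fitness instead.
env = retro.make(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1')
num_weights = 8 * 8 * 12
es = cma.CMAEvolutionStrategy(np.zeros(num_weights), 0.5)
archive = []
for generation in range(10):          # small cap just for the sketch
    solutions = es.ask()
    scores = []
    for w in solutions:
        _, behavior = evaluate(w, env)
        scores.append(-novelty(behavior, archive))   # cma minimizes, so negate
        archive.append(behavior)
    es.tell(solutions, scores)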
schrum2 commented 5 years ago

No Novelty Search in PyTorch-NEAT. Just follow the roadmap above.

nazaruka commented 5 years ago

I found a few resources that have to do with loading the x-y coordinates from a Retro "movie": https://medium.com/@tristansokol/making-fun-visuals-history-maps-and-other-tool-improvements-eb5ffe187fd3 and https://github.com/AurelianTactics/sonic_1_utilities, though I'm not sure how they can be used to track x-y coordinates as we run the level. retro_env.py, which retro_contest.local references by importing retro, does have some details regarding x and y positions; I could pull them from there and see where it goes.

schrum2 commented 5 years ago

Look at the sonicNEAT repo. I'm fairly sure it references the x coordinates at least.

nazaruka commented 5 years ago

I think I've found an applicable solution. Examine ppo2ttifrutti_sonic_env.py's make_custom method:

def make_custom(stack=True, scale_rew=True):
    # Create the raw Gym Retro environment for the chosen game and level.
    env = make(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1')

    env = SonicDiscretizer(env)      # map button combinations to a discrete action space
    if scale_rew:
        env = RewardScaler(env)      # scale rewards down by a factor of 100
    env = WarpFrame96(env)           # resize observation frames to 96x96
    if stack:
        env = FrameStack(env, 4)     # stack the last four frames
    env = AllowBacktracking(env)     # keep reward from punishing leftward movement
    return env

The purpose of these calls is to load an environment that fits the specifications imposed by all these wrapper classes. SonicDiscretizer wraps a Gym Retro environment and maps Sonic's button combinations onto a discrete action space, RewardScaler scales rewards down by a factor of 100, WarpFrame96 resizes frames to 96x96, and FrameStack is a class in /baselines/common/atari_wrappers.py that stacks a given number of consecutive frames (in our case, four). The only one of these classes that overrides the step function is AllowBacktracking. Its step function starts off by declaring the following:

obs, rew, done, info = self.env.step(action)

If we add a print(info) statement right after this call, we will see several prints of the following form (with changing values, of course):

{'prev_lives': 3, 'xpos_last_x': 0, 'level_end_bonus': 0, 'rings': 7, 'prev_progress': 0, 
'score': 30, 'zone': 0, 'act': 0, 'screen_x_end': 9407, 'screen_y': 573, 'offset_x': -80, 
'lives': 3, 'x': 5499, 'y': 673, 'screen_x': 5339}

We can see that info is a dictionary of game variables read out of the emulator (lives, rings, score, position, and so on), not something tied to the action space. Additionally, if we print the aggregate reward at this step (what AllowBacktracking calls self._cur_x, in this case 51.40282457470901), we notice that, scaled up by a factor of 100, it is close to both x and screen_x. AllowBacktracking reconciles the difference between the scaled aggregate reward and the two x values with the following commands:

rew = max(0, self._cur_x - self._max_x)
self._max_x = max(self._max_x, self._cur_x)

From this, we can infer that nothing is added to the reward when self._cur_x is less than self._max_x, but why does the aggregate lose value when Sonic fails to get past the Green Hill Zone Act 1 loop (self._cur_x rises to 52.09527823686597 and drops to as low as 49.79026137590422 when Sonic tries to backtrack)? Then again, self._cur_x itself could be updated by a separate function and not by AllowBacktracking, which only seems to clamp the reward at zero when Sonic backtracks.

In short, we can get the x and y coordinates simply by calling xpos = info['x'] and ypos = info['y'], but we are now tasked with finding the reward function and understanding how it manipulates the rewards.
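
As a quick sanity check of that idea, here is a minimal sketch (my own, not anything already in the repo) of a wrapper that records info['x'] and info['y'] at every step and dumps the trace to a CSV when the episode ends; PositionLogger and the output path are made-up names:

import csv
import gym

class PositionLogger(gym.Wrapper):
    # Record Sonic's (x, y) each step and dump the trace to a CSV on episode end.
    def __init__(self, env, out_path='positions.csv'):
        super().__init__(env)
        self.out_path = out_path
        self.trace = []

    def reset(self, **kwargs):
        self.trace = []
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self.trace.append((info['x'], info['y']))
        if done:
            with open(self.out_path, 'w', newline='') as f:
                csv.writer(f).writerows(self.trace)
        return obs, rew, done, info

# e.g. env = PositionLogger(make_custom(), out_path='episode0.csv')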

nazaruka commented 5 years ago

I should also note that I don't think y plays any role in how the reward is calculated; it seems to be much more of an x thing, given AllowBacktracking's declarations.

schrum2 commented 5 years ago

This is all useful information in general, but for the sake of this particular issue we just want to focus on logging the x/y coordinates over time. Do a run where you log all x/y coordinates to a separate file for each episode. Then you can plot those files using Excel or gnuplot with each file in a different color, and see the movement paths in the level. This will be a good visual confirmation that this works. Once we have that, we can think about how to make a uniform behavior characterization for use by Novelty Search.

nazaruka commented 5 years ago

After much deliberation, I believe I have gotten the general format down for saving and loading plots. Within a new file ppo2plotter.py, I created two methods, createCSV and genPlot, to test a specific case with static values. Plotting values in Python is accomplished with the aid of the matplotlib.pyplot and pandas packages, which work off of one another and offer a considerable degree of flexibility based on optional arguments.

import csv
import matplotlib.pyplot as plt
import pandas as pd

def createCSV(num):
    # Write a small test CSV with static timestep/x/y values.
    with open('data%i.csv' % num, 'w', newline='') as csvfile:
        fieldnames = ['timesteps', 'xpos', 'ypos']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for i in range(10):
            writer.writerow({'timesteps': i*1000, 'xpos': i, 'ypos': 10 - i})

def genPlot(num):
    # Read the CSV back with pandas and plot both columns against timesteps.
    df = pd.read_csv("data%i.csv" % num)
    plt.plot('timesteps', 'xpos', data=df, marker='', color='red', linewidth=2)
    plt.plot('timesteps', 'ypos', data=df, marker='', color='blue', linewidth=2)
    plt.title('Episode %i' % num)
    plt.xlabel('Timesteps')
    plt.ylabel('Value')
    plt.legend()
    plt.show()

Running a main method by calling createCSV five times in a for-loop generated identical files data0.csv, data1.csv, ..., data4.csv:

timesteps xpos ypos
0         0    10
1000      1    9
2000      2    8
3000      3    7
4000      4    6
5000      5    5
6000      6    4
7000      7    3
8000      8    2
9000      9    1

and calling genPlot(2) produced the following image:

[image: line plot of xpos and ypos against timesteps, labelled "Episode 2"]

Awesome, now we know how to plot data. To make this code relevant to Sonic, I am tasked with three objectives:

  1. Ensuring that createCSV will receive the x and y values from the environment's info dictionary in a given episode, terminating writing when the episode is done. Fulfilling this point means that I would have to expand the list of parameters from one (just num) to four (episode number, timesteps, x, y)
  2. Writing the CSV files to the same location in Temp as checkpoints. The checkpoints directory is created there through the logger package imported from baselines; creating a parallel plots directory to hold all the episodes' CSV files would probably be the best course of action
  3. Finding a more convenient way to plot the data (invoking genPlot(2) from ppo2plotter works, but could we make it a little more accessible?)
nazaruka commented 5 years ago

Graphing the x-y coordinates of a Sonic agent

A quick disclaimer

Before I begin discussing how I managed to integrate x and y plotting into the Sonic agent, I should emphasize one detail about this implementation of PPO. There is no explicit notion of "episodes"; rather, there are updates, whose number depends on the ratio between total_timesteps and nsteps. This fact alone makes it harder to track x and y per episode than it would be for an algorithm like DQN, which takes an explicit number of episodes into consideration.

Implementation

First, I created a method in ppo2plotter.py to initialize a directory called plots in the same location as checkpoints:

import os
import os.path as osp
from baselines import logger

def createPlotPath():
    plotdir = osp.join(logger.get_dir(), 'plots')
    os.makedirs(plotdir, exist_ok=True)
    return plotdir

Of course, the checkpoints directory is only created when logger.get_dir() returns a path, so it makes sense to guard the plots directory the same way in ppo2ttifrutti.py:

if logger.get_dir():
    plotdir = plot.createPlotPath()

Next, I had to ensure that a CSV would be created at the beginning of each update, since creating it only when the update finished would mean missing the incoming x and y coordinates entirely. I understood that this meant writing separate createCSV and newRow methods in ppo2plotter.py, with newRow being called every time the coordinates update. The methods ended up looking like this:

def createCSV(plotdir, num):
    # Start a fresh CSV (header only) for update number `num`.
    with open(('%s/up%.5i.csv' % (plotdir, num)), 'w', newline='') as csvfile:
        fieldnames = ['timesteps', 'xpos', 'ypos']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
    print('Update %i CSV created at %s' % (num, plotdir))

def newRow(plotdir, num, timesteps, xpos, ypos):
    # Append one row of (timestep, x, y) to the CSV for update `num`.
    with open(('%s/up%.5i.csv' % (plotdir, num)), 'a', newline='') as csvfile:
        fieldnames = ['timesteps', 'xpos', 'ypos']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow({'timesteps': timesteps, 'xpos': xpos, 'ypos': ypos})

As it turns out, much of ppo2ttifrutti.py's learn method operates at the level of the aforementioned updates rather than altering values at every step. The per-step work is handled by the file's Runner class, which learn invokes through its run method. Therefore, to create a CSV for each update and safely append a new row every time x and y change, I called newRow within Runner's run method and createCSV within learn.

class Runner(AbstractEnvRunner):
...
    def run(self, update, plotdir):    # now takes the update number and plot directory
        ...
        for s in range(self.nsteps):
            ...
            self.obs[:], rewards, self.dones, infos = self.env.step(actions)
            print(infos)
            info = infos[0]
            plot.newRow(plotdir, update, s, info['x'], info['y'])    # one row per step
...

def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
          vf_coef=0.5, max_grad_norm=0.5, gamma=0.99, lam=0.95,
          log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
          save_interval=0, load_path=None):
    ...
    runner = Runner(env=env, model=model, nsteps=nsteps,
                    total_timesteps=total_timesteps, gamma=gamma, lam=lam)
    ...
    nupdates = total_timesteps//nbatch
    for update in range(1, nupdates+1):
        plot.createCSV(plotdir, update)    # fresh CSV at the start of each update
...

This format allowed a CSV to be created at C:/Users/Admin/AppData/Local/Temp/openai-YYYY-MM-DD-hh-mm-ss-######/plots/ and updated with a new row of timestep, x, and y values.

Results and observations

These results were plotted using the aforementioned genPlot method, which requires the pandas package on top of the csv and matplotlib.pyplot packages. genPlot reads the CSV file for the given update and plots its xpos and ypos columns against timesteps.

I used an agent pre-trained on 60 updates in Sonic 1's Green Hill Zone Act 2.

[image: x and y coordinates plotted against timesteps for the pre-trained agent]

As shown, the agent swiftly maneuvers through the level until about 200 timesteps. From that point on, with occasional dips coming from attempted backtracking, x fluctuates somewhere around 2300 and y around 670. We are able to figure out that the agent got stuck, particularly at the obstacle in pink:

[image: screenshot of the Green Hill Zone Act 2 obstacle, highlighted in pink, where the agent gets stuck]

From the graph, we can deduce that the erratic y values at the point where the agent got stuck represent jumping. Once the agent learns to overcome this obstacle, we should expect to see fewer such "jumps" and more of a single dip down and back up if the agent ever gets stuck there again. NB: interestingly enough, y seems to get larger the lower the agent is in the level; presumably the level's coordinate system, like most 2D graphics coordinate systems, has its origin at the top with y increasing downward.

Improvements

Obviously, there is still some room for improvement, particularly with the aforementioned issue regarding episodes. A potential fix would be to call createCSV every time x and y match a level's starting coordinates, but this would mean not only knowing the distinct starting coordinates of every single level (e.g. Sonic 1's Green Hill Zone Act 1 starts at (80, 944), while Act 2 starts at (80, 252)) but also ensuring that one level's starting coordinates do not collide with points along another level's path. The other area in need of improvement is the plotting itself, both in how it is invoked and in its output (particularly by setting fixed axis limits rather than limits dependent on the CSV values).
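
For the plotting side specifically, fixed axis limits could be a small change to genPlot; here is a sketch (the default limit values are placeholders, not measured level bounds):

import matplotlib.pyplot as plt
import pandas as pd

def genPlotFixed(num, plotdir, xlim=(0, 4500), ylim=(0, 10000)):
    # genPlot variant with fixed axis limits so plots from different updates
    # are directly comparable; the default limits are placeholders.
    df = pd.read_csv('%s/up%.5i.csv' % (plotdir, num))
    plt.plot('timesteps', 'xpos', data=df, color='red', linewidth=2, label='xpos')
    plt.plot('timesteps', 'ypos', data=df, color='blue', linewidth=2, label='ypos')
    plt.xlim(*xlim)
    plt.ylim(*ylim)
    plt.title('Update %i' % num)
    plt.xlabel('Timesteps')
    plt.ylabel('Value')
    plt.legend()
    plt.show()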


Edit: the graph should be labelled "Update 2," not "Episode 2."

nazaruka commented 5 years ago

I worked on this a little more and managed to get a solution dependent on the initial x and y coordinates working, but there were issues.

It seems as though the best course of action would be to run this on an agent that has been trained considerably (say, after 300 or so updates) and to add a condition that prevents new CSV files from being created during the first few timesteps (save for, of course, the initial one).
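
A minimal sketch of such a condition (the spawn coordinates and the timestep threshold here are assumptions, not measured values):

def should_start_new_csv(step, x, y, spawn=(80, 252), min_steps=30):
    # Open a new CSV at the very first step, or whenever Sonic is back at the
    # spawn point after the first few timesteps of an update have passed.
    return step == 0 or ((x, y) == spawn and step > min_steps)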

nazaruka commented 5 years ago

Otherwise, the episode plotting seems to work. Here's a sample plot from running the same agent on Green Hill Zone Act 1 (using Act 2 knowledge). You can see that it stops plotting at the timestep where Sonic hit a spike and died, and that the curves are nowhere near as steep as in the Green Hill Zone Act 2 graph above.

[image: x and y coordinates over timesteps for the Green Hill Zone Act 1 run]

schrum2 commented 5 years ago

You're currently plotting x and y separately with time implicitly on the x axis. What I want is a plot of (x,y). Basically, the plot should depict Sonic's travel path, and the line can move forward and backward. If you want to track time, you can have the color intensity of the line change across time, though this would just be a bonus.
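
One possible way to do this in ppo2plotter.py (genPathPlot is just a suggested name, and the time coloring is done with a scatter overlay rather than a true gradient line):

import matplotlib.pyplot as plt
import pandas as pd

def genPathPlot(num, plotdir):
    # Plot Sonic's travel path as (x, y), with points colored by timestep so
    # time is still visible; the line can move forward and backward freely.
    df = pd.read_csv('%s/up%.5i.csv' % (plotdir, num))
    plt.plot(df['xpos'], df['ypos'], color='lightgray', linewidth=1)
    sc = plt.scatter(df['xpos'], df['ypos'], c=df['timesteps'], cmap='viridis', s=5)
    plt.colorbar(sc, label='Timesteps')
    plt.gca().invert_yaxis()   # y appears to grow downward in the level, so flip the axis
    plt.title('Update %i travel path' % num)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()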

schrum2 commented 5 years ago

The behavior characterization developed here can be used in our code for #24. Attention should be focused on that issue instead of this one; we will implement Behavioral Diversity instead of Novelty Search. So, I'm going to close this issue.