nazaruka / gym-http-api

NSGA2-based Sonic agent + experimental code
MIT License

Verify working of PPO within the NSGA II code #27

Closed: schrum2 closed this issue 5 years ago

schrum2 commented 5 years ago

I'm going to take care of this. I made a separate branch to deal with it: dev_schrum_ppo. The goal is to make sure that we didn't break PPO by moving it into this code base.

schrum2 commented 5 years ago

Ran this overnight and the results aren't right. Sonic keeps blindly rushing forward into an enemy and dying. He's not even exploring anymore ... what happened to epsilon? I'll need to keep messing with this.

schrum2 commented 5 years ago

In light of some small tweaks made by both Alex and me, I'm planning on running this again tonight. However, I went ahead and deleted the dev_schrum_ppo branch, since dealing with the merge conflict was a hassle; I just incorporated some changes into dev_schrum instead. The small tweaks I make for testing purposes on this issue probably don't need to be committed ... unless I uncover a major issue.

schrum2 commented 5 years ago

This is weird. I tried training again, and had more of a chance to watch the agent. I saw some impressive things ... Sonic made a lot of progress early on, and even beat the level! However, as time went on, Sonic's behavior eventually converged to fairly consistent bad behavior. That's not how learning should work! Specifically, the evaluations keep resulting in a score of about 649, which corresponds to running straight into the first robot and dying. Sometimes Sonic avoids death here, but then seems to consistently get stuck pushing a wall and getting a fitness of about 1977.

Failing to improve is not too unusual, but I am a bit shocked that the behavior should converge on something consistently bad. So what is the issue?

The epsilon exploration should encourage random behavior that gets Sonic out of being stuck. Maybe this actually is happening ... the behavior is not completely identical every time. But then why is it so consistent and converged? Is the main reason for the early random behavior a lack of knowledge about which actions to take, rather than the epsilon? The epsilon value is fairly small.

I think the most important thing to try next is to return to the original PyTorch PPO code and see what happens when running it for a long time. Additionally, we should look closely at the hyperparameters used in the 3rd place TensorFlow agent and see how they differ from ours.

schrum2 commented 5 years ago

I ran the original PPO code in gym-http-api\pytorch-a2c-ppo-acktr-gail again (by executing launch.bat) overnight, and confirmed that the problem is with PPO and not with how we copied the code. The agent converges to regularly receiving a total reward of 649.8 and rushing into the first robot and dying.

So, the next thing to investigate is the hyperparameters. Look at the hyperparameters used with the TensorFlow version of PPO and change the corresponding hyperparameter settings in launch.bat. In fact, there may even be some hyperparameters where we are simply using the default value defined in the arguments, and those defaults may be insufficient.

nazaruka commented 5 years ago

Our agent:

learning_rate = 2.5e-4
epsilon = 1e-5
agent = ppo.PPO(
    actor_critic,
    clip_param=0.1,
    ppo_epoch=4,
    num_mini_batch=1,
    value_loss_coef=0.5,
    entropy_coef=0.01,
    lr=learning_rate,
    eps=epsilon,
    max_grad_norm=0.5)

In ppo2ttifrutti_agent.py, we find:

ppo2ttifrutti.learn(policy=policies.CnnPolicy,
                            env=DummyVecEnv([env.make_custom]),
                            nsteps=4096,
                            nminibatches=4,
                            lam=0.95,
                            gamma=0.99,
                            noptepochs=4,
                            log_interval=1,
                            ent_coef=0.01,
                            lr=lambda _: 7.5e-5,
                            cliprange=lambda _: 0.1,
                            total_timesteps=int(1e7),
                            save_interval=10,
                            ...)

Looking through ppo2ttifrutti.py, we notice two additional pieces of code: the clipped value-loss calculation and the Adam optimizer setup.

We find PyTorch statements equivalent to both in helpers\ppo.py: value_loss = 0.5 * torch.max(value_losses, value_losses_clipped).mean() and self.optimizer = optim.Adam(actor_critic.parameters(), lr=lr, eps=eps), respectively. Note that we declared our agent's epsilon value as 1e-5 = 0.00001.

lam=0.95 is taken care of through rollouts.compute_returns(next_value, use_gae=True, gamma=0.99, gae_lambda=0.95, use_proper_time_limits=True) in NSGAII.py.

It seems our main concern is the learning rate and the number of mini-batches.
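
For reference, here is a minimal sketch of what matching those two settings would look like on our side. It reuses the ppo.PPO constructor call quoted above; only the learning rate and mini-batch count are changed, and the values are the ones from ppo2ttifrutti_agent.py, so whether they transfer cleanly to our setup still needs to be tested:

# Sketch only: same constructor as above, with lr and num_mini_batch
# taken from the TensorFlow agent's settings.
learning_rate = 7.5e-5   # was 2.5e-4; the TF agent uses lr=lambda _: 7.5e-5
epsilon = 1e-5           # Adam epsilon, unchanged
agent = ppo.PPO(
    actor_critic,
    clip_param=0.1,
    ppo_epoch=4,          # matches noptepochs=4
    num_mini_batch=4,     # was 1; the TF agent uses nminibatches=4
    value_loss_coef=0.5,
    entropy_coef=0.01,    # matches ent_coef=0.01
    lr=learning_rate,
    eps=epsilon,
    max_grad_norm=0.5)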

nazaruka commented 5 years ago

As Borghi writes in his repo's README:

Compared to the original PPO2 baseline, besides starting from a pretrained model instead of learning from scratch, at test time the agent uses 96x96 inputs, a different CNN, smaller batch size and learning rate, and uses reward reshaping.

Our CNN is identical to his, but our learning rate and mini-batch settings differ: our learning rate is greater by a factor of roughly three, and we split each batch into fewer mini-batches (1 versus 4). We also do not warp the frames to 96x96 or "reshape our rewards," both of which are handled as follows in ppo2ttifrutti_sonic_env.py:

env = make(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1')
env = SonicDiscretizer(env)
if scale_rew:
    env = RewardScaler(env)
env = WarpFrame96(env)
if stack:
    env = FrameStack(env, 4)
env = AllowBacktracking(env)
return env

At the moment, our modified helpers\env.py does not appear to wrap Genesis games in such a fashion. We could just call wrap_deepmind(env) on it, but that code is quite different:

if episode_life:
    env = EpisodicLifeEnv(env)
if 'FIRE' in env.unwrapped.get_action_meanings():
    env = FireResetEnv(env)
env = WarpFrame(env)
if scale:
    env = ScaledFloatFrame(env)
if clip_rewards:
    env = ClipRewardEnv(env)
if frame_stack:
    env = FrameStack(env, 4)
return env
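
If we did want helpers\env.py to wrap Genesis games the way ppo2ttifrutti_sonic_env.py does, a rough sketch might look like the following. The function name is hypothetical, and the import assumes the wrapper classes can be pulled in from Borghi's module (or copied into ours); it simply mirrors the Sonic wrapping quoted earlier:

# Hypothetical helper, not existing code; the import path may differ in our tree.
from ppo2ttifrutti_sonic_env import (SonicDiscretizer, RewardScaler,
                                     WarpFrame96, FrameStack, AllowBacktracking)

def wrap_sonic_genesis(env, scale_rew=True, stack=True):
    env = SonicDiscretizer(env)   # Sonic-specific discrete action set
    if scale_rew:
        env = RewardScaler(env)   # the reward scaling Borghi refers to
    env = WarpFrame96(env)        # 96x96 observations instead of the usual 84x84
    if stack:
        env = FrameStack(env, 4)  # stack the last 4 frames
    env = AllowBacktracking(env)  # don't punish the agent for briefly moving left
    return env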

Additionally, by leaving our Sonic environment as-is, we miss the frame-skip aspect of the Sonic benchmark. OpenAI's "Gotta Learn Fast" states at the end of page four:

The step() method on raw gym-retro environments progresses the game by roughly 1/60th of a second. However, following common practice for ALE environments, we require the use of a frame skip [16] of 4. Thus, from here on out, we will use timesteps as the main unit of measuring in-game time. With a frame skip of 4, a timestep represents roughly 1/15th of a second. We believe that this is more than enough temporal resolution to play Sonic well.

Reference [16] links to Mnih, Kavukcuoglu et al.'s "Human-level control through deep reinforcement learning," which emphasizes the importance of the frame-skipping technique:

Following previous approaches to playing Atari 2600 games, we also use a simple frame-skipping technique. More precisely, the agent sees and selects actions on every k-th frame instead of every frame, and its last action is repeated on skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games.
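
For reference, the frame-skip mechanism itself is small. Here is a minimal, hypothetical sketch of an action-repeat wrapper with k=4 (not code from either repository; the baselines-style wrappers additionally max-pool the last two frames, which is omitted here):

import gym

class SkipFrames(gym.Wrapper):
    # Hypothetical sketch: repeat each chosen action for `skip` emulator frames.
    def __init__(self, env, skip=4):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = False
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)  # last action repeated
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)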

To conclude, while the PPO agent does seem to underperform quite significantly, its performance is affected by more than just the hyperparameters, particularly by how the environment is handled.

schrum2 commented 5 years ago

Try changing the code in gym-http-api\pytorch-a2c-ppo-acktr-gail to reflect these differences and test it out.

schrum2 commented 5 years ago

I have been messing with the launch.bat file to match the PPO parameters better: 697da647b7a5d886b0031b8d6ab99bba8ac59eff

schrum2 commented 5 years ago

I'm convinced that the code in pytorch-a2c-ppo-acktr-gail works using launch.bat thanks to the hyperparameters defined there. These were mostly taken from the 3rd place PPO entry in TensorFlow.

I can almost close this issue, but I want to make sure that learning also works when running NSGA2 with better hyperparameters ... that will come next.

schrum2 commented 5 years ago

Had to tweak code in NSGA II to get it to run, but testing it in CPU mode is too slow. I need to run BaldwinianExample.bat from home.

schrum2 commented 5 years ago

The starting policies produced by evolution seem more inclined to get stuck in bad behavior than the starting policies of the randomly initialized neural networks. How are the networks randomly initialized? Compare this with randomly initialized genomes.
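
One quick way to answer the initialization question, purely as a diagnostic sketch (it assumes an actor_critic object built as in the snippet earlier in this thread), is to dump the weight statistics of a freshly constructed network and compare them with the values a freshly generated genome produces:

# Diagnostic sketch, not committed code: print per-layer statistics of a
# fresh network so they can be compared against a randomly initialized genome.
def describe_params(module):
    for name, p in module.named_parameters():
        print("{}: shape={}, mean={:.4f}, std={:.4f}".format(
            name, tuple(p.shape), p.mean().item(), p.std().item()))

describe_params(actor_critic)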

schrum2 commented 5 years ago

Commit above added the necessary option.