miyosuda / unreal

Reinforcement learning with unsupervised auxiliary tasks

The implementation? #3

Closed pengsun closed 7 years ago

pengsun commented 7 years ago

Hi, thanks for posting the code, really nice work!

I remember that when I checked this repository a few weeks ago, the readme.md said it did not work well - even worse than the base A3C on the Atari game "Breakout".

Now the code seems to work well. Could you comment on what improvements you made to manage this? Any technical details/tricks to share? Any pitfalls to avoid? Thanks so much!!

miyosuda commented 7 years ago

@pengsun Thank you for checking this repo.

First I tried Atari Breakout, and the score was not good, but after that I tried the 3D maze environment with DeepMind Lab, and there I could see learning results.

I just changed the environment to DeepMind Lab and adjusted the hyperparameters for the 3D environment (like PIXEL_CHANGE_LAMBDA).
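For reference, PIXEL_CHANGE_LAMBDA only scales the pixel-control auxiliary loss before it is added to the base A3C loss, roughly like this (a minimal sketch with illustrative names and values, not the exact code in model.py):

```python
# Rough sketch of where PIXEL_CHANGE_LAMBDA enters the objective
# (illustrative names/values, not the exact variables in model.py).
PIXEL_CHANGE_LAMBDA = 0.05  # example value; needs tuning per environment

def total_loss(a3c_loss, pixel_change_loss, value_replay_loss, reward_prediction_loss):
    # The auxiliary losses are added to the base A3C loss; the lambda controls
    # how strongly the pixel-control task influences the shared network.
    return (a3c_loss
            + PIXEL_CHANGE_LAMBDA * pixel_change_loss
            + value_replay_loss
            + reward_prediction_loss)
```

So going from Atari to the 3D maze mostly meant re-tuning this kind of weighting rather than changing the algorithm.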

And I still don't know why the learning results with Atari were bad...

pengsun commented 7 years ago

Hmm... it's a bit weird, because it should work well with Atari games according to the paper. Thanks so much for your reply! Maybe you could share your experience once you figure it out :)

djl11 commented 7 years ago

Just to add to this discussion, I tried the UNREAL implementation with the following hyper-parameter settings on Breakout, without the auxiliary tasks included:

```python
LOCAL_T_MAX = 5                # repeat step size for each thread, before global network update
RMSP_ALPHA = 0.99              # decay parameter for RMSProp
RMSP_MOMENTUM = 0.0            # momentum parameter for RMSProp
RMSP_EPSILON = 0.1             # epsilon parameter for RMSProp
PARALLEL_SIZE = 8              # parallel thread size

ENV_TYPE = 'gym'               # 'lab' or 'gym' or 'maze' or 'pygame'
ENV_NAME = 'Breakout-v0'

GAMMA = 0.99                   # discount factor for rewards
ENTROPY_BETA = 0.01            # entropy regularization constant
PIXEL_CHANGE_LAMBDA = 0.001    # 0.01 ~ 0.1 for Lab, 0.0001 ~ 0.01 for Gym
EXPERIENCE_HISTORY_SIZE = 2000 # experience replay buffer size

USE_PIXEL_CHANGE = False
USE_VALUE_REPLAY = False
USE_REWARD_PREDICTION = False

MAX_TIME_STEP = 2 * 10**8
SAVE_INTERVAL_STEP = 200 * 1000

GRAD_NORM_CLIP = 40.0          # gradient norm clipping
USE_GPU = True                 # To use GPU, set True

FORCE_LEARNING_RATE = True
LEARNING_RATE = 1.1 * 10**-3
```

I wanted to see if I got similar results to those in DeepMind's original A3C paper, and so I used the original A3C hyperparameters where possible (the only hyperparameter not actually mentioned in the paper at all was RMSP_EPSILON, so I just used your default of 0.1). Also, I kept the experience history size at 2000, as a replay buffer was not used in the original A3C (nor was an LSTM, etc.) as far as I can tell.

As you've mentioned, learning was extremely slow, reaching only a score of ~30 after 90 million training frames, compared to the original A3C, which reached a score of ~400 after 80 million steps on Breakout for all thread counts. Have you had the chance to try this UNREAL code on other Atari games yet? Or do you think there might be something algorithmically flawed in this implementation of UNREAL?

It does seem very surprising that the network would perform so badly with these hyperparameter settings, which, as far as I can tell, essentially emulate the original A3C with just the addition of LSTM and replay buffer (please correct me if I'm wrong).

Otherwise, I will try other Atari games and see if it performs any better on those!

Thanks for sharing this code by the way, it has proven to be extremely useful :)

Cheers, Dan

miyosuda commented 7 years ago

@djl11 Thank you for testing!!

I got similar results to yours with Breakout.

One difference between this UNREAL implementation and the original A3C is that the input here is a single RGB frame (3 channels), while the input to the original A3C is a stack of 4 successive grayscale frames (4 channels).
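For reference, the stacked grayscale input of the original A3C can be produced with something like this minimal numpy sketch (not the code in this repo, which feeds a single 84x84x3 RGB frame):

```python
import numpy as np
from collections import deque

# Minimal sketch of the DQN/A3C-style input: keep the last 4 grayscale
# frames and stack them into an (84, 84, 4) tensor. Illustrative only;
# this repo instead feeds a single (84, 84, 3) RGB frame.
frame_history = deque(maxlen=4)

def to_gray(rgb_frame):
    # Simple luminance conversion; rgb_frame is assumed to be (84, 84, 3).
    return np.dot(rgb_frame[..., :3], [0.299, 0.587, 0.114])

def stacked_observation(rgb_frame):
    frame_history.append(to_gray(rgb_frame))
    while len(frame_history) < 4:              # pad at episode start
        frame_history.append(frame_history[-1])
    return np.stack(frame_history, axis=-1)    # shape (84, 84, 4)
```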

I once tried the 4-frame input for Atari with this UNREAL implementation, but the score was bad (around 30 with Breakout). (At that time, though, I tested with the auxiliary tasks enabled.)

I think I need to check more whether there is something wrong in my implementation or not.

Thank you for the suggestion!

djl11 commented 7 years ago

Good point about the successive frames! Completely forgot about that! I will have a bit of a play around as well, and see if I can emulate DeepMind's performance on the Atari front.

miyosuda commented 7 years ago

@djl11 Thanks!

djl11 commented 7 years ago

I think the poor performance on Breakout was likely due to the use of "Breakout-v0" as opposed to "BreakoutDeterministic-v0", the difference being that the deterministic version automatically applies a deterministic frame skip, as used in the original Deep-Q and A3C methods. This thread explains the differences. Action repeats can also be incorporated, and I found this issue useful. Frame skips and action repeats were used in the original A3C, so perhaps with the deterministic version and your untouched UNREAL code it will work better. I shall let you know if it does!
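For anyone following along, the manual action-repeat idea is roughly the following (a sketch against the old gym API that these env ids belong to; names are illustrative, not from gym_environment.py). As far as I understand, the Deterministic variant applies a fixed skip of 4 internally, whereas plain Breakout-v0 samples the skip randomly:

```python
import gym

# Sketch of a manual action repeat on top of the non-deterministic env
# (old gym API; illustrative only).
ACTION_REPEAT = 4
env = gym.make("Breakout-v0")

def repeated_step(action):
    # Repeat the chosen action ACTION_REPEAT times, summing the rewards,
    # and stop early if the episode ends.
    total_reward = 0.0
    for _ in range(ACTION_REPEAT):
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done, info
```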

pengsun commented 7 years ago

Hi @djl11, any updates on your experiment?

djl11 commented 7 years ago

Working on it today. Pong trained optimally in only a few hours last night; I did this by using the NoFrameskip-v3 environment, adding RGBtoGray and frame-stacking functions, and changing the network to accept 4 channels (for the frame stack) instead of 3 (for a single RGB frame). I'm playing around with a few other parameters at the moment, and will test Breakout later and submit a pull request if it's all working well.
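In case it's useful to others, the network-side part of that change amounts to just the input depth of the first conv layer, roughly like this (an illustrative TF1-style sketch, not the exact code in model.py):

```python
import tensorflow as tf

# Sketch of the network-side change only: the input placeholder goes from
# 3 channels (single RGB frame) to 4 (stacked grayscale frames).
# Illustrative TF1-style code, not the exact layers in model.py.
STACK_SIZE = 4  # was 3 for the single-frame RGB input

state_input = tf.placeholder(tf.float32, [None, 84, 84, STACK_SIZE])
conv1 = tf.layers.conv2d(state_input, filters=16, kernel_size=8,
                         strides=4, activation=tf.nn.relu)
conv2 = tf.layers.conv2d(conv1, filters=32, kernel_size=4,
                         strides=2, activation=tf.nn.relu)
```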

djl11 commented 7 years ago

I tried my code on BreakoutNoFrameskip-v3, both with all auxiliary tasks turned off and with them all turned on, and found that the training speed seemed to slow down almost exponentially over time when using all of the auxiliary tasks, reaching only 9 million frames after ~14 hours (averaging 0.64 million per hour).

[screenshot: training curve with all auxiliary tasks enabled]

With the auxiliary tasks turned off, however, it managed 37 million in ~19 hours (averaging 1.94 million per hour).

[screenshot: training curve with auxiliary tasks turned off]

Very confused as to what is going on, as your untouched code (i.e. with 3-channel RGB images and no frame stacking) doesn't show such a drastic slowdown while using all of the auxiliary tasks, reaching 4.3 million frames in ~2.5 hours (averaging 1.72 million per hour).

[screenshot: training curve for the untouched code with auxiliary tasks enabled]

Essentially, all I changed in model.py was the input depth to the networks, from 3 channels to 4, and I made some changes in gym_environment.py. Feel free to have a look at the code if you want.

Unfortunately, I'm not going to be able to have a look at this again for a while now, got a lot of PhD work due very soon, but I hope to investigate further at some point. Sorry for not being able to help more!

miyosuda commented 7 years ago

@djl11 Thank you for reporting!

> Very confused as to what is going on, as your untouched code doesn't show such a drastic slowdown while using all of the auxiliary tasks,

Hmm strange.. I'll take a look. Thanks for testing.

kkhetarpal commented 6 years ago

Could you please point to the commit where frame stacking is being used, @djl11 @miyosuda? Thank you in advance for your reply.