ikostrikov / pytorch-a3c

PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".
MIT License

Performance with Breakout #3

Closed dylanthomas closed 7 years ago

dylanthomas commented 7 years ago

Have you trained Breakout with your A3C by any chance? I wonder what kind of scores you have gotten.

John

ikostrikov commented 7 years ago

It's not as good as DeepMind's implementation.

After several hours of training it gets a reward of around 300 and stops there.

pfrendl commented 7 years ago

Have you tried the RMSProp optimizer with shared parameters (this is what the authors use) instead of Adam?

ikostrikov commented 7 years ago

No, but it should be relatively easy to try.
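
For reference, a minimal sketch of what such a shared RMSProp could look like, in the spirit of the shared Adam optimizer this repo already uses. The class name, the defaults, and the decision to share only the squared-gradient statistics are assumptions for illustration, not the paper's exact recipe.

import torch


class SharedRMSprop(torch.optim.Optimizer):
    # Sketch: RMSprop whose running statistics live in shared memory,
    # so every worker process updates the same second-moment estimates.
    def __init__(self, params, lr=7e-4, alpha=0.99, eps=0.1):
        defaults = dict(lr=lr, alpha=alpha, eps=eps)
        super(SharedRMSprop, self).__init__(params, defaults)
        for group in self.param_groups:
            for p in group['params']:
                # One running average of squared gradients per parameter,
                # allocated up front and moved into shared memory.
                self.state[p]['square_avg'] = torch.zeros_like(p.data).share_memory_()

    def step(self, closure=None):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                square_avg = self.state[p]['square_avg']
                # g <- alpha * g + (1 - alpha) * grad^2
                square_avg.mul_(group['alpha']).addcmul_(grad, grad, value=1 - group['alpha'])
                # theta <- theta - lr * grad / sqrt(g + eps)
                p.data.addcdiv_(grad, square_avg.add(group['eps']).sqrt_(), value=-group['lr'])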

dylanthomas commented 7 years ago

On this issue, are you aware of this discussion (https://github.com/dennybritz/reinforcement-learning/issues/30) ?

[It's about a DQN / TensorFlow performance issue, but the guess is that the TensorFlow A3C performance issues have the same causes.]

There, cgel suggests that the following makes the difference:

Important stuff:

- Normalise input to [0, 1]
- Clip rewards to [0, 1]
- Don't tf.reduce_mean the losses in the batch; use tf.reduce_max
- Initialise the network properly with Xavier init
- Use the optimizer that the paper uses; it is not the same RMSProp as in TF

Not really sure how important:

They count steps differently: if the action repeat is 4, then they count 4 steps per action. So divide all pertinent hyper-parameters by 4.

Little difference (at least in breakout):

- Pass the terminal flag when a life is lost
- gym vs. alewrap: the learning rate is different, but if one works, so will the other

Of the important stuff, what is incorporated into your code?
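
For reference, here is a rough sketch of how the first few "important stuff" items might look when translated to PyTorch/NumPy. None of this is taken from the repos discussed above; the function names are illustrative, and the clipping range follows the DQN/A3C papers' [-1, 1] convention (which coincides with [0, 1] for Breakout, whose rewards are non-negative).

import numpy as np
import torch.nn as nn


def preprocess_frame(frame):
    # Normalise raw Atari pixels from [0, 255] into [0, 1].
    return frame.astype(np.float32) / 255.0


def clip_reward(reward):
    # Clip the raw game reward before it enters the update.
    return float(np.clip(reward, -1.0, 1.0))


def xavier_init(module):
    # Xavier/Glorot initialisation for conv and linear layers;
    # apply with model.apply(xavier_init).
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)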

ikostrikov commented 7 years ago

Everything except the optimizer. But I posted a link to the same one as in DM's paper in the description of the repo. At the moment, I'm working on a different project and don't have time to try the correct one but I will gladly accept a pull request :)

Also, from their discussion it looks like there is a typo there, and they mean reduce_sum instead of reduce_max.
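
In PyTorch terms that point amounts to summing the per-step losses of an n-step rollout instead of averaging them, so the gradient scale does not shrink with the rollout length. A sketch, with step_policy_losses as a hypothetical list of per-step loss tensors rather than a name from any repo:

policy_loss = sum(step_policy_losses)  # sum over the rollout...
# ...not sum(step_policy_losses) / len(step_policy_losses)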

dylanthomas commented 7 years ago

Wonderful. Thank you.
Just one more question: in terms of parameter settings, are they the same as these: https://github.com/muupan/async-rl/wiki ?

ikostrikov commented 7 years ago

No, I decided to use the parameters from the OpenAI starter agent.

dylanthomas commented 7 years ago

oki doki. Many thanks.

ghost commented 7 years ago

@ikostrikov have you tried to use your meta-optimizer inside train.py, initializing and sharing its optimizer from main.py? Just an idea I thought you might have tried for Pong.

ikostrikov commented 7 years ago

Not yet; I may try it in the future. In my experience, for a fixed model and a fixed dataset a meta-optimizer tends to overfit. However, that's probably not a problem for Atari.

ghost commented 7 years ago

@ikostrikov thanks for the heads-up on how the meta-optimiser performs!!!

Regarding models, XNOR-Net looks promising. As far as I know it hasn't been ported over to PyTorch yet (they released it in Torch last year). But if I do convert it, I'll let you know if I get it working with your meta-optimiser, and how it performs.

ikostrikov commented 7 years ago

@dylanthomas

Update: after 10 hours with 16 threads it achieves a reward of >400.

dylanthomas commented 7 years ago

@ikostrikov Super !!

IbrahimSobh commented 7 years ago

A3C and Breakout:

How did you get a reward > 400? (Using the same code, or did you make some changes?)

I want to run some code and get a reward > 400; what should I do?

Regards

ikostrikov commented 7 years ago

The same code.

10 hours with 16 threads on a Xeon 2650 v4.

IbrahimSobh commented 7 years ago

Thank you

How many frames did you process in 10 hours?

I will clone the code again from:

https://github.com/dennybritz/reinforcement-learning/tree/master/PolicyGradient/a3c

and try .... correct?

This should replicate DeepMind's results, correct?

ikostrikov commented 7 years ago

The numbers are for my code, not from that repo.

I'm not sure whether it's physically possible to replicate DeepMind's results.

IbrahimSobh commented 7 years ago

I am very sorry for asking many questions ...

How many frames did you process in 10 hours?

Is this your code? https://github.com/ikostrikov/pytorch-a3c

Do you have any clue why this repo does not work as expected?! (Rewards are around 30 to 35 for Breakout?!)

Why are you not sure whether it's physically possible to replicate DeepMind's results?

ikostrikov commented 7 years ago

I didn't count the number of frames.

Yes.

Which one? If you mean the one referenced above then I don't know. It's extremely difficult to get good results from A3C.

Because A3C is extremely sensitive to hyper-parameters (even to the random seed). DeepMind ran a massive grid search to find the best hyper-parameters. Then, in evaluation, they ran 50 trials with fixed hyper-parameters for each game and averaged the top 5 performances. It's rather difficult to replicate that.

ypxie commented 7 years ago

@dylanthomas, I find that the current repo does not learn as expected. Did you make it work?

Time 00h 08m 56s, episode reward -2.0, episode length 106
Time 00h 09m 58s, episode reward -2.0, episode length 111
Time 00h 11m 04s, episode reward -2.0, episode length 113
Time 00h 12m 08s, episode reward -2.0, episode length 104
Time 00h 13m 13s, episode reward -2.0, episode length 111
Time 00h 14m 17s, episode reward -2.0, episode length 107
Time 00h 15m 22s, episode reward -2.0, episode length 110
Time 00h 16m 26s, episode reward -2.0, episode length 105
Time 00h 17m 31s, episode reward -2.0, episode length 104
Time 00h 18m 37s, episode reward -3.0, episode length 156
Time 00h 19m 44s, episode reward -3.0, episode length 156
Time 00h 21m 13s, episode reward -21.0, episode length 764
Time 00h 22m 43s, episode reward -21.0, episode length 764
Time 00h 24m 07s, episode reward -21.0, episode length 764
Time 00h 25m 15s, episode reward -4.0, episode length 179
Time 00h 26m 44s, episode reward -21.0, episode length 764
Time 00h 28m 02s, episode reward -11.0, episode length 425
Time 00h 29m 36s, episode reward -21.0, episode length 764
Time 00h 31m 29s, episode reward -21.0, episode length 1324
Time 00h 32m 58s, episode reward -21.0, episode length 764
Time 00h 34m 30s, episode reward -21.0, episode length 764
Time 00h 36m 01s, episode reward -21.0, episode length 764
Time 00h 37m 30s, episode reward -21.0, episode length 764
Time 00h 39m 32s, episode reward -21.0, episode length 1324

ikostrikov commented 7 years ago

How many threads did you use?

ypxie commented 7 years ago

@ikostrikov Thanks for your quick reply. I am using 8 threads

ikostrikov commented 7 years ago

Did you run with OMP_NUM_THREADS=1 ?

ypxie commented 7 years ago

@ikostrikov Wow, maybe that's the reason. I didn't notice this. Why is this important?

ikostrikov commented 7 years ago

Otherwise it uses multiple cores for OMP within a thread.
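
For reference, the usual launch pattern puts the variable on the command line; the environment name and flag names below are assumptions, so check main.py and the README for the actual arguments:

OMP_NUM_THREADS=1 python main.py --env-name "Breakout-v0" --num-processes 16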

ypxie commented 7 years ago

Do you think it is necessary to add:

model.zero_grad()

at https://github.com/ikostrikov/pytorch-a3c/blob/master/train.py#L108 ?

ypxie commented 7 years ago

@ikostrikov Thanks! I will try it again~

ikostrikov commented 7 years ago

optimizer.zero_grad() already zeros the gradients.

ypxie commented 7 years ago

@ikostrikov but it only zeros the gradients of the shared_model, right?

ikostrikov commented 7 years ago

https://github.com/ikostrikov/pytorch-a3c/blob/master/train.py#L18 They share gradients within this thread.
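
To put the pieces together, the worker's update roughly follows this order (a paraphrase of the structure of train.py, with the value-loss coefficient shown as an illustrative 0.5 rather than the repo's exact setting):

optimizer.zero_grad()                     # zeros the grads of the shared model's parameters
loss = policy_loss + 0.5 * value_loss     # combined actor and critic loss
loss.backward()                           # gradients accumulate on the local model
ensure_shared_grads(model, shared_model)  # point the shared params at the local grads
optimizer.step()                          # lock-free update of the shared weights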

ypxie commented 7 years ago

@ikostrikov many thanks!
But I still do not get why using multiple cores for OMP within a thread would be a problem. Won't that make the algorithm faster?

ikostrikov commented 7 years ago

It will make some threads sequential and you will effectively collect less data.

ypxie commented 7 years ago

@ikostrikov I see; do you mean that all the threads are very likely to end up processing the same frame, and are thus useless?

ikostrikov commented 7 years ago

I think it happens for many reasons. Just try to run it this way :)

ypxie commented 7 years ago

Thanks! This has really bothered me for the last two days. Do you think this issue is with PyTorch or with general Python code? Because I've seen many TensorFlow implementations that do not impose this constraint.

ikostrikov commented 7 years ago

I think it's just the way multiprocessing is organized in PyTorch. The authors of PyTorch would have a better answer. I also found this surprising.

ypxie commented 7 years ago

Update: it starts learning this time :D You saved my day. Thank you! @apaszke is it true that OMP_NUM_THREADS=1 is necessary to run multi-threaded PyTorch code? Thanks!

ypxie commented 7 years ago

@ikostrikov When doing optimizer.step(), I notice that there is no mutex to protect the shared_model weights; do you think it is safe?

ikostrikov commented 7 years ago

In DM's paper they say they perform async updates without locks.

ypxie commented 7 years ago

Oh I see, thanks!

ypxie commented 7 years ago

Will the following code leave the shared_model grads bound to only one local model? Because shared_model's grads will not be None after running the following function.

def ensure_shared_grads(model, shared_model):
    # Walk the local and shared parameters in parallel.
    for param, shared_param in zip(model.parameters(), shared_model.parameters()):
        # If the shared model already has gradient tensors, there is nothing to do.
        if shared_param.grad is not None:
            return
        # Otherwise point the shared parameter's gradient at the local one.
        shared_param._grad = param.grad