It's not as good as DeepMind's implementation.
After several hours of training it gets reward around 300 and stops there.
Have you tried the RMSProp optimizer with shared parameters (this is what the authors use) instead of Adam?
No, but it should be relatively easy to try.
On this issue, are you aware of this discussion (https://github.com/dennybritz/reinforcement-learning/issues/30) ?
[It's about a DQN/TensorFlow performance issue, but the guess is that the A3C TensorFlow performance issue has the same causes.]
Important stuff (a rough sketch of a few of these follows after this list):
Normalise the input to [0, 1].
Clip the rewards to [0, 1].
Don't tf.reduce_mean the losses in the batch; use tf.reduce_max.
Initialise the network properly with Xavier init.
Use the optimizer that the paper uses; it is not the same RMSProp as in TF.
Not really sure how important:
They count steps differently: if the action repeat is 4, then they count 4 steps per action, so divide all pertinent hyper-parameters by 4.
Little difference (at least in Breakout):
Pass the terminal flag when a life is lost.
gym vs alewrap.
The learning rate is different, but if one works so will the other.
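For concreteness, here is a minimal sketch of the preprocessing and initialisation items from the list above, written as PyTorch/NumPy helpers; the names and exact constants are illustrative, not taken from any particular repo:

import numpy as np
import torch.nn as nn

def preprocess_frame(frame):
    # Normalise raw Atari pixels from [0, 255] into [0, 1].
    return frame.astype(np.float32) / 255.0

def clip_reward(reward):
    # Clip rewards to a small fixed range (the DQN paper clips to [-1, 1]).
    return float(np.clip(reward, -1.0, 1.0))

def init_xavier(module):
    # Xavier/Glorot initialisation for conv and linear layers.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        nn.init.constant_(module.bias, 0.0)

# Usage: model.apply(init_xavier)
# For the loss, sum over the rollout (the analogue of tf.reduce_sum)
# instead of averaging with tf.reduce_mean.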
Among the important stuff, what is incorporated into your code?
Everything except the optimizer. But I posted a link to the same one as in DM's paper in the description of the repo. At the moment, I'm working on a different project and don't have time to try the correct one but I will gladly accept a pull request :)
Also from their discussion it looks like there is a typo here, and they mean reduce_sum instead of reduce_max.
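For anyone who wants to try the correct optimizer, here is a minimal sketch of a DeepMind-style shared RMSProp in PyTorch; this is not the repo's code, the hyperparameters are only illustrative, and newer PyTorch versions may need small adjustments:

import torch
import torch.optim as optim

class SharedRMSprop(optim.Optimizer):
    # Sketch of a shared-statistics RMSProp: the running average of squared
    # gradients lives in shared memory, so all worker processes read and
    # update the same statistics, as in the A3C paper.
    def __init__(self, params, lr=7e-4, alpha=0.99, eps=0.1):
        defaults = dict(lr=lr, alpha=alpha, eps=eps)
        super(SharedRMSprop, self).__init__(params, defaults)
        for group in self.param_groups:
            for p in group['params']:
                # Second-moment accumulator, placed in shared memory.
                self.state[p]['square_avg'] = torch.zeros_like(p.data).share_memory_()

    def step(self, closure=None):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                square_avg = self.state[p]['square_avg']
                # g = alpha * g + (1 - alpha) * grad^2
                square_avg.mul_(group['alpha']).addcmul_(grad, grad, value=1 - group['alpha'])
                # p = p - lr * grad / sqrt(g + eps)  (eps inside the sqrt)
                p.data.addcdiv_(grad, square_avg.add(group['eps']).sqrt_(), value=-group['lr'])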
Wonderful. Thank you.
Just one more question: in terms of parameter settings, are they the same as these?
https://github.com/muupan/async-rl/wiki
No, I decided to use parameters from the OpenAI starter agent.
oki doki. Many thanks.
@ikostrikov have you tried to somehow use your meta-optimizer inside train.py, and somehow initialize and share its optimizer from main.py? Just an idea that I thought you might have tried for Pong.
Not yet, I may try it in the future. In my experience, for a fixed model and a fixed dataset a meta-optimizer tends to overfit. However, it's probably not a problem for Atari.
@ikostrikov thanks for the heads up on how the meta-optimiser performs !!!
Regarding models, this looks promising: XNOR-Net. As far as I know it's not been ported over to PyTorch yet (they released it in Torch last year)? But if I do convert it, I'll let you know if I get it working with your meta-optimizer and how it performs.
@dylanthomas
Update: After 10h with 16 threads it achieves reward of >400.
@ikostrikov Super !!
A3C and Breakout:
How did you get a reward > 400? (Using the same code, or did you make some changes?)
I want to run some code and get > 400 reward. What should I do?
Regards
The same code.
10 hours with 16 threads on a Xeon 2650 v4.
Thank you
How many frames did you process in 10 hours?
I will clone the code again from:
https://github.com/dennybritz/reinforcement-learning/tree/master/PolicyGradient/a3c
and try .... correct?
This should replicate DeepMind's results, correct?
The numbers are for my code, not from that repo.
I'm not sure whether it's physically possible to replicate DeepMind's results.
I am very sorry for asking many questions ...
How many frames did you process in 10 hours?
Is this your code? https://github.com/ikostrikov/pytorch-a3c
Do you have any clue why this repo does not work as expected?! (Rewards are around 30 to 35 for Breakout?!)
Why are you not sure whether it's physically possible to replicate DeepMind's results?
I didn't count the number of frames.
Yes.
Which one? If you mean the one referenced above then I don't know. It's extremely difficult to get good results from A3C.
Because A3C is extremely sensitive to hyperparameters (even to the random seed). DeepMind ran a massive grid search to find the best hyperparameters. Then, in evaluation, they run 50 trials with fixed hyperparameters for each game and average the top 5 performances. It's rather difficult to replicate that.
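As a toy illustration of that evaluation protocol (the numbers below are made up):

import numpy as np

# 50 trials with fixed hyperparameters; report the mean of the top 5 scores.
trial_scores = np.random.uniform(50, 500, size=50)  # stand-in for 50 final rewards
reported_score = np.sort(trial_scores)[-5:].mean()
print(reported_score)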
@dylanthomas, I find that the current repo does not learn as expected. Did you make it work?
Time 00h 08m 56s, episode reward -2.0, episode length 106
Time 00h 09m 58s, episode reward -2.0, episode length 111
Time 00h 11m 04s, episode reward -2.0, episode length 113
Time 00h 12m 08s, episode reward -2.0, episode length 104
Time 00h 13m 13s, episode reward -2.0, episode length 111
Time 00h 14m 17s, episode reward -2.0, episode length 107
Time 00h 15m 22s, episode reward -2.0, episode length 110
Time 00h 16m 26s, episode reward -2.0, episode length 105
Time 00h 17m 31s, episode reward -2.0, episode length 104
Time 00h 18m 37s, episode reward -3.0, episode length 156
Time 00h 19m 44s, episode reward -3.0, episode length 156
Time 00h 21m 13s, episode reward -21.0, episode length 764
Time 00h 22m 43s, episode reward -21.0, episode length 764
Time 00h 24m 07s, episode reward -21.0, episode length 764
Time 00h 25m 15s, episode reward -4.0, episode length 179
Time 00h 26m 44s, episode reward -21.0, episode length 764
Time 00h 28m 02s, episode reward -11.0, episode length 425
Time 00h 29m 36s, episode reward -21.0, episode length 764
Time 00h 31m 29s, episode reward -21.0, episode length 1324
Time 00h 32m 58s, episode reward -21.0, episode length 764
Time 00h 34m 30s, episode reward -21.0, episode length 764
Time 00h 36m 01s, episode reward -21.0, episode length 764
Time 00h 37m 30s, episode reward -21.0, episode length 764
Time 00h 39m 32s, episode reward -21.0, episode length 1324
How many threads did you use?
@ikostrikov Thanks for your quick reply. I am using 8 threads
Did you run with OMP_NUM_THREADS=1 ?
@ikostrikov Wow, maybe that's the reason. I didn't notice this. Why is this important?
Otherwise it uses multiple cores for OMP within a thread.
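If you'd rather not rely on remembering the environment variable on the command line, a small sketch of doing the equivalent from Python (set it before torch is imported):

import os
os.environ['OMP_NUM_THREADS'] = '1'  # must be set before importing torch

import torch
torch.set_num_threads(1)  # also cap intra-op threads explicitly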
Do you think it is necessary to add:
model.zero_grad()
at https://github.com/ikostrikov/pytorch-a3c/blob/master/train.py#L108 ?
@ikostrikov Thanks!, I will try it again~
optimizer.zero_grad() already zeros the gradients.
@ikostrikov but it only zeros the gradient of the shared_model right?
https://github.com/ikostrikov/pytorch-a3c/blob/master/train.py#L18 They share gradients within this thread.
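For context, the per-worker update sequence being discussed looks roughly like this (illustrative Python, not verbatim repo code):

optimizer.zero_grad()                     # zeros the shared model's grads; after the
                                          # binding below these are the same tensors
                                          # as the local model's grads
loss = policy_loss + 0.5 * value_loss     # illustrative loss combination
loss.backward()                           # gradients land on the local model
ensure_shared_grads(model, shared_model)  # point shared params' .grad at local grads
optimizer.step()                          # apply those grads to the shared weights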
@ikostrikov many thanks!
But I still do not get why using multiple cores for OMP within a thread would be a problem. Won't that make the algorithm faster?
It will make some threads sequential and you will effectively collect less data.
@ikostrikov I see. Do you mean that all the threads are very likely to end up processing the same frame and thus become useless?
I think it happens for many reasons. Just try to run it this way :)
Thanks! This has really bothered me for the last two days. Do you think this issue is with PyTorch or with general Python code? Because I saw that many TensorFlow implementations do not impose this constraint.
I think it's just the way multiprocessing is organized in PyTorch. I think the authors of PyTorch have a better answer; I also found this surprising.
Update: it starts learning this time :D You saved my day. Thank you! @apaszke Is it true that OMP_NUM_THREADS=1 is necessary to run multi-thread PyTorch code? Thanks!
@ikostrikov When doing optimizer.step(), I notice that there is no mutex to protect the shared_model weights. Do you think it is safe?
In DM's paper they say they perform async updates without locks.
Oh I see, thanks!
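The lock-free part comes from the Hogwild!-style setup: the shared model's parameters live in shared memory and every worker steps on them without a mutex. A rough sketch, with hypothetical names (ActorCritic, train, num_processes, num_inputs, action_space):

import torch.multiprocessing as mp

shared_model = ActorCritic(num_inputs, action_space)  # hypothetical constructor
shared_model.share_memory()                           # put parameters in shared memory

processes = []
for rank in range(num_processes):
    # each worker updates shared_model via optimizer.step() with no locking
    p = mp.Process(target=train, args=(rank, shared_model))
    p.start()
    processes.append(p)
for p in processes:
    p.join()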
Will the following code bind the shared_model grads to only one local model? Because shared_model.grad will not be None after running the following function.
def ensure_shared_grads(model, shared_model):
    # Point the shared model's .grad fields at the local model's gradient tensors.
    for param, shared_param in zip(model.parameters(), shared_model.parameters()):
        if shared_param.grad is not None:
            # Gradients were already bound on an earlier call; nothing to do.
            return
        shared_param._grad = param.grad
Have you trained Breakout with your A3C by any chance? I wonder what kind of scores you have gotten.
John