rarilurelo / pytorch_a3c


expected a Variable arg but got numpy.ndarray error #3

Open dylanthomas opened 7 years ago

dylanthomas commented 7 years ago

I am new to PyTorch. I just cloned your code and ran it, but got an error. I hope you can point me in the right direction to fix this issue.

More specifics:

  1. Used a conda env with Python 3.6
  2. Ran 'run_a3c.py' with the default Breakout-v0 env to the end, then ran 'python test_a3c.py --render --monitor --env Breakout-v0'
  3. Got the error message below:

=== File "test_a3c.py", line 71, in test(policy, args) File "test_a3c.py", line 25, in test p, v = policy(o) ... File "/home/john/anaconda3/envs/th/lib/python3.6/site-packages/torch/nn/functional.py", line 37, in conv2d return f(input, weight, bias) if bias is not None else f(input, weight) RuntimeError: expected a Variable argument, but got numpy.ndarray

Could you tell me what the issue(s) could be here?

Many thanks,

John

rarilurelo commented 7 years ago

A torch.nn.Module must take a torch.autograd.Variable, but policy (which is a subclass of torch.nn.Module) was given a numpy.ndarray, so we have to convert the numpy.ndarray to a Variable first. I fixed this problem; see my commit 9e9fb687786a025061561c7260ba9b586e9ca4ce.
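
Below is a minimal sketch of that kind of conversion; the observation shape and the policy(...) call are assumptions based on the traceback above, not code taken verbatim from the repository.

```python
import numpy as np
import torch
from torch.autograd import Variable

# Stand-in for a Gym observation: a stacked-frame array of shape
# (batch, channels, height, width); the exact shape here is an assumption.
o = np.zeros((1, 4, 84, 84), dtype=np.float32)

# nn.Module subclasses expect Variables (wrapping Tensors), not ndarrays,
# so wrap the array before the forward pass.
o_var = Variable(torch.from_numpy(o))
# p, v = policy(o_var)
```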

dylanthomas commented 7 years ago

Many thanks.

On another note, when I ran Breakout-v0, the reward I got after 10M steps was only 30~40. Shouldn't this be around 400 according to DeepMind's paper? I wonder where the difference is coming from... any thoughts/insight on this?

rarilurelo commented 7 years ago

There are some differences between my code and DeepMind's paper. My code has:

  1. no LSTM
  2. no gradient clipping
  3. no hyperparameter tuning (I couldn't find the learning rate in the paper)

That's why the result was not good enough, I think.
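
For reference, here is a hedged sketch of how gradient clipping (point 2 above) could be wired into an update step. The names policy, optimizer, and loss are placeholders rather than identifiers from this repo, and max_norm=40.0 is a value commonly used in A3C reimplementations, not something taken from this code base.

```python
import torch

def clipped_update(policy, optimizer, loss, max_norm=40.0):
    # Placeholder training step: compute gradients for the current loss.
    optimizer.zero_grad()
    loss.backward()
    # Rescale all gradients so their global L2 norm is at most max_norm
    # before applying the update.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm)
    optimizer.step()
```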

dylanthomas commented 7 years ago

Thank you for your reply. Two points --

1. On the parameter settings, are you aware of this wiki (https://github.com/muupan/async-rl/wiki)?

2. On the performance issue of the TensorFlow implementation, have you seen this discussion (https://github.com/dennybritz/reinforcement-learning/issues/30)? It's about DQN, but the same issues are thought to be the root cause on the A3C side as well.

Here cgel suggests the following are the key points:

Important stuff:

  - Normalise input to [0, 1]
  - Clip rewards to [0, 1]
  - Don't tf.reduce_mean the losses in the batch; use tf.reduce_max
  - Initialise the network properly with Xavier init
  - Use the optimizer that the paper uses (it is not the same RMSProp as in tf)

Has your code incorporated all the points above?
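
For illustration, a hedged sketch of the first two quoted items written as a Gym-style preprocessing step; the function and variable names are illustrative and not taken from this repository.

```python
import numpy as np

def preprocess(frame, reward):
    # Normalise input to [0, 1]: Atari frames arrive as uint8 in [0, 255].
    obs = frame.astype(np.float32) / 255.0
    # Clip rewards (the quote says [0, 1]; the DQN/A3C papers clip to
    # [-1, 1], so check which convention the reference code actually uses).
    clipped_reward = float(np.clip(reward, -1.0, 1.0))
    return obs, clipped_reward
```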

ethancaballero commented 7 years ago

@dylanthomas did you try running Breakout-v0 for longer than 10M timesteps to see if the average reward eventually got to >400? For example, it took Muupan's A3C (https://github.com/muupan/async-rl#a3c-ff) 20M timesteps to start reaching >400.

dylanthomas commented 7 years ago

Not yet, but I will run this code for 20M timesteps to see if it goes up to 400. @ethancaballero