miyosuda / async_deep_reinforce

Asynchronous Methods for Deep Reinforcement Learning
Apache License 2.0

Problem while using the code #1

Open originholic opened 8 years ago

originholic commented 8 years ago

Hello @miyosuda,

Thanks for sharing the code. Please ignore the title: I tried out your code on the cart-pole balancing control problem instead of an Atari game, and it works well. But I have a few questions.

I am curious: in the asynchronous paper, they also used another model with one linear layer, one LSTM layer, and a softmax output. I am thinking of using this model to see whether it improves the result. Can you suggest how the LSTM could be implemented in TensorFlow for playing Atari games?

I am also wondering about the accumulated states and rewards being reversed: do you need to reverse the actions and values as well? It did not make any difference when I tried it, I'm just wondering why.

states.reverse()
rewards.reverse()
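
For reference, a minimal sketch of how reversed rewards are typically used to accumulate the return R backwards from the last state (GAMMA, compute_returns, and the bootstrap value are illustrative names, not the repo's exact code):

# Sketch only: accumulate the n-step return backwards from the last state.
GAMMA = 0.99  # assumed discount factor

def compute_returns(rewards, bootstrap_value):
    R = bootstrap_value            # value estimate of the state after the last step
    returns = []
    for r in reversed(rewards):    # walk backwards: R = r_i + GAMMA * R
        R = r + GAMMA * R
        returns.append(R)
    returns.reverse()              # restore chronological order
    return returns

print(compute_returns([0.0, 0.0, 1.0], bootstrap_value=0.0))  # [0.9801, 0.99, 1.0]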

Lastly, do you really need to accumulate the gradients and then apply the update, given that TensorFlow can handle the 'batch' for the update?

miyosuda commented 8 years ago

There has been some progress with learning.

Today I've changed some parameters.

t_max = 5
RMSprop epsilon = 0.1

and I've changed the loss function a bit: the gradient of the critic is now half of what it was before, following muupan's setting. (The learning rate of the critic was already half of the actor's, but I halved it again.)

https://github.com/miyosuda/async_deep_reinforce/commit/40c65d433d0e5fdd7b909319b3d019b7e4d54ac0

tf.nn.l2_loss() means

output = sum(t ** 2) / 2

whereas the loss function I used before was

output = sum(t ** 2)
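
For concreteness, a small sketch of the two variants in TensorFlow (the td placeholder is illustrative, not the repo's exact variable):

import tensorflow as tf

# Hypothetical TD-error tensor, just to show the factor-of-two difference.
td = tf.placeholder(tf.float32, [None])

old_value_loss = tf.reduce_sum(tf.square(td))  # sum(t ** 2): the previous form
new_value_loss = tf.nn.l2_loss(td)             # sum(t ** 2) / 2: halves the critic gradient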

I'm not confident yet, but the actor/critic learning rate ratio might be important.

I've been running for only 13 hours, but the score began increasing with this setting.

screen_shot

I'll keep running this setting to see the result.

joabim commented 8 years ago

@miyosuda I have started the same experiment (from current master) on a single Intel Xeon E5-2680 (though with PARALLEL_SIZE of 8, as in your implementation). I have two Xeons I can utilize; is there any setup you would like to see me try?

miyosuda commented 8 years ago

@joabim It's great to hear that you have a Xeon machine. The Xeon E5-2680 has 8 cores and can run 16 threads, correct? So could you try a PARALLEL_SIZE of 16?

I have two Xeons I can utilize,

Does this mean you have two PCs, each with a single Xeon chip?

Anyway, I would like you to try 16 parallel threads to check whether increasing the parallel size speeds up learning!

As far as my testing with the current master setting goes, the result after 38 hours (18 million steps) looks like this:

capture3

And the movie captured after 24 hours of learning was like this:

https://www.youtube.com/watch?v=cFWL_y9BVaQ

joabim commented 8 years ago

@miyosuda It's even the third revision (12 cores + 12 virtual), so I'm lucky to be able to borrow this machine! Hence, 16 threads should not be a problem. I have just restarted the test with a parallel size of 16; let's see how it performs!

Yes, the second Xeon is in another PC, so I can't promise anything, but it might be possible to run a parallel test if we use a parallel size >12 - perhaps two runs of 20 threads or so.

The results are promising; we are starting to converge! I think we can reach paper-grade results (~20 points after 10 hours of training) soon.

miyosuda commented 8 years ago

@joabim

It's even the third revision (12 cores + 12 virt),

Wow so great!

I'm curious what happens if we run it with 24 threads. When you run with 16 or 24 threads, is it possible to see the usage of each CPU?

When I ran this program with 8 threads, the usage of each CPU was around 80%.

system_monitor

As @originholic suggests, multithreading in Python seems to be inefficient due to the Global Interpreter Lock (GIL).

Most of the learning process is calculated inside TensorFlow in C++, so I'm not sure how much the GIL affects performance, but if the CPU usage is low when we increase the thread count, I will need to replace the threading module I'm using now with the multiprocessing module.

So I would be happy to see the CPU usage in your environment. Thanks!

joabim commented 8 years ago

This is the current CPU usage (about 85% per core)

cpuusg

Regarding the score recording for TensorBoard, couldn't we record the average of the 16 threads? Sometimes the current high score is achieved by a thread other than index 0, which results in the summary writers not recording the current "overall" performance of the agents.

miyosuda commented 8 years ago

@joabim Thanks! As far as I can see from your result, the CPU usage seems OK. I'll stick with multithreading for now.

Regarding the score recording for tensorboard, couldn't we record the average of the 16 threads?

I was thinking the same thing. I've added a modification to record the scores from all threads and pushed it to the "all_scores" branch. I'll test this branch, and if there is no problem, I'll merge it into master.

(I'm not averaging the score, but does this help?)
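
For anyone wiring this up themselves, a hedged sketch of writing one score tag per thread plus an averaged tag via the summary protobuf (the tag names and this helper are illustrative, not what the all_scores branch actually does):

import tensorflow as tf

def record_scores(writer, scores, global_t):
    # scores: latest episode score from each thread; writer: an existing summary writer.
    summary = tf.Summary()
    for i, s in enumerate(scores):
        summary.value.add(tag="score/thread_%d" % i, simple_value=float(s))
    summary.value.add(tag="score/average", simple_value=sum(scores) / float(len(scores)))
    writer.add_summary(summary, global_t)
    writer.flush()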

ghost commented 8 years ago

I think we should follow the epoch convention for testing, and only with respect to the global network parameters, not the per-thread parameters.

That is, we should train using all threads, update the global parameters, and periodically test only the global network. That is what was done in the paper: every 4 million frames (1 million steps, the value of T), a testing epoch is run that lasts 500,000 frames (125,000 steps). I think this will make it better... What do you think?
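
For concreteness, a rough sketch of that schedule (all constants, function names, and the env/network interfaces below are assumptions, not the repo's code):

# Rough sketch of the evaluation-epoch schedule described above.
EVAL_INTERVAL_STEPS = 1000000  # test every 1M steps (4M frames with a frame skip of 4)
EVAL_LENGTH_STEPS = 125000     # each testing epoch lasts 125k steps (500k frames)

def maybe_evaluate(global_t, last_eval_t, global_network, env):
    # Only evaluate once every EVAL_INTERVAL_STEPS global steps.
    if global_t - last_eval_t < EVAL_INTERVAL_STEPS:
        return last_eval_t
    steps, episode_scores, score = 0, [], 0.0
    state = env.reset()
    while steps < EVAL_LENGTH_STEPS:
        action = global_network.choose_action(state)  # act using the shared parameters only
        state, reward, terminal = env.step(action)
        score += reward
        steps += 1
        if terminal:
            episode_scores.append(score)
            score = 0.0
            state = env.reset()
    mean = sum(episode_scores) / max(len(episode_scores), 1)
    print("eval at step %d: mean score %.1f over %d episodes" % (global_t, mean, len(episode_scores)))
    return global_t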

Aravind

ghost commented 8 years ago

We also don't need to output the per-thread score (even if it is only thread 0) during the training phase.

This is just following the same convention as DQN in DeepMind's code or Nathan Sprague's Lasagne implementation.

Aravind

originholic commented 8 years ago

@miyosuda Very glad to hear that learning is going well in the ALE environment!! I also agree with @aravindsrinivas's idea. How about we add one more thread dedicated to testing the global network, i.e. 16 training threads + 1 validation thread, so we only need to monitor the validation thread for the score as a function of the global T?

If we use the multiprocessing module, I think we can make the validation thread handle all the I/O scheduling to speed up training. I am currently working on the multiprocessing module since I want to improve the training speed for the continuous domain as well.
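
A rough sketch of the "16 training threads + 1 validation thread" layout being proposed (the function bodies below are placeholders, not the repo's training code):

import threading
import time

PARALLEL_SIZE = 16  # number of training threads, as discussed above

def train_function(thread_index):
    # Placeholder for a per-thread A3C training loop (assumption, not the repo's code).
    time.sleep(0.1)

def validate_function():
    # Placeholder for a loop that periodically evaluates the global network.
    time.sleep(0.1)

threads = [threading.Thread(target=train_function, args=(i,)) for i in range(PARALLEL_SIZE)]
threads.append(threading.Thread(target=validate_function))  # 16 trainers + 1 validator

for t in threads:
    t.start()
for t in threads:
    t.join()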

miyosuda commented 8 years ago

@aravindsrinivas @originholic Thank you for the suggestion. Let me think about how to do validation with global network efficiently.

I am currently working on the multiprocessing module since I want to improve the training speed of the continuous domain as well.

Great! If the performance improves, let me know! I think the current 80% CPU usage leaves some room for improvement.

joabim commented 8 years ago

Alright, now I'm back, and I understand that we need to change the implementation. The results from my run over the weekend follow (even though they don't matter anymore):

a3c2

miyosuda commented 8 years ago

@joabim Thank you for testing.

In my environment, the result on my machine over these 4 days is

4days

and the step count was around 53 million. (The learning rate becomes zero after 60 million steps.) Hmm, what's the difference between your environment and mine...

Let me try one more run to check.

miyosuda commented 8 years ago

I hadn't noticed, but I found that in muupan's project,

https://github.com/muupan/async-rl

the learning rates of the actor and critic are the opposite of mine: the learning rate of the actor is half of the critic's, and his result is better than mine.

I'll try his setting in this branch.

https://github.com/miyosuda/async_deep_reinforce/tree/muupan_lr_setting

miyosuda commented 8 years ago

After 59 million steps of learning, I visualized the weights of the first convolution layer.

$ python a3c_visualize.py

The result was like this.

weights_with_note

I think the second column represents the upward movement of the paddle. (One column represents the 4 frames of the input.)

joabim commented 8 years ago

I realized I had forgotten to install ALE (the fork you made with correct Pong support and multithreading) in my new Anaconda environment after building it, which resulted in my previous run using the wrong version of ALE... I am redoing the test for 16 threads using the muupan_lr_setting branch now! I'll let you know how it goes.

miyosuda commented 8 years ago

@joabim I see. BTW, I'm going to ask muupan about his settings in his issue thread. He said that he asked the DeepMind authors about tuning, and I hope I can apply his feedback to my code later.

muupan commented 8 years ago

@miyosuda For everyone's information, I summarized their settings here: https://github.com/muupan/async-rl/wiki

miyosuda commented 8 years ago

@muupan Thank you!

cjratcliff commented 8 years ago

@miyosuda Hi. I've got an LSTM working with your code. I've only tested it on a toy problem (a 4-state MDP) rather than an Atari game, but it seems to be working properly, and as well as the feedforward net does. The code is at https://github.com/cjratcliff/async_deep_reinforce. I've made quite a few changes for my own version, many of them outside the LSTM parts, so I'm happy to answer any questions. For using it on Atari, in addition to increasing the RNN size, I'd recommend changing the cell type from BasicRNNCell to BasicLSTMCell and removing the activation function argument to that function.
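
A minimal sketch of the cell swap described above, using the tf.nn.rnn_cell module of that TensorFlow era (the hidden size of 256 follows the paper's LSTM variant and is an assumption here, not taken from either repo):

import tensorflow as tf

RNN_SIZE = 256  # assumed hidden size, as in the paper's LSTM variant

# Before (toy-problem version): a plain RNN cell with an explicit activation argument
# cell = tf.nn.rnn_cell.BasicRNNCell(RNN_SIZE, activation=tf.nn.relu)

# After (suggested for Atari): an LSTM cell, with no activation argument passed
cell = tf.nn.rnn_cell.BasicLSTMCell(RNN_SIZE)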

miyosuda commented 8 years ago

@cjratcliff Thank you for sharing your LSTM version!!! Let me try it!!!

miyosuda commented 8 years ago

@cjratcliff I've pushed my LSTM version. To make Pong work with LSTM, I added:

1) Unrolling the LSTM cell up to 5 time steps (LOCAL_T_MAX time steps). The back-prop calculation is now batched over the unrolled LSTM.
2) Calling actions.reverse(), states.reverse(), etc. again to restore the normal input order. When calculating "R", I call reverse() to make the calculation easier (because, starting from the last state, R can be calculated recursively as written in the original paper), so I call reverse() a second time to fix the order (see the sketch below).
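
For readers following along, a rough sketch of point 2) above (the names here are illustrative, not the exact repo code):

GAMMA = 0.99  # assumed discount factor

def build_batch(states, actions, rewards, values, bootstrap_R):
    # Walk backwards so R can be accumulated recursively from the last state.
    R = bootstrap_R
    batch_s, batch_a, batch_td, batch_R = [], [], [], []
    for s, a, r, v in zip(reversed(states), reversed(actions),
                          reversed(rewards), reversed(values)):
        R = r + GAMMA * R
        batch_s.append(s)
        batch_a.append(a)
        batch_td.append(R - v)   # advantage estimate for the policy loss
        batch_R.append(R)
    # Reverse again: the unrolled LSTM needs the steps back in chronological order.
    for b in (batch_s, batch_a, batch_td, batch_R):
        b.reverse()
    return batch_s, batch_a, batch_td, batch_R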

With LSTM, the Pong score hit the maximum easily. Thanks.

cjratcliff commented 8 years ago

@miyosuda Great to see it working so well, thanks.

Itsukara commented 8 years ago

@miyosuda Your work is great! I tried your program on the game Breakout. After one day of training, I got 833 points as the maximum score.

BTW, during training I ran into some trouble: "pi" sometimes becomes NaN, and the saved data was then unusable for demo play.

The reason is that "pi" can become 0.0 and your code does not handle that correctly. I think you should change the following code in the file "game_ac_network.py". I changed it as shown below and have had no problems so far.

# Before: tf.log(self.pi) produces -inf (and then NaN) when pi reaches 0.0
entropy = -tf.reduce_sum(self.pi * tf.log(self.pi), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(self.pi), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )

# After: clip pi before taking the log so it never becomes exactly 0.0
entropy = -tf.reduce_sum(self.pi * tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), reduction_indices=1)
policy_loss = - tf.reduce_sum( tf.reduce_sum( tf.mul( tf.log(tf.clip_by_value(self.pi, 1e-20, 1.0)), self.a ), reduction_indices=1 ) * self.td + entropy * entropy_beta )
miyosuda commented 8 years ago

@Itsukara Sorry for the late reply (I didn't notice your post until now), and thank you for the suggestion! As you say, my code can't handle a zero pi value. I'll test your fix and apply it to my repo later. Thanks!

sahiliitm commented 8 years ago

The code performs really well on some games, but on others it doesn't reach the scores reported in the paper, and I wonder why. For example, the reported score for Space Invaders is 23,846.0, and the model I trained comes nowhere near that. :( Did anyone else manage to get better than around 1,500 for Space Invaders?

mw66 commented 7 years ago

I just saw some discussion about using multiprocessing in this thread; I wonder what the current status is?

I opened a dedicated ticket on this:

https://github.com/miyosuda/async_deep_reinforce/issues/27

dongleecsu commented 7 years ago

Hi @miyosuda, thanks for sharing the code. I have a question about the A3C LSTM implementation.

In the class GameACLSTMNetwork, at line 217, why are the LSTM weights shared among threads? Maybe it makes sense to create "no reuse" LSTM weights for every worker and the global_network, and to synchronize all the variables from the global_network's.
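
A hedged sketch of that proposal: build each worker's LSTM in its own variable scope (no reuse) and copy the global network's variables into it. All names, shapes, and the scope layout here are assumptions, not the repo's actual structure.

import tensorflow as tf

RNN_SIZE = 256  # assumed LSTM size

def build_lstm_tower(scope_name, inputs):
    # Variables are created inside this scope when the RNN is unrolled,
    # so each scope gets its own (non-reused) LSTM weights.
    with tf.variable_scope(scope_name):
        cell = tf.nn.rnn_cell.BasicLSTMCell(RNN_SIZE)
        outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
        return outputs

def make_sync_op(src_scope, dst_scope):
    # Copy every trainable variable of the global network into the worker's copy.
    src = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=src_scope)
    dst = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=dst_scope)
    return tf.group(*[d.assign(s) for s, d in zip(src, dst)])

inputs = tf.placeholder(tf.float32, [None, 5, 64])  # (batch, time, features): illustrative shape
build_lstm_tower("global", inputs)
build_lstm_tower("worker_0", inputs)
sync_worker_0 = make_sync_op("global", "worker_0")  # run this op to pull the global weights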

Thanks!