miyosuda / async_deep_reinforce

Asynchronous Methods for Deep Reinforcement Learning
Apache License 2.0

Failing to fully replicate Pong with A3C-LSTM #15

Open revilokeb opened 8 years ago

revilokeb commented 8 years ago

@miyosuda I have been trying to replicate your very nice Pong A3C-LSTM chart (https://github.com/miyosuda/async_deep_reinforce/blob/master/docs/graph_24h_lstm.png). Unfortunately, so far I have not really succeeded.

I have been using the parameter settings in constants.py (with USE_LSTM=True and USE_GPU=True). I have also set frame_skip=4 in ale.cfg (master branch, TF r0.10, i7-5930K @ 3.5 GHz with 8 threads, Nvidia Titan X).
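For reference, this is roughly the configuration described above as a sketch; only USE_LSTM, USE_GPU, frame_skip, and the 8 threads are stated in this thread, the other names and values are assumptions and may differ from the repo's constants.py:

```python
# constants.py (sketch of the run configuration; USE_LSTM and USE_GPU are the
# flags named above, the remaining names/values are assumed)
USE_LSTM = True        # A3C-LSTM instead of A3C-FF
USE_GPU = True         # run the shared network on the GPU
PARALLEL_SIZE = 8      # 8 asynchronous actor-learner threads, matching the CPU

# ale.cfg (Arcade Learning Environment)
# frame_skip=4         # repeat each chosen action for 4 frames
```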

My question: should the above parameter settings be sufficient to reproduce your A3C-LSTM chart?

When doing the above, my charts look as follows (I have done multiple runs, also simulating for more than 60M steps): [chart: 160920-a3c_lstm_pong_miyosuda_master5] In my case learning seems to be much slower and saturates at around -5 to 0 (better seen on my runs of more than 60M steps).

Comparing with Figure 3 of http://arxiv.org/abs/1602.01783 (Pong, A3C, 8 threads), it seems your result is faster in terms of the number of steps needed to reach a score of 20: you require ~20-25M steps, whereas DeepMind on average seemed to need ~50-60M (though theirs is an averaged value, and from the paper I don't know whether it was A3C-FF or A3C-LSTM).

My second question: have you run A3C-FF / A3C-LSTM multiple times, and were the results similar? And do you have an explanation for why A3C-FF does not reach a score of 20? (I have also run A3C-FF, and my curve looks similar to yours...)

Many many thanks for the code!!

miyosuda commented 8 years ago

@revilokeb Thank you for trying my code. I tried Pong with the master branch recently, and I couldn't reach a score of +20.0 like I did when I recorded the A3C-LSTM graph.

When I recorded the graph, there were a few differences compared with the current master.

1) I was using TensorFlow r0.8 or r0.9. (I can't remember precisely whether it was 0.8 or 0.9, but it was not r0.10, because from r0.10 the TensorBoard graph color changed from blue to red.) (The speed in frames per second with r0.10 is now faster than the speed written in README.md on the same machine, so something in TensorFlow changed as of r0.10.)

2) I added log_pi clipping to avoid NaN after recording that graph (see the sketch after this list): https://github.com/miyosuda/async_deep_reinforce/commit/41f2d75499accb1b2073daae295603bb6d89f88e
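A minimal sketch of the log_pi clipping idea in point 2), assuming TensorFlow 1.x-style ops; the variable names are illustrative and not necessarily the repo's exact code:

```python
import tensorflow as tf

# Clamp the policy probabilities away from zero before taking the log, so the
# policy-gradient term log(pi(a|s)) cannot become -inf/NaN when an action
# probability collapses to 0. `logits` stands in for the network's output.
logits = tf.placeholder(tf.float32, [None, 6])        # 6 = assumed action size for Pong
pi = tf.nn.softmax(logits)                            # policy distribution
log_pi = tf.log(tf.clip_by_value(pi, 1e-20, 1.0))     # clipped log-probabilities
```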

The commit when I recorded the graph was this one: https://github.com/miyosuda/async_deep_reinforce/commit/9f97b2bddf7217684f2948b2d66453921134cf4b (I used 8 threads with USE_LSTM and the USE_GPU flag on.)

However, I think neither 1) nor 2) is related to the score. I can't tell whether I was just very lucky when I recorded the graph.

The video I recorded of the A3C-LSTM agent after 24 hours of training is here: https://www.youtube.com/watch?v=KJt1X-tRCbw

My machine is currently occupied with another task, so I'll try again once it becomes available.

Thank you for the trial.

revilokeb commented 8 years ago

@miyosuda Thank you for your detailed reply. I might give https://github.com/miyosuda/async_deep_reinforce/commit/9f97b2bddf7217684f2948b2d66453921134cf4b another try later. If I find additional insights into this, I will post them here, too.

danijar commented 8 years ago

@miyosuda That video looks like the agent memorized the environment. I think the paper's authors use random starts to create a fairer evaluation: they sample a number from 0 to 20 and perform that many no-op actions at the beginning of each episode, before the agent kicks in.

joabim commented 8 years ago

@danijar The reset function in game_state.py handles the no-ops (the maximum number is defined in constants.py) at each reset.
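For illustration, a minimal sketch of such a random no-op start, assuming an ALE-style interface; NO_OP_MAX, the function name, and the default no-op action are placeholders rather than the repo's exact code:

```python
import random

NO_OP_MAX = 20  # upper bound mentioned above; in the repo the value lives in constants.py

def reset_with_random_start(ale, no_op_action=0):
    """Begin each episode with a random number of no-op actions so the agent
    cannot exploit a single deterministic start state."""
    ale.reset_game()
    for _ in range(random.randint(0, NO_OP_MAX)):
        ale.act(no_op_action)  # action 0 is NOOP in ALE
```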

danijar commented 8 years ago

I see, so the agent just learned a good behavior that results in very repetitive episodes.

giuseppebonaccorso commented 7 years ago

I've seen that you always pick a random action without an exploration factor. How can you reach such a high score without argmax? Are you still picking a random action when the graph shows a score of 20? It's terribly weird and interesting! I think that with a minimal action set (maybe 3 actions) the chance of picking the right one at random is relatively high over a sequence (so an error can be corrected), but I still don't understand how that result can be achieved.

babaktr commented 7 years ago

Hey @giuseppebonaccorso! It's not fully random, but rather based on a weighted probability where the action with the highest value also has the highest probability of being selected (think softmax, almost) :)
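A minimal sketch of that probability-weighted selection, assuming `pi_values` is the softmax output of the policy network (not necessarily the repo's exact choose_action):

```python
import numpy as np

def choose_action(pi_values):
    # Sample an action index with probability proportional to the policy
    # output, instead of always taking the argmax.
    return np.random.choice(len(pi_values), p=pi_values)
```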

giuseppebonaccorso commented 7 years ago

@babaktr You're right! I was looking at random.choice without considering the probabilities. :( It becomes almost deterministic if the entropy is low enough.
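To make that concrete, a quick check with an assumed helper (not from the repo): once the policy distribution is sharply peaked, its entropy is low and sampling behaves almost like argmax.

```python
import numpy as np

def policy_entropy(pi_values, eps=1e-8):
    # Entropy of the policy distribution in nats; near zero means the
    # "random" sampling above is effectively deterministic.
    pi = np.asarray(pi_values, dtype=np.float64)
    return -np.sum(pi * np.log(pi + eps))

print(policy_entropy([0.98, 0.01, 0.01]))  # ~0.11 nats: sampling is nearly deterministic
print(policy_entropy([1/3, 1/3, 1/3]))     # ~1.10 nats: maximum entropy for 3 actions
```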

itane13 commented 7 years ago

Hey all,

I have been running the latest version after Miyoshi ported it to TF 1.0. I removed gradient clipping to test on Asteroids, where I was worried it wasn't converging, and I also tried Pong. I get the performance below on Pong, with all settings as they are in the repo except actions = 6 (ale.get_minimal_actionset):

[chart: scores on Pong]

Edit: as for Asteroids... I think the reason for non-convergence is that the ship randomly(?) disappears from the screen for an indefinite(?) number of frames.
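For anyone reproducing the actions = 6 setting mentioned above, this is roughly how the minimal action set can be queried with the ALE Python bindings; the import path, ROM name, and string/bytes handling are assumptions that may vary by ALE version:

```python
from ale_python_interface import ALEInterface  # assumed package name for the ALE bindings

ale = ALEInterface()
ale.loadROM(b"pong.bin")                       # ROM path/name is illustrative
minimal_actions = ale.getMinimalActionSet()
print(len(minimal_actions))                    # 6 for Pong, matching actions = 6 above
```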

mklissa commented 7 years ago

I can also confirm good results with tf 1.0 on a simple MacBook Pro 13"

1601214542 commented 6 years ago

Hello, why am I still stuck at a score of -21 when the step count is 8M? I am confused. Is it OK to directly run a3c.py? I am using TensorFlow 1.2.