reproducing your stellar result on Montezuma's Revenge #18

dhfromkorea commented 7 years ago

Hi Steve, I am trying to reproduce the ~3600 score you achieved on Montezuma's Revenge with your dqn-cts model (as per the gif image on README).

After 30M steps and counting, the model does not seem to learn. It very occasionally gets the key (+100 points) and that's all. I ran your code as is and did not modify a single line.

1) Could I ask if you can reproduce 3600 "on average" with your dqn-cts?

2) Also, would you say I should try hyperparameter settings other than the ones you set as defaults?

I look forward to your advice.

Best wishes,

steveKapturowski commented 7 years ago

@dhfromkorea I ran the agent several times and it would usually get close to that score. If it's not getting above 100 then there's something quite seriously wrong. May I ask precisely what command you are using to run the agent and what commit you are on?

dhfromkorea commented 7 years ago

@steveKapturowski I am running on HEAD of master branch w/ python2.7 tensorflow(cpu) 1.2.1

The command is: python2 main.py MontezumaRevenge-v0 --load_config config/dqn-cts.yaml -n 32

steveKapturowski commented 7 years ago

Can you try adding the following options: --q_target_update_steps=30000 --max_global_steps=160000000 --epsilon_annealing_steps=500000 --replay_size=500000 --clip_norm_type=ignore

The first four I'm suggesting mainly for consistency with my experiments; I suspect the norm clipping may be what's really killing performance.
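For readers unfamiliar with the term, here is a minimal pure-Python sketch of global-norm gradient clipping in general. It is not taken from this repository: `clip_by_global_norm` and `max_norm` are illustrative names, and the thread does not say which clipping variant or threshold the default configuration uses.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm.

    Illustrative helper only (not the repository's implementation).
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm == 0.0 or global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm


# Two gradient vectors whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)     # norm before clipping
print(clipped)  # every component shrunk by the same factor
```

If the threshold is small relative to typical gradient norms, every update is scaled down by the same factor, which is one plausible way clipping could slow learning. Passing --clip_norm_type=ignore, as suggested above, is the way to test that suspicion.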

sangjin-park commented 7 years ago

Hello Steve. I'm SangJin. I'm with dhfromkorea. It still doesn't look reproducible.

cmd:
python2 main.py MontezumaRevenge-v0 --load_config config/dqn-cts.yaml -n 12 \
  --q_target_update_steps=30000 \
  --max_global_steps=160000000 \
  --epsilon_annealing_steps=500000 \
  --replay_size=500000 \
  --clip_norm_type=ignore --restore_checkpoint

git: master / bcc9b2a
tensorflow-gpu==1.2.1

[2017-07-29 12:54:36,278] T2 / STEP 70243145 / REWARD 0.0 / Q_MAX 1.5947 / EPS 0.1000
[2017-07-29 12:54:36] INFO [MainThread:284] ID: 2 -- RUNNING AVG: 9 ± 90 -- BEST: 400
[2017-07-29 12:54:36,278] ID: 2 -- RUNNING AVG: 9 ± 90 -- BEST: 400
[2017-07-29 12:54:44] INFO [MainThread:279] T4 / STEP 70246725 / REWARD 0.0 / Q_MAX 1.8437 / EPS 0.0100
[2017-07-29 12:54:44,166] T4 / STEP 70246725 / REWARD 0.0 / Q_MAX 1.8437 / EPS 0.0100
[2017-07-29 12:54:44] INFO [MainThread:284] ID: 4 -- RUNNING AVG: 14 ± 98 -- BEST: 400
[2017-07-29 12:54:44,167] ID: 4 -- RUNNING AVG: 14 ± 98 -- BEST: 400
[2017-07-29 12:55:03] INFO [MainThread:279] T3 / STEP 70256320 / REWARD 0.0 / Q_MAX 1.5227 / EPS 0.2000
[2017-07-29 12:55:03,665] T3 / STEP 70256320 / REWARD 0.0 / Q_MAX 1.5227 / EPS 0.2000
[2017-07-29 12:55:03] INFO [MainThread:284] ID: 3 -- RUNNING AVG: 28 ± 179 -- BEST: 400
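To compare runs without reading the raw output, the summary entries can be pulled out with a short script. This is only a sketch in Python 3: the regular expression is inferred from the log excerpt above, `parse_summaries` is a hypothetical helper, and the assumption that "ID" is a worker/thread index comes from the matching T2/T4/T3 prefixes rather than from the code.

```python
import re

# Pattern inferred from entries such as
# "ID: 2 -- RUNNING AVG: 9 ± 90 -- BEST: 400" (assumed format).
SUMMARY_RE = re.compile(
    r"ID: (?P<worker>\d+) -- RUNNING AVG: (?P<avg>-?\d+(?:\.\d+)?)"
    r" ± (?P<std>\d+(?:\.\d+)?) -- BEST: (?P<best>-?\d+(?:\.\d+)?)"
)


def parse_summaries(log_text):
    """Return {worker_id: (running_avg, std, best)} using each worker's last entry.

    Entries that are logged twice simply overwrite themselves.
    """
    latest = {}
    for match in SUMMARY_RE.finditer(log_text):
        latest[int(match.group("worker"))] = (
            float(match.group("avg")),
            float(match.group("std")),
            float(match.group("best")),
        )
    return latest


sample = (
    "[2017-07-29 12:54:36] INFO [MainThread:284] ID: 2 -- RUNNING AVG: 9 ± 90 -- BEST: 400\n"
    "[2017-07-29 12:54:36,278] ID: 2 -- RUNNING AVG: 9 ± 90 -- BEST: 400\n"
    "[2017-07-29 12:54:44] INFO [MainThread:284] ID: 4 -- RUNNING AVG: 14 ± 98 -- BEST: 400\n"
    "[2017-07-29 12:55:03] INFO [MainThread:284] ID: 3 -- RUNNING AVG: 28 ± 179 -- BEST: 400\n"
)
summaries = parse_summaries(sample)
for worker, (avg, std, best) in sorted(summaries.items()):
    print(worker, avg, std, best)
```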

If you are interested, we could give you access to the server running the agent so that we can find out what's wrong together.

Best Regards,

steveKapturowski commented 7 years ago

Hi @sangjin-park, I'd be happy to try to debug what's going on on the server, but first could you try running on commit 39e695696488df83bf6d08a1eb7df0ff4ebd109c and tell me if there's any difference?

sangjin-park commented 6 years ago

Hi I tried 452d57 and it looks ok.

Thanks!

steveKapturowski commented 6 years ago

I'm going to check the diff between commit 452d57 and master to see what went wrong and get a fix out asap

steveKapturowski commented 6 years ago

@sangjin-park I was checking out commit 452d5735551c672e2ce44740514b105cb045075e and noticed something funny: the ordering of the context window is backwards, which I would expect to hurt performance:

https://github.com/steveKapturowski/tensorflow-rl/blob/452d5735551c672e2ce44740514b105cb045075e/utils/fast_cts.pyx#L305-L308

Compare this with the ordering in commit 39e695696488df83bf6d08a1eb7df0ff4ebd109c:

https://github.com/steveKapturowski/tensorflow-rl/blob/39e695696488df83bf6d08a1eb7df0ff4ebd109c/utils/fast_cts.pyx#L305-L308

Did you produce your OpenAI gym evaluation from the former commit?

sangjin-park commented 6 years ago

My branch's window order is the former one.

context[0] = obs[i, j-1] if j > 0 else 0
context[1] = obs[i-1, j] if i > 0 else 0
context[2] = obs[i-1, j-1] if i > 0 and j > 0 else 0
context[3] = obs[i-1, j+1] if i > 0 and j < self.width-1 else 0
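To make the ordering question concrete, here is a small plain-Python sketch of the four-neighbour context for one pixel. `context_quoted` mirrors the snippet above (left, up, up-left, up-right). `context_reversed` is an assumption about what the "backwards" ordering might look like and is not copied from either commit; both function names are hypothetical.

```python
def context_quoted(obs, i, j):
    """Neighbour order from the snippet above: left, up, up-left, up-right.

    Neighbours that fall outside the frame are replaced by 0.
    """
    width = len(obs[0])
    return [
        obs[i][j - 1] if j > 0 else 0,
        obs[i - 1][j] if i > 0 else 0,
        obs[i - 1][j - 1] if i > 0 and j > 0 else 0,
        obs[i - 1][j + 1] if i > 0 and j < width - 1 else 0,
    ]


def context_reversed(obs, i, j):
    """Hypothetical opposite ordering (assumed for illustration only)."""
    return context_quoted(obs, i, j)[::-1]


obs = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(context_quoted(obs, 1, 1))    # [4, 2, 1, 3]
print(context_reversed(obs, 1, 1))  # [3, 1, 2, 4]
print(context_quoted(obs, 0, 0))    # [0, 0, 0, 0]
```

Both functions see the same four neighbours; only their positions in the context list differ. In a context-tree model the position generally decides at which depth a symbol is consulted, which is presumably why the ordering could affect the density estimates and, through them, the exploration bonus.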
