tambetm / simple_dqn

Simple deep Q-learning agent.
MIT License

The test score is different from the DeepMind paper #32

Open futurecrew opened 8 years ago

futurecrew commented 8 years ago

Hi, Thank you for the great project.

While testing simple_dqn I found that its test scores differ from those in the DeepMind paper.

The DeepMind paper 'Prioritized Experience Replay' (http://arxiv.org/pdf/1511.05952v3.pdf) shows learning curves of DQN in Figure 7. The gray is the original DQN according to the paper.

When comparing the curves in the paper with the attached png files in simple_dqn/results folder, the test score is somewhat different.

In Breakout the paper says the original DQN reaches a score of around 320, but simple_dqn doesn't. Also, in Seaquest the paper says the original DQN reaches a score of more than 3000, but simple_dqn doesn't.

The paper doesn't say much about the test code or environment used for the original DQN result. I'm also not sure whether the result in the paper was produced with the following DeepMind code or not. https://sites.google.com/a/deepmind.com/dqn/

Do you have any idea why there are score differences between simple_dqn and the DeepMind paper?

Thank you

mthrok commented 8 years ago

AFAIK, the RMSProp optimizer implementation and the screen preprocessing are different from the paper.

In the DeepMind code, RMSProp is implemented here; it is similar to the RMSProp by A. Graves (see page 23, eq. 40; the parameters are not the same as in the DQN paper).

simple_dqn uses Neon's RMSProp. Notice that epsilon appears twice and the denominator does not subtract the square of the mean gradient.
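
To make the difference concrete, here is a rough numpy sketch of the two update rules as I understand them (illustrative only, not the actual DeepMind or Neon code; the learning rate, decay and epsilon values are placeholders):

import numpy as np

def rmsprop_graves_step(w, g, n, m, lr=0.00025, decay=0.95, eps=0.01):
    # Graves-style RMSProp: the denominator subtracts the square of the
    # running mean gradient (a centred second moment); epsilon appears once.
    n = decay * n + (1 - decay) * g ** 2   # running average of g^2
    m = decay * m + (1 - decay) * g        # running average of g
    w = w - lr * g / np.sqrt(n - m ** 2 + eps)
    return w, n, m

def rmsprop_plain_step(w, g, s, lr=0.00025, decay=0.95, eps=1e-6):
    # Plain (uncentred) RMSProp as in Neon: no mean-gradient term, and
    # epsilon is applied both inside and outside the square root.
    s = decay * s + (1 - decay) * g ** 2
    w = w - lr * g / (np.sqrt(s + eps) + eps)
    return w, s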

The difference in screen preprocessing is mentioned here. simple_dqn averages over the skipped frames (ALE's built-in functionality), instead of taking the pixel-wise max over two successive frames as in the paper.
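
For illustration, a minimal sketch of the two preprocessing variants (assuming grayscale frames as numpy arrays; function names are mine):

import numpy as np

def average_over_skipped_frames(frames):
    # ALE-style built-in skip: average all frames seen during the skip
    return np.mean(np.stack(frames), axis=0)

def max_over_last_two_frames(frames):
    # DQN-paper preprocessing: pixel-wise max over the last two frames,
    # which removes the flicker of sprites drawn only on alternate frames
    return np.maximum(frames[-2], frames[-1])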

Correct me if I am wrong.

tambetm commented 8 years ago

Yes @mthrok, those are some of the differences. But I'm not sure how much they matter; for example, you can easily switch from RMSProp to Adam, which seems to be the preferred optimization method these days.

Another important difference is that DeepMind considers loss of life as episode end, something I don't do yet. I would expect more substantial differences from that, but who knows.

There is also a discussion about matching DeepMind's results on the deep-q-learning list: https://groups.google.com/forum/#!topic/deep-q-learning/JV384mQcylo

Keeping this issue open until we figure out the differences.

futurecrew commented 8 years ago

I got a score similar to the DeepMind paper in kung_fu_master after taking loss of life as the terminal state. Previously, with the original simple_dqn source, kung_fu_master showed a much lower score.

I treated the loss of a life as the terminal state but didn't reset the game, which is different from DeepMind. Without the modification (original simple_dqn), I got a test score of 155 at epoch 22 in kung_fu_master. With the modification, I got a test score of 9,500 at epoch 22 and 19,893 at epoch 29.
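
Roughly what I mean, as a sketch with hypothetical names (not my actual code): the stored transition is marked terminal on life loss, but the emulator is only reset on a real game over.

def step_with_life_loss_terminal(ale, replay_memory, state, action):
    # Hypothetical helper: treat losing a life as terminal for the stored
    # transition, but only reset the emulator on a real game over.
    lives_before = ale.lives()
    reward = ale.act(action)
    lost_life = ale.lives() < lives_before
    game_over = ale.game_over()

    # the replay memory sees life loss as the end of an episode ...
    replay_memory.add(state, action, reward, terminal=(lost_life or game_over))

    # ... but the game itself is reset only on a true game over
    if game_over:
        ale.reset_game()
    return reward, lost_life, game_over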

mthrok commented 8 years ago

@tambetm

Just to clarify, the current implementation does not store test experience, right? The README still says it does, https://github.com/tambetm/simple_dqn#known-differences, but 1b9ceac seems to fix it.

futurecrew commented 8 years ago

Two more changes that get closer to the DeepMind paper.

1) The DeepMind code copies the trained net to the target net once every 10,000 steps, and it counts these as environment steps, not training steps. After counting environment steps instead, I got much faster learning curves (and maybe also higher test scores). A rough sketch of this counting is at the end of this comment.

2) DeepMind uses a fan_in parameter initializer while simple_dqn uses a Gaussian. After switching to a Xavier initializer (which is similar to fan_in) I got faster learning curves and higher test scores than with the Gaussian.

from neon.initializers import Xavier
from neon.layers import Affine, Conv
from neon.transforms import Rectlin

# same architecture as before, only the initializer changed from Gaussian to Xavier
initializer = Xavier()
layers = [
    Conv(fshape=(8, 8, 32), strides=4, init=initializer, bias=initializer, activation=Rectlin()),
    Conv(fshape=(4, 4, 64), strides=2, init=initializer, bias=initializer, activation=Rectlin()),
    Conv(fshape=(3, 3, 64), strides=1, init=initializer, bias=initializer, activation=Rectlin()),
    Affine(nout=512, init=initializer, bias=initializer, activation=Rectlin()),
    Affine(nout=num_actions, init=initializer, bias=initializer)
]

With these two changes plus the previous fix (taking loss of life as game over for training), I got test scores in kung_fu_master that are almost identical, up to 32 epochs, to the DQN curve in Figure 7 of 'Prioritized Experience Replay' (http://arxiv.org/pdf/1511.05952v3.pdf).
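
The sketch of the counting in point 1) (a rough illustration; the environment/agent interface and names are mine, not simple_dqn's or DeepMind's):

def training_loop(env, agent, online_net, target_net, total_env_steps,
                  train_frequency=4, target_sync_interval=10000):
    # Hypothetical sketch: the target sync interval is counted in environment
    # steps (agent observations), not in training (parameter update) steps.
    state = env.reset()
    for env_step in range(1, total_env_steps + 1):
        action = agent.choose_action(state)
        state, reward, terminal = env.step(action)   # one env step = skip_frame ALE frames

        if env_step % train_frequency == 0:
            agent.train_minibatch()                  # one parameter update on the online net

        if env_step % target_sync_interval == 0:
            target_net.set_weights(online_net.get_weights())   # copy trained net to target net

        if terminal:
            state = env.reset()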

mthrok commented 8 years ago

@only4hj

Good job!

Just to clarify

1) 10,000 env steps is equivalent to 2,500 observations from the agent's point of view when skip_frame=4, right?

2) Can you point me to the initialization part of the code? Is it torch nn's default initialization or a custom one?

BTW, what value of repeat_action_probability are you using?

mthrok commented 8 years ago

@only4hj In the original DQN paper, the target network update frequency is described as "The frequency (measured in the number of parameter updates) with which the target network is updated ..."

And it seemed to me that the corresponding code tracks the number of training steps. Never mind, you are right, it's the number of perceptions; I was misreading it, sorry.

tambetm commented 8 years ago

Thanks @only4hj and @mthrok for the wonderful analysis, I included bits of it in the README. I would be happy to merge any pull requests regarding this. In particular, the target network update interval and Xavier initialization seem like trivial fixes.

futurecrew commented 8 years ago

@mthrok 1) 10,000 env steps means 10,000 observations from the agent's point of view with skip_frame=4 (i.e. 40,000 ALE frames).

2) I think it's torch's default initialization. See these. https://groups.google.com/d/msg/deep-q-learning/JV384mQcylo/De1Jzc0hAAAJ https://github.com/gtoubassi/dqn-atari/blob/master/dqn.py#L148

Regarding repeat_action_probability, I didn't use ALE's frame_skip. Instead I skipped frames in my own implementation, as follows, with frame_repeat = 4. This works just like repeat_action_probability = 0.

https://github.com/only4hj/DeepRL/blob/master/deep_rl_player.py#L165

def doActions(self, actionIndex, mode):
    action = self.legalActions[actionIndex]
    reward = 0
    lostLife = False
    lives = self.ale.lives()
    # repeat the chosen action for frame_repeat ALE frames
    for f in range(self.settings['frame_repeat']):
        reward += self.ale.act(action)
        gameOver = self.ale.game_over()
        if self.ale.lives() < lives or gameOver:
            lostLife = True

            # during training, losing a life counts as game over
            if mode == 'TRAIN' and self.settings['lost_life_game_over'] == True:
                gameOver = True

            break
    state = self.getScreenPixels()

    return reward, state, lostLife, gameOver

mthrok commented 8 years ago

@only4hj

Thank you for clarifying. Now I understand that frame skipping is processed in their ALE wrapper and is invisible to the agent.

I was having trouble setting ALE's repeat_action_probability to 1. Now I see how you are testing things. Thank you very much.

@tambetm I created a PR for changing the initialization method: #33. I will try to investigate the target network sync interval mentioned above after this.

tambetm commented 8 years ago

Thanks @mthrok, merged the PR! Keep me posted if you figure out the network update interval.

kerawits commented 7 years ago

Quoted from the Nature paper:

target network update frequency: 10000. The frequency (measured in the number of parameter updates) with which the target network is updated (this corresponds to the parameter C from Algorithm 1).

action repeat: 4. Repeat each action selected by the agent this many times. Using a value of 4 results in the agent seeing only every 4th frame.

update frequency: 4. The number of actions selected by the agent between successive SGD updates. Using a value of 4 results in the agent selecting 4 actions between each pair of successive updates.

Since the agent sees the image and makes a prediction only once every 4th frame (due to action repeat = 4), and it only updates its online network once every 4th prediction (due to update frequency = 4), then with target network update frequency = 10000, doesn't that mean the target network should get updated on every 10000th parameter update, i.e. once every 40000 predictions, which is once every 160000 frames?
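
Spelling out that arithmetic as a quick sanity check in Python:

action_repeat = 4        # ALE frames per agent action
update_frequency = 4     # agent actions per SGD update
target_update = 10000    # SGD updates per target network sync

predictions_per_sync = target_update * update_frequency   # 40000 predictions
frames_per_sync = predictions_per_sync * action_repeat    # 160000 frames
print(predictions_per_sync, frames_per_sync)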

tambetm commented 7 years ago

@kerawits your reasoning seems valid. Because we see only every 4th frame, I think the target_steps parameter value should be 40000. It would be nice if somebody could do a test run...

Seraphli commented 7 years ago

I ran the code from here and the mean score seems to be able to reach 400. BTW, I changed the network architecture to the original DQN one; the original code has that part commented out.