swan-utokyo / deir

DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards

Atari code #3

hlsafin opened this issue 2 months ago

hlsafin commented 2 months ago

Hi, I've spent the past week or so trying to reproduce the Montezuma's Revenge results you posted in reply to one of my earlier questions, using the same hyperparameters you shared, but the agent doesn't get above 100 reward even after 10 million steps. My question is: would it be possible to publish your Atari code as well, perhaps on a separate branch?

Thank you.

swan-utokyo commented 2 months ago

Hi @hlsafin, thank you for your post.

At your request, we have uploaded the code from our preliminary Atari experiments (conducted in June 2023) to a separate branch. Apologies that, since we haven't been working on this project for a while, we haven't had time to organize and comment the code in that branch. Still, we believe that with that code and the hyperparameters we shared earlier, your agents should be able to obtain returns higher than 100 in Montezuma's Revenge.

However, there are a few points to note:

Firstly, since we didn't conduct an exhaustive hyperparameter search for Atari games, the hyperparameters we provided earlier are for reference only and are likely not optimal. For example, we found that setting frame_skip=8 can lead to faster convergence, but skipping too many frames may cause the agent to miss the right timing to pass certain obstacles, which might be one reason it is hard to get a return higher than 2500. Conversely, if too few frames are skipped, more timesteps are needed to obtain the same rewards, and hyperparameters such as gamma and gae_lambda need to be adjusted accordingly.
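
If it helps, here is a minimal, hypothetical sketch of how these settings fit together in a plain Gymnasium + Stable-Baselines3 setup. This is not the code in our branch, and the gamma / gae_lambda values below are placeholders rather than the hyperparameters we shared:

```python
import gymnasium as gym
import ale_py  # provides the Atari (ALE) environments
from gymnasium.wrappers import AtariPreprocessing
from stable_baselines3 import PPO

# With recent Gymnasium/ale-py versions, ALE environments must be registered explicitly.
gym.register_envs(ale_py)

# frame_skip trades wall-clock convergence against action-timing precision:
# with frame_skip=8 the agent acts only once every 8 frames, so precisely
# timed moves (e.g. passing certain obstacles) become harder to execute.
env = gym.make("MontezumaRevengeNoFrameskip-v4")
env = AtariPreprocessing(env, frame_skip=8, screen_size=84,
                         grayscale_obs=True, grayscale_newaxis=True)

# With a smaller frame_skip, each episode spans more timesteps, so gamma and
# gae_lambda typically need to move closer to 1.0 for the sparse rewards to
# propagate back to the decisions that caused them.
model = PPO("CnnPolicy", env, gamma=0.999, gae_lambda=0.95, verbose=1)
model.learn(total_timesteps=10_000_000)
```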

Secondly, in the first room of Montezuma's Revenge, after getting the key the agent may choose to open either the upper-left or the upper-right door and thus enter different rooms. If the agent chooses the right door, it may reach 2500 points within a short time; if it chooses the left door, it may only end up with 400 points. As discussed in the RND paper [1] (footnote on page 8), the agent's choice is essentially random, but once it receives an extrinsic reward from that choice, it is likely to consistently choose the same door in later episodes. Therefore, among runs with non-zero returns, some may converge to an episode return of 2500 while others converge to only 400.

Furthermore, we observed that even with the provided hyperparameters, DEIR may occasionally (with relatively low probability) fail to get any rewards at all. This could be related to the advantage normalization we adopted: it can accelerate convergence by scaling up near-zero advantages, but it may also destabilize the network updates. If you encounter this, we suggest disabling advantage normalization (though you may then need to adjust other hyperparameters accordingly).
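
For concreteness, here is a minimal sketch (not our implementation) of the standard per-batch advantage normalization and why it can amplify near-zero advantages:

```python
import numpy as np

def normalize_advantages(adv: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Standard per-batch normalization used by many PPO implementations.
    return (adv - adv.mean()) / (adv.std() + eps)

# When a batch's raw advantages are all close to zero (little extrinsic or
# intrinsic signal), dividing by their tiny standard deviation rescales them
# to roughly unit magnitude, so the policy is updated strongly on what is
# essentially noise.
tiny = np.array([1e-6, -1e-6, 2e-6, -2e-6])
print(normalize_advantages(tiny))  # roughly [0.63, -0.63, 1.26, -1.26]
```

If you are running a Stable-Baselines3-style PPO, this corresponds to the normalize_advantage flag; turning it off avoids the amplification, at the cost of weaker updates when the raw advantages are genuinely small.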


[1] Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by Random Network Distillation. In Seventh International Conference on Learning Representations (ICLR), 2019.