swan-utokyo / deir

DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards

Atari testing #1

Closed hlsafin closed 7 months ago

hlsafin commented 1 year ago

Was this ever tested on hard-exploration Atari games like Montezuma's Revenge or Pitfall? If so, I'm curious to know how it performed.

swan-utokyo commented 1 year ago

Thank you for your interest in our work!

We preliminarily tested DEIR on Montezuma's Revenge and found that it is capable of exploring more efficiently there as well (it cumulatively visits more distinct states; see the definition in Appendix A.7 of our arXiv preprint).

We're working on tuning hyperparameters for Atari games and will provide more results and discussions once complete.

hlsafin commented 1 year ago

Appendix A.7 is "Detailed Experimental Results in MiniGrid"; I don't see anything relating to Montezuma's Revenge there. Maybe you can provide a link to this paper?

swan-utokyo commented 1 year ago

Sorry for my unclear explanation.

In Appendix A.7 of our arXiv preprint, we defined two metrics for evaluating the exploration efficiency of an exploration method, based on the number of distinct states visited per timestep by the agent.
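
Roughly speaking, such a metric can be sketched as follows (an illustrative snippet only, not the exact code we used for evaluation; treating each unique observation as a distinct state is a simplification):

import numpy as np

def distinct_states_per_timestep(observations):
    # observations: the sequence of observations collected by one agent.
    # Treat each unique observation as a distinct state (a simplification)
    # and report how many distinct states were visited per timestep.
    seen = set()
    for obs in observations:
        seen.add(np.asarray(obs).tobytes())
    return len(seen) / max(len(observations), 1)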

So far, our preliminary tests show better exploration efficiency (based on the above two metrics) for DEIR in Montezuma's Revenge. We'll provide those results after our Atari experiments are complete.

hlsafin commented 1 year ago

Okay, well thank you for taking the time to respond. I have questions about the preliminary testing you did on Montezuma's Revenge. What score did it reach, and at what timestep did it reach those values? Did you also test it on dense-reward games and compare it to NGU and RND? Can you provide any graphs for them?

swan-utokyo commented 1 year ago

Thanks for your patience. We preliminarily trained DEIR, RND, NGU, and PPO agents in Montezuma's Revenge with 8 runs for each method. As shown in the figure below, DEIR agents approached an average return of 2500 within about 10 million frames. To get a higher score, the agent must learn to pass a level where only limited observational novelty is available and the agent's life can easily be lost. In such levels, novelty-driven methods (including RND, NGU, and DEIR) still require a considerable number of samples and a well-tuned training scheme.

For your reference, we also attached a sample video of a DEIR agent exploring the game after 10M training frames. The above-mentioned hard game level can be found in the last five seconds of the sample video, which also shows the intrinsic reward generated at each step. Following RND's practice, we only generate non-negative intrinsic rewards in this experiment to encourage exploration and prevent the agent from losing its life and terminating the episode too early.

We did not deliberately test on games with dense rewards, since that is not the primary goal of this work. However, compared with MiniGrid games and Montezuma's Revenge, ProcGen games generally have much denser extrinsic rewards; you may refer to the ProcGen results reported in our paper (Figure 4). In my experience, intrinsic exploration rewards are usually unnecessary when the environment is already densely rewarded. However, when the rewards given by the environment are almost irrelevant to the novelty of observations, novelty-driven exploration methods may still help the agent learn a better policy somewhat faster.

hlsafin commented 1 year ago

First of all, thank you for taking the time to respond and for the effort you put into the visuals. A couple of things about this graph: how did NovelD perform on this task? Also, it's a bit strange to me that PPO is doing not so badly, considering that in the original PPO paper it got a low score on Montezuma's Revenge.

Also, in your code ngu.py, line 149 (ngu_lifelong_rewards = ngu_rnd_error + 1), I don't get why you added 1 rather than just leaving it as "ngu_rnd_error", though I do understand this part: "lifelong_reward = min(max(ngu_lifelong_rewards[env_id], 1.0), L)". I am trying to implement R2D2, and I will certainly try to include your great work in it; hopefully it performs better in these environments.

swan-utokyo commented 1 year ago

Thanks for the questions. I am very glad to provide more information for your reference. We haven't tested NovelD and ICM on Montezuma's Revenge, but we'll schedule a quick test for them and update the results here in several days.

Our PPO implementation is based on Stable Baselines 3, which introduced several techniques to help PPO perform better and more stably, including fine-tuned hyperparameters (e.g., 1e-5 for the optimizer's epsilon), generalized advantage estimation (GAE), and per-batch advantage normalization (code). Also, we applied Layer Normalization to all CNN layers and used GRUs for all agents in our Atari experiments (see more discussion in Appendix A.4, Hyperparameters, of our arXiv preprint). All of the above may collectively cause our PPO agents to give different results, and our purpose here is to use an efficient but easy-to-implement baseline to save computing and implementation costs as much as possible.
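
As a rough illustration of the kind of configuration described above (a sketch using Stable Baselines 3's public API, not our actual training script; the environment id and timestep budget are placeholders):

from stable_baselines3 import PPO

# Sketch only: vanilla SB3 PPO with GAE, per-batch advantage normalization,
# and a small Adam epsilon, in the spirit of the settings mentioned above.
model = PPO(
    "CnnPolicy",
    "ALE/MontezumaRevenge-v5",  # placeholder environment id
    gae_lambda=0.95,            # generalized advantage estimation
    normalize_advantage=True,   # per-batch advantage normalization
    policy_kwargs=dict(optimizer_kwargs=dict(eps=1e-5)),  # optimizer epsilon
    verbose=1,
)
model.learn(total_timesteps=10_000_000)  # placeholder budget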

We added a constant 1 to the normalized ngu_rnd_error (ngu.py, L149) following the original definition in the NGU paper (see the paragraph "Integrating life-long curiosity" on page 4), which says:

We then define the modulator α_t as a normalized mean squared error, as done in Burda et al. (2018b): α_t = 1 + (err(x_t) - μ_e) / σ_e, where σ_e and μ_e are running standard deviation and mean for err(x_t).

The exact reason for adding 1 is not explicitly explained in NGU's paper. According to our understanding, ngu_rnd_error (i.e., err(x_t)) after normalization (ngu.py, L139-L147) should roughly follow a normal distribution with zero mean, so the mean of 1 + ngu_rnd_error is roughly 1. Given that, lifelong_reward = min(max(ngu_lifelong_rewards[env_id], 1.0), L) can be seen as mapping all positive normalized errors to rewards greater than 1 but smaller than L=5, while mapping all non-positive errors to exactly 1.
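
Putting the two steps together, the logic roughly amounts to the following (a simplified sketch; the running mean and standard deviation from the actual implementation are replaced by plain batch statistics for brevity):

import numpy as np

L = 5.0  # upper bound on the life-long modulator, as in the NGU paper

def lifelong_modulator(rnd_errors):
    # Normalize the RND prediction errors and shift them by 1, so the
    # modulator is roughly centred at 1 (alpha_t = 1 + (err - mu) / sigma).
    mu, sigma = rnd_errors.mean(), rnd_errors.std() + 1e-8
    alpha = 1.0 + (rnd_errors - mu) / sigma
    # Non-positive normalized errors map to exactly 1; positive ones are
    # rewarded in proportion to their magnitude, capped at L.
    return np.clip(alpha, 1.0, L)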

swan-utokyo commented 1 year ago

Again, thanks for your patience. We quickly ran 8 runs each for NovelD and ICM in Montezuma's Revenge, using the same hyperparameters as in our previous preliminary experiments. Their results are merged with those of the other methods and shown in the figure below.

hlsafin commented 1 year ago

Thank you again for these charts; I know they aren't easy to produce since training can be quite time-consuming.

I tried to replicate some of your results on MiniGrid and pretty much got the same results. Any reason why NGU and DEIR only go up to 2500 and not beyond? Is this mostly because this is an on-policy method without a replay buffer?

swan-utokyo commented 1 year ago

Thank you for taking the time to replicate our experiments on MiniGrid, and we are happy to hear it worked!

With sufficient training samples and an appropriate training scheme, we believe that both NGU and DEIR can reach scores above 2500 in Montezuma's Revenge. Being an on-policy method without a replay buffer may be an important reason why more samples are required, but it is hard to say it is the only one. Considering that the original implementations of RND and NovelD are both based on PPO and have reached 10000+ scores, fine-tuning other hyperparameters and training schemes should also help (e.g., learning rates, number of frames per rollout, normalization schemes for intrinsic rewards and advantages, etc.). Although we tried to make each method perform as well as possible in our preliminary experiments, please note that we currently don't have enough time to ensure our hyperparameters and training schemes are optimal.

hlsafin commented 11 months ago

Thank you for your response. I had to change line 181 of base_model.py inside the "_get_rnd_embeddings" function from gru_mems = self._get_rnn_embeddings(mems, cnn_embs, self.policy_rnns) to gru_mems = self._get_rnn_embeddings(mems, cnn_embs, self.model_rnns). Can you please verify whether this is correct?

Also, if possible, can you share the Atari training portion as well?

swan-utokyo commented 11 months ago

If you changed L181, you may also need to change L178-179.

In our current implementation, line 181 of base_model.py should only be reached when the options rnd_use_policy_emb (L177) and use_model_rnn (L180) are enabled.

In case you wish to train a separate CNN and RNN for RND agents, you can simply specify --rnd_use_policy_emb=0 --use_model_rnn=1 in your command. Please let me know if I misunderstood your purpose.

As for the Atari training portion, we implemented and tested it in a private repository. We'll try to resolve the conflicts and release it sooner rather than later, but it may take some time due to other priorities. Thanks for your understanding.

AnneZhu1020 commented 11 months ago

Hi, this is very interesting work. I have tried NovelD on the env MultiRoom-N6. When rnd_use_policy_emb and use_model_rnn are both 1, I get: AttributeError: 'NovelDModel' object has no attribute 'policy_rnns'. When I tried --rnd_use_policy_emb=0 --use_model_rnn=1, the obtained return is only about 0.2. Can you give some instructions on how to reproduce the NovelD results?

swan-utokyo commented 11 months ago

@AnneZhu1020 Thank you for your interest and for letting us know about the error message! We are sorry that we accidentally removed the policy_rnns passed into intrinsic reward models (including NovelD) during code refactoring. It has been fixed, so you may now reuse the policy's RNN to train RND-based models.

I tested training NovelD locally with the following command. NovelD agents reached an average episodic return of 0.606925 within about 0.5 million frames in MultiRoom-N6, with both the policy's CNN and RNN reused. (Please note that results may still vary to some extent across devices.)

PYTHONPATH=./ python3 src/train.py \
  --int_rew_source=NovelD  --env_source=minigrid  --game_name=MultiRoom-N6 \
  --rnd_use_policy_emb=1  --use_model_rnn=1 \
  --features_dim=128  --model_features_dim=128  --model_latents_dim=128 \
  --int_rew_coef=3e-2  --rnd_err_norm=0

In the above command, --rnd_use_policy_emb=1 --use_model_rnn=1 reuses the policy's CNN and RNN for NovelD. With those two options enabled, the number of features input into NovelD's MLP must equal the number of features output by the policy's CNN and RNN (otherwise, additional FC layers are needed). Thus, we explicitly specify --features_dim=128 --model_features_dim=128 --model_latents_dim=128 to define the feature dimensions.
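
In code terms, the constraint is roughly the following (an illustrative sketch only, not the actual model definition in this repository):

import torch.nn as nn

features_dim = 128        # --features_dim: output size of the policy's CNN/RNN
model_features_dim = 128  # --model_features_dim: input size of NovelD's MLP

# When the policy's embeddings are reused, the two sizes must match;
# otherwise an extra projection (FC) layer would be needed in between.
assert model_features_dim == features_dim

noveld_mlp = nn.Sequential(
    nn.Linear(model_features_dim, 128),  # 128 corresponds to --model_latents_dim
    nn.ReLU(),
    nn.Linear(128, 128),
)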

As for --int_rew_coef=3e-2 --rnd_err_norm=0, these are hyperparameter values taken from Table A1 (in the appendix of our arXiv preprint). Since the default value of each option in train.py is optimized for DEIR, when training other methods please find and use the corresponding hyperparameter values specified in Table A1. We are sorry for the inconvenience and will try to provide pre-defined config files for all methods later.

hlsafin commented 8 months ago

Can you provide some of your parameters for Montezuma's Revenge? Also, for a harder exploration game like Pitfall!, what parameters should one try? It seems like your batch sizes and n_steps are quite large (e.g., 1024); any reason why? How did this play into exploration?

swan-utokyo commented 8 months ago

@hlsafin Thanks for your questions. For your reference, the following are the major options we used for testing DEIR's performance in Montezuma's Revenge:

--game_name=MontezumaRevenge-v5 --atari_max_steps=4500 --atari_gray_scale=1 --atari_img_size=64 --atari_stack_num=1 --episodic_life=0 --frame_skip=8 --use_inter_area_resize=0 --repeat_act_prob=0.0 --atari_clip_rew=1
--num_processes=128 --n_steps=128 --batch_size=512 --learning_rate=3e-5 --model_learning_rate=1e-3 --gamma=0.999 --gae_lambda=0.95 --pg_coef=1.0 --vf_coef=0.5 --ent_coef=1e-2 --clip_range=0.1 --clip_range_vf=0.1 --n_epochs=3 --model_n_epochs=3
--ext_rew_coef=10 --int_rew_coef=1e-3 --int_rew_norm=3 --int_rew_momentum=-1 --int_rew_source=DEIR --use_model_rnn=1 --adv_norm=1 --adv_eps=1e-5 --adv_momentum=0.9
--features_dim=512 --latents_dim=512 --policy_cnn_type=1 --model_cnn_type=1 --model_features_dim=512 --model_latents_dim=512
--policy_cnn_norm=LayerNorm --policy_mlp_norm=NoNorm --policy_gru_norm=NoNorm --model_cnn_norm=LayerNorm --model_mlp_norm=NoNorm --model_gru_norm=NoNorm --policy_mlp_layers=1 --model_mlp_layers=1 --gru_layers=1

Since we haven't experimented with Pitfall!, we are sorry that we cannot provide constructive suggestions for you at this moment.

We decided to use a larger mini-batch size when training ProcGen agents mainly because it was suggested by previous studies [Cobbe et al., 2020 and 2021], and we also observed non-trivial performance improvements during training when using a larger mini-batch size (e.g., 2048 samples per batch).

We also used a larger n_steps in MiniGrid games because we empirically found this could help us get more stable reward normalization results for samples obtained in the same episode but across different RL rollouts.

hlsafin commented 8 months ago

Thank you. There seem to be a lot of parameters/hyperparameters to choose from. How does one go about picking one over another for different types of environments? What is your thought process for choosing them? Which key parameters play a significant, or a detrimental, role in the success of an agent in a given environment?

swan-utokyo commented 8 months ago

Again, thank you for your question. I understand tuning hyper-parameters is tedious and difficult, especially in sparse reward RL tasks. I would suggest trying to use the same parameters as prior studies first, especially when reproducing PPO/R2D2 baselines, and then gradually adjusting hyper-parameters related to the generation of intrinsic exploration rewards.

Usually, simpler environments require fewer training resources, so I always start with simpler environments to validate parameters first and, once successful, move on to more challenging environments. This is why we experimented with procedurally generated environments like MiniGrid and ProcGen; Atari games don't offer that kind of difficulty customization, which can increase the difficulty and cost of experiments.

Regarding the hyper-parameters I found important: the coefficients for the intrinsic and extrinsic rewards are the most crucial, and the optimal ratio may vary across environments. Additionally, in long-horizon games a larger gamma (discount factor) is usually needed, but too large a gamma can also slow down learning. Similarly, lower learning rates and clip ranges usually work better in more difficult tasks, but the optimal values still need to be determined through experimentation.
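
To make the coefficient point concrete, the reward the agent ultimately optimizes is essentially a weighted sum of the two streams, roughly as sketched below (illustration only; the default values mirror the --ext_rew_coef=10 and --int_rew_coef=1e-3 options listed earlier in this thread):

def combined_reward(ext_reward, int_reward, ext_rew_coef=10.0, int_rew_coef=1e-3):
    # The ratio between the two coefficients is what usually needs
    # per-environment tuning; the values here mirror the Montezuma's
    # Revenge options above.
    return ext_rew_coef * ext_reward + int_rew_coef * int_reward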

hlsafin commented 8 months ago

Thank you for your elaboration; I will take note of this in further experiments. I've been running experiments on the Solaris and Pitfall! environments with your hyperparameters, and what I've found to be the most important indicator of learning is the max step size. I tried Solaris with 4500 steps and it learned very good policies in a short amount of time. However, when I reset max_step to the default value, which I believe is 108,000 (roughly 30 minutes of play), it failed to learn anything. Even the difference between a max_step of 4500 and 8500 was massive: with a max step of 8500 it stayed around 2k reward and barely made any improvement over the same number of steps. Am I doing something wrong here? I haven't experimented with NovelD or NGU, but I presume I would get similar results. Exploration over long horizons is still a hard problem; is this assessment correct? Could you give any advice for getting decent results on this, or provide some better parameters?