swan-utokyo / deir

DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards

The rewards cannot be obtained in the MiniGrid-ObstructedMaze environment. #2

MurrayMa0816 opened this issue 4 months ago

MurrayMa0816 commented 4 months ago

Hi @swan-utokyo, I have also been focusing on the sparse-reward problem recently, and your DEIR work has given me a lot of inspiration. I am planning to build further research on it. However, when using your code, I ran into the following problems:

  1. In MiniGrid, I can reproduce the results for the various sizes of the MultiRoom and KeyCorridor environments. However, in the MiniGrid-ObstructedMaze environments (1Dlh, 2Dlh, 1Dlhb, 2Dlhb, 1Q, 2Q, and Full), using the same code and parameters as for MultiRoom and KeyCorridor, no rewards are obtained even after training for over 5e7 steps, whereas Figure 5 of the paper shows convergence in the 'Full' environment within 5e7 steps. How can I reproduce the results in the MiniGrid-ObstructedMaze environments? Do I need parameter settings different from those used for MultiRoom and KeyCorridor?
  2. I noticed that you have also implemented the NovelD algorithm separately. With NovelD I encountered the same issue of obtaining no results in the MiniGrid-ObstructedMaze environments (1Dlhb, 2Dlhb, 1Q, 2Q, and Full). I adjusted the hyperparameters according to Appendix A.4, but still could not get any results.

I hope to get some guidance and advice from you. Thank you very much.

swan-utokyo commented 4 months ago

@MurrayMa0816 Thank you for your interest in the sparse reward problem and our research.

As you surmised, we employed special hyperparameters for DEIR and NovelD in ObstructedMaze-Full that differ from those used in the other MiniGrid environments. They can be found in Table A1 (in the Appendix) of the arXiv version of our paper [link]. Within the limits of our computational resources, we searched over as many hyperparameter settings as possible to ensure that both DEIR and NovelD perform at their best. We have not conducted experiments in the other ObstructedMaze tasks (such as 1Dlh, 2Dlh, etc.), but we believe suitable hyperparameters for those tasks can be found by starting from the values given in Table A1 and adjusting them.

Compared to simpler tasks like DoorKey and KeyCorridor, different hyperparameters are needed mainly because of the increased game difficulty in ObstructedMaze-Full, in which the agent must perform more actions in specific orders to obtain the final sparse reward. Under this condition, we found it necessary to increase sample diversity and to improve the agent's policy more gradually, so that it does not converge to a local optimum too early. This is especially important when the normalized intrinsic rewards include negative values and the extrinsic rewards become sparser. If learning proceeds too quickly, the agent may avoid active exploration and instead opt to end the game early in environments where that is possible (as in some Atari games).
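
To make the normalization mentioned above concrete: the sketch below shows the kind of running mean/variance scheme that flags such as --int_rew_norm and --int_rew_momentum refer to. It is a simplified illustration rather than the exact implementation in this repository, and the class name RunningIntRewardNorm is made up for this example.

import numpy as np

class RunningIntRewardNorm:
    """Running mean/variance normalizer for intrinsic rewards (illustrative only)."""

    def __init__(self, momentum=0.99, eps=1e-8):
        # A momentum close to 1.0 (cf. --int_rew_momentum=0.99) makes the
        # statistics change slowly, keeping the reward scale stable across rollouts.
        self.momentum = momentum
        self.eps = eps
        self.mean = 0.0
        self.var = 1.0

    def update(self, rewards):
        batch_mean = float(np.mean(rewards))
        batch_var = float(np.var(rewards))
        self.mean = self.momentum * self.mean + (1.0 - self.momentum) * batch_mean
        self.var = self.momentum * self.var + (1.0 - self.momentum) * batch_var

    def normalize(self, rewards):
        # Standardization can yield negative values, which is the situation
        # discussed above (negative rewards in the normalized intrinsic rewards).
        return (np.asarray(rewards) - self.mean) / np.sqrt(self.var + self.eps)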

If you encounter any issues while reproducing our work or have different views, feel free to leave your messages here for discussion. Thanks.

swan-utokyo commented 4 months ago

@MurrayMa0816 We recently retrained our model and confirmed that the reported results can be reproduced using the hyperparameters specified in Table A1. However, it is possible that some hyperparameters not listed in the table are also critical to the experimental results. Therefore, we provide below the complete commands used to train DEIR and NovelD for your reference.

DEIR

PYTHONPATH=./ python3 src/train.py \
    --run_id=0 --game_name=ObstructedMaze-Full --project_name=DEIR-MiniGridExps \
    --env_source=minigrid --total_steps=60000000 \
    --num_processes=64 --n_steps=256 --batch_size=2048 \
    --learning_rate=1e-4 --model_learning_rate=1e-4 \
    --gamma=0.99 --gae_lambda=0.95 --pg_coef=1.0 --vf_coef=0.5 \
    --clip_range=0.2 --clip_range_vf=0.2 --ent_coef=5e-4 \
    --optimizer=adam --adam_beta1=0.9 --optim_eps=1e-5 \
    --n_epochs=3 --model_n_epochs=3 --activation_fn=relu --cnn_activation_fn=relu \
    --adv_norm=0 --adv_eps=1e-5 --adv_momentum=0.9 \
    --int_rew_coef=1e-3 --int_rew_norm=1 --int_rew_momentum=0.99 --ext_rew_coef=10 \
    --int_rew_source=DEIR --use_model_rnn=1 --dsc_obs_queue_len=100000 \
    --features_dim=256 --latents_dim=256 --policy_cnn_type=0 \
    --model_features_dim=256 --model_latents_dim=256 --model_cnn_type=0 \
    --policy_cnn_norm=LayerNorm --policy_mlp_norm=NoNorm --policy_gru_norm=NoNorm \
    --model_cnn_norm=LayerNorm --model_mlp_norm=NoNorm --model_gru_norm=NoNorm \
    --policy_mlp_layers=1 --model_mlp_layers=1 --gru_layers=1 \
    --enable_plotting=0 --use_status_predictor=0 --write_local_logs=1 \
    --group_name=OMFull_DEIR_ER10_IR1e-3_.99_LR1e-4_Cpu64_Stp256_Btc2k_Mlp1l_VClp.2_Ent5e-4

NovelD

PYTHONPATH=./ python3 src/train.py \
    --run_id=0 --game_name=ObstructedMaze-Full --project_name=DEIR-MiniGridExps \
    --env_source=minigrid --total_steps=60000000 \
    --num_processes=64 --n_steps=256 --batch_size=2048 \
    --learning_rate=1e-4 --model_learning_rate=1e-4 \
    --gamma=0.99 --gae_lambda=0.95 --pg_coef=1.0 --vf_coef=0.5 \
    --clip_range=0.2 --clip_range_vf=0.2 --ent_coef=5e-4 \
    --optimizer=adam --adam_beta1=0.9 --optim_eps=1e-5 \
    --n_epochs=3 --model_n_epochs=3 --activation_fn=relu --cnn_activation_fn=relu \
    --adv_norm=0 --adv_eps=1e-5 --adv_momentum=0.9 \
    --int_rew_coef=3e-3 --int_rew_norm=1 --int_rew_momentum=0.99 --ext_rew_coef=10 \
    --int_rew_source=NovelD --use_model_rnn=0 --dsc_obs_queue_len=100000 \
    --rnd_use_policy_emb=1 --rnd_err_norm=1 --rnd_err_momentum=-1 \
    --features_dim=256 --latents_dim=256 --policy_cnn_type=0 \
    --model_features_dim=256 --model_latents_dim=256 --model_cnn_type=0 \
    --policy_cnn_norm=LayerNorm --policy_mlp_norm=NoNorm --policy_gru_norm=NoNorm \
    --model_cnn_norm=LayerNorm --model_mlp_norm=NoNorm --model_gru_norm=NoNorm \
    --policy_mlp_layers=1 --model_mlp_layers=1 --gru_layers=1 \
    --enable_plotting=0 --use_status_predictor=0 --write_local_logs=1 \
    --group_name=OMFull_NovelD_ER10_IR3e-3_.99_LR1e-4_Cpu64_Stp256_Btc2k_Mlp1l_VClp.2_Ent5e-4

MurrayMa0816 commented 4 months ago

Hi @swan-utokyo, thank you very much for providing the detailed parameters, which will help my reproduction and learning a lot. Additionally, when fine-tuning the hyperparameters, did you encounter highly unstable reward curves? Did the rewards ever suddenly drop to zero even after they had converged? Which parameters are typically most correlated with such occurrences? Thank you very much for your assistance.

swan-utokyo commented 4 months ago

@MurrayMa0816 Yes, we encountered the issue you mentioned in certain environments, where the episodic return may suddenly drop during training or even after convergence.

We haven't investigated which hyperparameters are specifically related to this. Still, in my experience, besides the learning rate, the coefficient of the maximum-entropy loss and the normalization methods (including normalization of model layers, intrinsic rewards, and advantages) are the most likely factors. They are meant to facilitate exploration and convergence, but when outlier values appear in the rollout samples, they can also cause the agent's policy to fluctuate violently.

If this issue has a significant impact on your experiments, I suggest lowering the coefficient of the maximum-entropy loss, clipping the normalized values, or disabling some normalization methods when necessary.
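
For example, clipping could be as simple as the snippet below. This is a simplified illustration, not code from our repository, and the ±5 range is only an example value.

import numpy as np

def clip_normalized_values(values, clip_value=5.0):
    # Bound normalized intrinsic rewards (or advantages) so that a single
    # outlier rollout cannot destabilize the policy update.
    return np.clip(values, -clip_value, clip_value)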

MurrayMa0816 commented 4 months ago

@swan-utokyo, got it, thanks for sharing your valuable experience.

MurrayMa0816 commented 4 months ago

Hi @swan-utokyo, thanks for providing the parameters. I have obtained the results shown in Fig 5, but there are still a few details I would like to ask you about. Thank you very much:

  1. Is the "Mean episodic return" in Fig 5 of the paper taken from the values in "rollout/ep_info_rew_mean"?

  2. My result is shown in the figure below, and there are some discrepancies compared to those presented in the paper. Do we need to do some processing of the data in "rollout/ep_info_rew_mean", such as smoothing? [Screenshot 2024-03-11 105405]

  3. Is Fig 5 based on the results of multiple random seeds? Have you encountered different random seeds producing completely different results, such as oscillation or failure to converge at all? Is there anything to pay attention to when selecting random seeds?

swan-utokyo commented 4 months ago

Hi @MurrayMa0816, I am glad that you've been able to obtain the results shown in Fig.5.

As for your further questions: we do use rollout/ep_info_rew_mean when computing mean episodic returns. It is identical to the rollout/ep_rew_mean metric in the Stable-Baselines3 library (see the references below), which by default reports the mean return of the most recent 100 episodes. So you may consider the metric itself to be already smoothed. (The ep_rew_mean in our code is a new metric we defined to calculate the mean return of the episodes that end within one rollout.)
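
As a rough illustration of the "mean of the most recent 100 episodes" behaviour, here is a simplified sketch of such a rolling metric. It mirrors the idea only, not the exact Stable-Baselines3 implementation or our code; the class name RollingEpisodeReturn is made up for this example.

from collections import deque

import numpy as np

class RollingEpisodeReturn:
    """Mean return over the most recent `window` finished episodes (illustrative only)."""

    def __init__(self, window=100):
        self.returns = deque(maxlen=window)

    def add_episode(self, episode_return):
        # Called once per finished episode; older entries fall out automatically.
        self.returns.append(float(episode_return))

    def mean(self):
        # Average of up to the last `window` completed episodes.
        return float(np.mean(self.returns)) if self.returns else float("nan")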

Certainly, you may further adjust the smoothness of the curve according to your needs, but I believe there are two other factors that are more critical for reproduction experiments.

  1. The number of random seeds: As you mentioned, different random seeds can produce very different results. In our paper, we ran the ObstructedMaze-Full tasks with 12 random seeds (i.e., run_id = 0, 1, ..., 11) and presented the mean episodic return of all 12 runs with standard error in Fig.5 (a minimal aggregation sketch is shown after this list). Generally speaking, the more random seeds you experiment with, the more stable and reliable your mean return will be, so we recommend running with more random seeds and comparing their average returns. However, we don't recommend manually choosing random seed values, as this could introduce human bias into the experimental results. If you're interested in this issue, you may want to refer to this paper for more detailed suggestions.

  2. Hardware devices and library versions: In our experiments, we noticed that differences in hardware and library versions (e.g., CPU/GPU, PyTorch/NumPy/CUDA driver versions) can lead to different experimental results. An experiment can only be reproduced with no discrepancies at all when identical hardware devices and software configurations are used. Nonetheless, when training with a sufficient number of random seeds, the conclusion of an experiment should remain consistent, i.e., a model's performance and convergence speed relative to its baseline should be roughly the same even across different hardware and library configurations.
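
Regarding point 1, the sketch below shows one simple way to aggregate the curves from several seeds (e.g., run_id = 0..11) into a mean with standard error, as plotted in Fig.5. It assumes each seed's return curve has already been exported as an array evaluated at the same training steps; it is an illustration, not code from our repository.

import numpy as np

def mean_and_standard_error(per_seed_returns):
    """Aggregate per-seed learning curves into mean +/- standard error.

    per_seed_returns: array of shape (n_seeds, n_points), e.g. 12 runs
    evaluated at the same training steps.
    """
    per_seed_returns = np.asarray(per_seed_returns, dtype=float)
    mean = per_seed_returns.mean(axis=0)
    # Standard error of the mean = sample standard deviation / sqrt(number of seeds)
    sem = per_seed_returns.std(axis=0, ddof=1) / np.sqrt(per_seed_returns.shape[0])
    return mean, sem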