uoe-agents / epymarl

An extension of the PyMARL codebase that includes additional algorithms and environment support
Apache License 2.0

Problem reproducing LBF results #3

Closed: to1a closed this issue 2 years ago

to1a commented 3 years ago

Thanks for your work.

I am trying to train algorithms on the LBF environment, but when I test them I find that the results are significantly worse than those in *Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks*. For example, when I run 10 seeds of VDN on 15x15-3p-5f for 2M steps, none of the seeds achieves a return higher than 0.2, while the average return in the paper is 0.58. I wonder if I have mistaken any essential parameters. Here is one of my config.json files:

{ "action_selector": "epsilon_greedy", "agent": "rnn", "agent_output_type": "q", "batch_size": 32, "batch_size_run": 1, "buffer_cpu_only": true, "buffer_size": 5000, "checkpoint_path": "", "double_q": true, "env": "gymma", "env_args": { "key": "lbforaging:Foraging-15x15-3p-5f-v1", "pretrained_wrapper": null, "time_limit": 50 }, "epsilon_anneal_time": 200000, "epsilon_finish": 0.05, "epsilon_start": 1.0, "evaluate": false, "evaluation_epsilon": 0.0, "gamma": 0.99, "grad_norm_clip": 10, "hidden_dim": 128, "hypergroup": null, "label": "default_label", "learner": "q_learner", "learner_log_interval": 10000, "load_step": 0, "local_results_path": "results", "log_interval": 50000, "lr": 0.0003, "mac": "basic_mac", "mixer": "vdn", "name": "vdn", "obs_agent_id": true, "obs_individual_obs": false, "obs_last_action": false, "optim_alpha": 0.99, "optim_eps": 1e-05, "repeat_id": 1, "runner": "episode", "runner_log_interval": 10000, "save_model": true, "save_model_interval": 50000, "save_replay": false, "seed": 291174067, "standardise_rewards": true, "t_max": 2050000, "target_update_interval_or_tau": 0.01, "test_greedy": true, "test_interval": 50000, "test_nepisode": 100, "use_cuda": true, "use_rnn": true, "use_tensorboard": true }

semitable commented 3 years ago

Thanks for spotting this. We have been crunching some numbers today and agree that there might be a discrepancy between our results and the ones shown in the draft version of the NeurIPS paper. We think the numbers in the table are for 20M training timesteps instead of 2M (this only affects off-policy algorithms in LBF), so we plan to update them before the camera-ready deadline (after double-checking that this is indeed the problem).

Therefore, if you want to get better policies, I would recommend increasing "t_max" to 20050000.
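If it helps, here is a minimal sketch of applying that change to a local copy of the config file. The filename and the in-place rewrite are just one way to do it; use whatever mechanism you normally use to override parameters.

```python
import json

# Minimal sketch, assuming you keep a local copy of the config
# (here named "config.json"): raise t_max so the off-policy runs
# get the full 20M environment steps instead of 2M.
with open("config.json") as f:
    cfg = json.load(f)

cfg["t_max"] = 20050000  # 20M training steps, with the same 50k tail as before

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```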

Thanks again and sorry for the inconvenience. Thankfully, we'll have enough time to double-check and update everything before the camera-ready version of the paper.

I am keeping this issue open until it is fully resolved.

to1a commented 3 years ago

Thanks for your immediate reply. After reading the numbers in the table as 20M-timestep results, I think my results on LBF are consistent with your paper.

semitable commented 2 years ago

Closing since the camera-ready and arXiv versions have been updated with the correct numbers.