uoe-agents / epymarl

An extension of the PyMARL codebase that includes additional algorithms and environment support

network type configuration #24

Closed miguel-arrf closed 2 years ago

miguel-arrf commented 2 years ago

Hi!

I'm trying to train the 'rware:rware-tiny-2ag-v1' environment with MAPPO, following the hyperparameter tables from the 'Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks' paper!

One of the configurations is the network type. From the paper I understand that two types of networks are used, GRU and FC, but I can't find where to set the type of network I want to use in EPyMARL.

Thanks in advance!

semitable commented 2 years ago

Hi!

The GRU is used if you enable the recurrent option in the configuration: just set use_rnn: True. If it is False, the default FC network is used instead.
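
In the algorithm config YAML that looks like this (a minimal sketch showing only the relevant field):

use_rnn: True    # GRU (recurrent) agent network
# use_rnn: False # keeps the default fully-connected (FC) agent network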

miguel-arrf commented 2 years ago

Thank you for the help!

Also, I'm not sure if I should create a new issue, or if this is even the right place to ask, but:

I'm trying to replicate the MAPPO results for RWARE "tiny 2p", but after 6 million steps the mean return I obtain is 0.06.

Looking at the image below, it should already be around 5 (or something similar) and not 0.06. (The legend on the graph says the scale is 1e6, but from what I understood from another issue, it is actually 1e7.)

[Screenshot: return curve for RWARE tiny 2p from the paper]
# --- MAPPO specific parameters ---

action_selector: "soft_policies"
mask_before_softmax: True

runner: "parallel"

buffer_size: 10
batch_size_run: 10
batch_size: 10

env_args:
  state_last_action: False # critic adds last action internally

# update the target network every {} training steps
target_update_interval_or_tau: 200

lr: 0.0005

obs_agent_id: True
obs_last_action: False
obs_individual_obs: False

agent_output_type: "pi_logits"
learner: "ppo_learner"
entropy_coef: 0.001
use_rnn: False
standardise_returns: True
standardise_rewards: False
q_nstep: 10 # 1 corresponds to normal r + gamma * V
critic_type: "cv_critic"
epochs: 4
eps_clip: 0.2
name: "mappo"

t_max: 40000000

# Added by me:
save_model: True # Save the models to disk
save_model_interval: 50000 # Save models after this many timesteps
hidden_dim: 128 # Size of hidden state for default rnn agent

I'm using the configuration above; am I missing something? :(

I'm running with:

 python src/main.py --config=mappo --env-config=gymma with env_args.time_limit=500 env_args.key="rware:rware-tiny-2ag-v1"

Thanks!

semitable commented 2 years ago

Hi! That's right, as we mention in the paper, all the on-policy algorithms have their timesteps scaled down by a factor of 10. Essentially, out of the 10 parallel environments, we only count the timesteps from one. This way the comparison against algorithms that use a replay buffer is fairer (and the number of backprop steps stays equal).

miguel-arrf commented 2 years ago

Thank you!

The problem is that the results I'm getting aren't remotely as good as yours :/. I'm probably not setting something up the way you did.

Thanks!

semitable commented 2 years ago

Sorry if I was unclear: we scaled those numbers in the paper, but it does not happen automatically in EPyMARL. Did you try letting it run 10 times longer? Your plot only goes to 4e6, but it should go to 4e7. If the problem persists, it would be better to open a new issue to resolve it. Thanks
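
As a concrete check with the numbers from this thread: 4e6 per-environment steps on the paper's x-axis, times the 10 parallel environments, gives the 4e7 total timesteps that EPyMARL counts, which is what t_max: 40000000 in the config above corresponds to.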

semitable commented 2 years ago

Sorry, I just saw this is our plot and not yours. :) Still, make sure you have run the full number of steps.

Also, I noticed a couple of discrepancies between your hyperparameters and the ones presented in the paper. For instance, we set "standardise_returns/rewards" to False for MAPPO/RWARE. Also, we use tau=0.01 (soft updates) while you have 200 (hard updates). Please consult Table 24 in the paper for the exact parameters. Thanks again!
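
In the MAPPO config posted above, those corrections would look roughly like this (a sketch covering only the fields mentioned in this comment, not the full Table 24):

standardise_returns: False
standardise_rewards: False
target_update_interval_or_tau: 0.01 # tau for soft updates, instead of a hard update every 200 steps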

miguel-arrf commented 2 years ago

Hi!

Yeah, that was it! :P The results are now very similar to the published ones! I wanted to get them working first because I'm tinkering with the environment to try some new mechanics :).

Once again, thank you so much for the help!