nsidn98 / InforMARL

Code for our paper: Scalable Multi-Agent Reinforcement Learning through Intelligent Information Aggregation
https://nsidn98.github.io/InforMARL/
MIT License

Understanding the effects of n_rollout_threads #20

Closed Michael-Elrod-dev closed 6 days ago

Michael-Elrod-dev commented 1 week ago

Hello again,

I have been using this repo for study and testing for some time now (with the navigation_graph scenario), and earlier this week I ran into an interesting issue.

To make testing/debugging easier, I set n_rollout_threads to 1, removing the parallel rollouts so that it is a little easier to track environment data over time.

However, when I do this, the network completely loses its ability to converge towards any kind of optimal policy. I do not understand why, since I expected the parallel rollouts to only affect the speed of training, not its effectiveness.

For more context, here is the command I used without parallel rollouts, where n_rollout_threads=1 and the --auto_mini_batch_size flag has been removed in order to maintain the target batch size of 128:

python -u onpolicy/scripts/train_mpe.py --use_valuenorm --use_popart --project_name "GNN-Testing" --env_name "GraphMPE" --algorithm_name "rmappo" --seed 0 --experiment_name "baseline" --scenario_name "navigation_graph" --num_agents 3 --collision_rew 5 --n_training_threads 1 --n_rollout_threads 1 --num_mini_batch 1 --episode_length 25 --num_env_steps 2000000 --ppo_epoch 10 --use_ReLU --gain 0.01 --lr 7e-4 --critic_lr 7e-4 --user_name "......" --use_cent_obs "False" --graph_feat_type "relative" --target_mini_batch_size 128 --use_wandb "False"

Here is the command I typically use, where training converges successfully:

python -u onpolicy/scripts/train_mpe.py --use_valuenorm --use_popart --project_name "GNN-Testing" --env_name "GraphMPE" --algorithm_name "rmappo" --seed 0 --experiment_name "baseline" --scenario_name "navigation_graph" --num_agents 3 --collision_rew 5 --n_training_threads 1 --n_rollout_threads 128 --num_mini_batch 1 --episode_length 25 --num_env_steps 2000000 --ppo_epoch 10 --use_ReLU --gain 0.01 --lr 7e-4 --critic_lr 7e-4 --user_name "......." --use_cent_obs "False" --graph_feat_type "relative" --auto_mini_batch_size --target_mini_batch_size 128 --use_wandb "False"
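
For reference, here is a rough back-of-the-envelope comparison of the two configurations. It assumes (as in typical MAPPO-style on-policy buffers) that one PPO update consumes roughly n_rollout_threads × episode_length × num_agents agent-steps of freshly collected data; the exact bookkeeping in this repo may differ slightly:

```python
# Hypothetical arithmetic (not taken from the repo): assume one PPO update
# trains on n_rollout_threads * episode_length * num_agents agent-steps.
num_agents = 3
episode_length = 25

for n_rollout_threads in (128, 1):
    samples_per_update = n_rollout_threads * episode_length * num_agents
    print(f"n_rollout_threads={n_rollout_threads}: {samples_per_update} agent-steps per update")

# n_rollout_threads=128: 9600 agent-steps per update
# n_rollout_threads=1: 75 agent-steps per update
```

If that assumption matches the buffer layout here, the single-thread run updates the policy from only 75 agent-steps at a time, all drawn from a single trajectory, regardless of the target_mini_batch_size setting.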

Using the first command, the average reward over time isn't even oscillating; it is just random.

Do you have any insight on why this is happening?

nsidn98 commented 1 week ago

Yes, we had a similar experience with a lower number of rollout threads. You can find some explanation for it here (see the "decorrelating experience" section).
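
As a rough standalone illustration of the decorrelation point (a toy random-walk sketch, not code from this repo): samples drawn consecutively from one rollout are highly autocorrelated, while samples drawn across many independent parallel rollouts are nearly independent.

```python
import numpy as np

# Toy sketch: a 1-D random walk stands in for an environment trajectory.
# Consecutive observations within one rollout are strongly correlated;
# observations taken across independent parallel rollouts are not.

def rollout(rng, length):
    # One sequential trajectory: position follows a random walk.
    return np.cumsum(rng.normal(size=length))

# "1 rollout thread": a batch of 128 consecutive samples from one trajectory.
single = rollout(np.random.default_rng(0), 128)
corr_single = np.corrcoef(single[:-1], single[1:])[0, 1]

# "128 rollout threads": one sample per independent trajectory.
parallel = np.array([rollout(np.random.default_rng(seed), 25)[-1]
                     for seed in range(1, 129)])
corr_parallel = np.corrcoef(parallel[:-1], parallel[1:])[0, 1]

print(f"lag-1 correlation, single rollout   : {corr_single:.2f}")   # close to 1
print(f"lag-1 correlation, parallel rollouts: {corr_parallel:.2f}")  # close to 0
```

With a single rollout thread, every sample in the update batch is of the first kind, which is presumably why keeping the nominal batch size at 128 does not help on its own.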

Michael-Elrod-dev commented 6 days ago

How interesting, thank you for sharing!