rstrivedi / Melting-Pot-Contest-2023

Apache License 2.0

NaN episode rewards during baseline training #14

Open itstyren opened 1 year ago

itstyren commented 1 year ago

Problem: I am encountering an issue while running the MeltingPot baseline Ray training model. The episode rewards I am getting are consistently NaN (Not-a-Number).

[screenshot: training metrics with episode rewards logged as NaN]

Steps to Reproduce: run `python baselines/train/run_ray_train.py --num_gpus 1 --wandb True`. The training args are set as:

        # training
        "seed": args.seed,
        "rollout_fragment_length": 5,  # Divide episodes into fragments of this many steps each during rollouts.
        "train_batch_size": 40,  # Trajectories of this size (num_fragments * rollout_fragment_length) are collected from rollout workers and combined into a larger batch for learning.
        "sgd_minibatch_size": 32,  # PPO further divides the train batch into minibatches for multi-epoch SGD.
        "disable_observation_precprocessing": True,
        "use_new_rl_modules": False,
        "use_new_learner_api": False,
        "framework": args.framework,  # torch or tensorflow

        # agent model
        "fcnet_hidden": (4, 4),  # Layer sizes of the fully connected torso.
        "post_fcnet_hidden": (16,),  # Layer sizes after the fully connected torso.
        "cnn_activation": "relu",
        "fcnet_activation": "relu",
        "post_fcnet_activation": "relu",
        # == LSTM ==
        "use_lstm": True,
        "lstm_use_prev_action": True,
        "lstm_use_prev_reward": False,
        "lstm_cell_size": 2,  # Hidden-state size of the LSTM cell.
        "shared_policy": False,

Please let me know if there's any additional information or logs needed to diagnose this issue. Thank you for your assistance in resolving this problem.

emanueltewolde commented 1 year ago

I get the same issue, except I only get one NaN data point in total instead of one per step. If I run the evaluation script, however, it reports positive rewards for the agents, so I am also wondering what is going on there.

rstrivedi commented 1 year ago

How many workers are you using for this experiment? I have only seen this problem when there is a mismatch between `train_batch_size`, `num_workers`, and `sgd_minibatch_size`. Check if you get the same error if you reduce `sgd_minibatch_size` to 8 or 16.
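
For reference, a minimal sanity check of how those three settings interact (plain Python with hypothetical numbers matching your defaults, no RLlib calls):

    # Each of the `num_workers` rollout workers returns fragments of
    # `rollout_fragment_length` env steps; RLlib concatenates fragments until it
    # has at least `train_batch_size` steps, then PPO slices that batch into
    # chunks of `sgd_minibatch_size` for several epochs of SGD.
    num_workers = 2
    rollout_fragment_length = 5
    train_batch_size = 40
    sgd_minibatch_size = 32

    fragments_per_train_batch = train_batch_size // rollout_fragment_length  # 8
    assert train_batch_size % rollout_fragment_length == 0
    assert sgd_minibatch_size <= train_batch_size

If an assertion like one of these fails for your settings, that is the kind of mismatch I mean.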

itstyren commented 1 year ago

Thanks for your reply.

How many workers are you using for this experiment?

In this experiment, I have maintained the default number of workers, which is num_workers=2.

Check if you get the same error if you reduce `sgd_minibatch_size` to 8 or 16.

It seems that reducing the sgd_minibatch_size below 20 is not feasible due to the following error raised by RLlib:

ValueError: `sgd_minibatch_size` (16) cannot be smaller than `max_seq_len` (20).

To investigate whether there is a discrepancy between `train_batch_size`, `num_workers`, and `sgd_minibatch_size`, I increased the training sample sizes to the provided default settings:

"rollout_fragment_length": 10,
 "train_batch_size": 400, 
"sgd_minibatch_size": 32, 

Executed with the command `python baselines/train/run_ray_train.py --num_gpus 1 --wandb True`, the issue still persists, as shown in this wandb report.

rstrivedi commented 1 year ago

Ah, correct about `max_seq_len`. You can change that by adding it to your config too, but `sgd_minibatch_size` should always be greater than or equal to `max_seq_len`.
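
Something like this (a sketch, assuming the baseline forwards these keys through to RLlib, where `max_seq_len` lives in the model config and defaults to 20):

    # Hypothetical override: lower the RNN unroll length together with the
    # minibatch size, since RLlib enforces sgd_minibatch_size >= max_seq_len.
    config_overrides = {
        "sgd_minibatch_size": 16,
        "max_seq_len": 16,  # RLlib's default is 20
    }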

Could you change how long you train? Set it to 10 iterations or so and see if you get NaN for all iterations or only the first few.

itstyren commented 1 year ago

Thanks, @rstrivedi, I can confirm that this problem persists even during longer runs, as demonstrated in this report.

However, I just noticed a discrepancy between `num_agent_steps_sampled` and `num_env_steps_sampled` for the default settings, which are as follows:

| Metric | Value |
| --- | --- |
| num_agent_steps_sampled | 3,200 |
| num_agent_steps_trained | 3,200 |
| num_env_steps_sampled | 400 |
| num_env_steps_trained | 400 |

When I modify `train_batch_size` to 3200, I am able to obtain valid (non-NaN) episode rewards. I'm uncertain if this is where the issue originates. Do you have any suggestions or insights regarding this matter? Thanks!
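
For reference, the 8x gap itself looks consistent with a multi-agent substrate, where every env step emits one transition per agent. A plain-arithmetic sketch (the player count here is inferred from the ratio, not taken from the repo):

    # Inferred bookkeeping: 3,200 agent steps / 400 env steps = 8 players.
    num_players = 3200 // 400                              # 8 agents per environment
    env_steps_per_train_batch = 400                        # default train_batch_size
    agent_steps = num_players * env_steps_per_train_batch  # 3,200

    # If an episode lasts longer than 400 env steps, a train batch can complete
    # with zero finished episodes, and RLlib then reports the mean episode
    # reward as NaN until an episode actually terminates.

This might also explain why raising `train_batch_size` to 3200 starts producing real numbers: each training iteration then spans enough env steps for episodes to finish.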