kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling

val/loss=nan.0 when training with ddp (distributed data parallel) #22

Closed leannmlindsey closed 2 months ago

leannmlindsey commented 2 months ago

Hello, I have access to a system that has 4 GPUs per node. I wanted to run on 8 GPUs, so I modified your script (see below). It started without error and appeared to be training properly, and I confirmed that all GPUs were being used. But at the first evaluation I noticed that the loss and perplexity values for train and val were nan.0.

Do you have better suggestions on how to run on GPUs that are not on the same node?

What I changed:

Removed this line from run_pretrain_caduceus.sh:

trainer.devices=${NUM_DEVICES} \

Then changed the configuration file `~/caduceus/configs/trainer/default.yaml` this way:

```diff
 _target_: pytorch_lightning.Trainer
-devices: 1
+devices:
-accelerator: gpu
+accelerator: ddp
 accumulate_grad_batches: 1  # Gradient accumulation every n batches
 max_epochs: 200
 # accelerator: ddp  # Automatically set if gpus > 1
 gradient_clip_val: 0.0
 log_every_n_steps: 10
 limit_train_batches: 1.0  # train on full dataset, can be used to toggle quick run
 limit_val_batches: 1.0  # validate on full dataset, can be used to toggle quick run
 num_sanity_val_steps: 2  # default value: 2; override to 0 to skip sanity checking
```
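For context, here is a minimal sketch of how multi-node DDP is usually expressed through the PyTorch Lightning Trainer API; the values are illustrative, not the repo's actual launch code, and the exact argument names depend on the Lightning version the repo pins (in recent versions `accelerator` takes the hardware type and the distributed strategy goes under `strategy`, not `accelerator: ddp`):

```python
# Minimal sketch, not the repo's actual launch code: how recent PyTorch Lightning
# versions are typically configured for multi-node DDP (illustrative values).
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",  # hardware type
    devices=4,          # GPUs per node
    num_nodes=2,        # 4 GPUs/node x 2 nodes = 8 GPUs total
    strategy="ddp",     # distributed strategy; in newer PL this replaces accelerator="ddp"
)
```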

yair-schiff commented 2 months ago

At what step did you hit the NaN? I think the model is training at too high an LR for too long. You set max_epochs to 200, which is much higher than what we used; for example, at 10k training steps we don't even complete a full epoch. The learning rate scheduler is configured based on the max number of epochs/steps, so my guess is that your LR stays too high for this long a training run.
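As a hedged illustration of that last point (this is not the repo's actual scheduler, which may also include warmup and a nonzero floor): with a cosine-style decay, a much larger total-step budget keeps the LR near its peak long after a shorter schedule would have decayed it.

```python
# Illustrative only: cosine decay from peak_lr to 0 over total_steps,
# evaluated at the same step for two different total-step budgets.
import math

def cosine_lr(step: int, peak_lr: float, total_steps: int) -> float:
    progress = min(step / total_steps, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

peak_lr = 8e-3  # the LR set in run_pretrain_caduceus.sh
step = 10_000
print(cosine_lr(step, peak_lr, total_steps=10_000))     # ~0.0: fully decayed
print(cosine_lr(step, peak_lr, total_steps=1_000_000))  # ~8e-3: barely decayed
```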

leannmlindsey commented 2 months ago

I am getting the nan.0 in epoch 0, the very first time it runs an evaluation on the val file.

I did not make any change to the max_epochs. That is the default in the ~/caduceus/configs/trainer/default.yaml file.

The only change that I made to the run_pretrain_caduceus.sh file was to remove the line overriding trainer.devices, so I believe that the learning rate is set in that file with:

LR="8e-3"
optimizer.lr="${LR}"

I am currently setting up a new environment and cloning again from scratch, to see if perhaps I made some other error along the way...I will let you know how that goes.

leannmlindsey commented 2 months ago

Unfortunately, it did happen again with the clean install (and the changes listed above). I did make one more change: since the batch size was too large and causing an OOM error, I changed

BATCH_SIZE=$(( 1048576 / (SEQLEN*4) ))

to reduce the batch size
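To make the arithmetic of that expression explicit (my reading of the formula, not a statement about what the repo intends by it):

```python
# Sketch of the shell arithmetic BATCH_SIZE=$(( 1048576 / (SEQLEN*4) )):
# it divides a budget of 2**20 tokens by the sequence length and a factor of 4.
SEQLEN = 1024
BATCH_SIZE = 1_048_576 // (SEQLEN * 4)
print(BATCH_SIZE)            # 256 sequences per batch
print(BATCH_SIZE * SEQLEN)   # 262144 tokens per batch (2**20 / 4)
```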

I uploaded the log file so that you can take a look. caduceus_ps_3413538.log

yair-schiff commented 2 months ago

You can keep the same effective / global batch size by increasing gradient accumulation steps: keep train.global_batch_size at 1024 but reduce dataset.batch_size to the largest value that fits on your machine. Gradient accumulation should be computed automatically by the config here:

  accumulate_grad_batches: ${div_up:${train.global_batch_size}, ${eval:${trainer.devices} * ${dataset.batch_size} * ${trainer.num_nodes}}}
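For concreteness, a small sketch of the arithmetic that resolver appears to perform (assuming div_up is a ceiling division, as the name suggests; the numbers are illustrative):

```python
# Illustrative: how accumulate_grad_batches falls out of the other settings.
import math

global_batch_size = 1024   # train.global_batch_size
devices = 4                # trainer.devices (GPUs per node)
num_nodes = 1              # trainer.num_nodes
per_gpu_batch_size = 64    # dataset.batch_size, chosen to fit in GPU memory

accumulate_grad_batches = math.ceil(
    global_batch_size / (devices * per_gpu_batch_size * num_nodes)
)
# effective batch = devices * num_nodes * per_gpu_batch_size * accumulate_grad_batches
print(accumulate_grad_batches)  # 4 -> 4 * 1 * 64 * 4 = 1024
```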
leannmlindsey commented 2 months ago

Thank you, Yair. I will try that.

Have you been successful running caduceus across multiple nodes using ddp?

Update:

I tried again on a different cluster and got the same error.

Right now I am just training everything with 4 GPUs on 1 node, which is not taking too much time (approximately 3 hrs for SEQLEN=1024), so perhaps it doesn't matter that I can't scale it up.

yair-schiff commented 2 months ago

I have not tried multi-node training

leannmlindsey commented 2 months ago

Don't worry. It trains so fast that multi-node training is probably not necessary.