Closed: leannmlindsey closed this issue 2 months ago.
At what step did you hit the NaN? I think the model is training at too high an LR for too long. You set max_epochs to 200, which is much higher than what we used; for example, at 10k training steps we don't even complete a full epoch. The learning rate schedule is determined by the max number of epochs/steps, so my guess is that your LR stays too high for this long a training run.
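To illustrate the point (a generic cosine decay sketch with hypothetical step counts, not necessarily the repo's actual scheduler): when the decay horizon is derived from max_epochs, a 200-epoch run keeps the LR near its 8e-3 peak long after a 10k-step schedule would have fully decayed.

```python
import math

def cosine_lr(step, total_steps, peak_lr=8e-3, min_lr=0.0):
    """Generic cosine decay from peak_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# A schedule sized for 10k steps has fully decayed by step 10k...
short_horizon = cosine_lr(10_000, total_steps=10_000)
# ...while a 200-epoch horizon (say ~2M steps, hypothetical) is still ~8e-3.
long_horizon = cosine_lr(10_000, total_steps=2_000_000)
print(short_horizon, long_horizon)
```

So with max_epochs=200, the model sits near the peak LR for almost the entire run, which is a plausible path to a NaN.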
I am getting the nan.0 in epoch 0, the very first time it runs an evaluation on the val file.
I did not make any change to the max_epochs. That is the default in the ~/caduceus/configs/trainer/default.yaml file.
The only change I made to the run_pretrain_caduceus.sh file was to remove the line overriding trainer.devices, so I believe the learning rate is still set in that file with
LR="8e-3"
optimizer.lr="${LR}"
I am currently setting up a new environment and cloning again from scratch, to see if perhaps I made some other error along the way...I will let you know how that goes.
Unfortunately, it did happen again with the clean install (and the changes listed above). I did make one more change: the batch size was too large and causing an OOM, so I changed
BATCH_SIZE=$(( 1048576 / (SEQLEN*4) ))
to reduce the batch size.
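For reference, that shell expression sizes the per-device batch from a fixed token budget, and it is easy to sanity-check the arithmetic (mirrored here in Python):

```python
# Mirror of the shell arithmetic: BATCH_SIZE=$(( 1048576 / (SEQLEN*4) ))
SEQLEN = 1024
BATCH_SIZE = 1048576 // (SEQLEN * 4)
print(BATCH_SIZE)  # 256 sequences per batch at SEQLEN=1024
```

So at SEQLEN=1024 the default works out to 256 sequences per batch, which is what needs shrinking to fit in memory.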
I uploaded the log file so that you can take a look. caduceus_ps_3413538.log
You can keep the same effective/global batch size by increasing the gradient accumulation steps: leave train.global_batch_size at 1024 but reduce dataset.batch_size to the largest value that fits on your machine. Gradient accumulation should be computed automatically by the config file here:
accumulate_grad_batches: ${div_up:${train.global_batch_size}, ${eval:${trainer.devices} * ${dataset.batch_size} * ${trainer.num_nodes}}}
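In plain Python, that resolver amounts to a ceiling division (the numbers below are hypothetical, e.g. 4 GPUs on 1 node with a per-device batch of 32):

```python
def div_up(a, b):
    # Ceiling division, matching the div_up resolver used in the config
    return -(-a // b)

global_batch_size = 1024    # train.global_batch_size
devices = 4                 # trainer.devices (hypothetical)
dataset_batch_size = 32     # dataset.batch_size (whatever fits in memory)
num_nodes = 1               # trainer.num_nodes

accumulate_grad_batches = div_up(
    global_batch_size, devices * dataset_batch_size * num_nodes
)
print(accumulate_grad_batches)  # 8 micro-batches per optimizer step
```

With these example numbers, each optimizer step accumulates gradients over 8 micro-batches, so the effective batch stays at 1024 sequences.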
Thank you, Yair. I will try that.
Have you been successful running caduceus across multiple nodes using ddp?
Update:
I tried again on a different cluster and I had the same error.
Right now I am just training everything with 4 GPUs on 1 node, which does not take too long (approximately 3 hrs for SEQLEN=1024), so perhaps it doesn't matter that I can't scale it up.
I have not tried multi-node training
Don't worry. It trains so fast that multi-node is probably not necessary.
Hello, I have access to a system with 4 GPUs per node. I wanted to run on 8 GPUs, so I modified your script (see below). It started without error and appeared to be training properly; I checked that all GPUs were being used, and they were. But at the first evaluation I noticed that the loss and perplexity values for train and val were nan.0.
Do you have better suggestions on how to run on GPUs that are not on the same node?
What I changed:
Removed this line from run_pretrain_caduceus.sh:
trainer.devices=${NUM_DEVICES} \
Then changed the configuration file ~/caduceus/configs/trainer/default.yaml this way:
_target_: pytorch_lightning.Trainer
devices: 1
devices:
accelerator: ddp
accelerator: gpu
accumulate_grad_batches: 1  # Gradient accumulation every n batches
max_epochs: 200
accelerator: ddp  # Automatically set if gpus > 1
gradient_clip_val: 0.0
log_every_n_steps: 10
limit_train_batches: 1.0  # train on full dataset, can be used to toggle quick run
limit_val_batches: 1.0  # validate on full dataset, can be used to toggle quick run
num_sanity_val_steps: 2  # default value: 2; override to 0 to skip sanity checking