jinpeng0528 / BalConpas

Code release for "Strike a Balance in Continual Panoptic Segmentation" (ECCV 2024)

FloatingPointError: Loss became infinite or NaN at iteration=0! #1

Closed ukpkhkk closed 2 months ago

ukpkhkk commented 2 months ago

Thanks for your amazing work! I ran into a problem while trying to reproduce the results in the paper: I tried to train the model with DDP but received an error. My script is as follows:

```bash
ngpus=$(nvidia-smi --list-gpus | wc -l)

# Step 1: base training
python train_continual.py --num-gpus ${ngpus} --config-file configs/ade20k/panoptic-segmentation/100-10.yaml \
  CONT.TASK 1 SOLVER.BASE_LR 0.0001 SOLVER.MAX_ITER 160000 OUTPUT_DIR ./output/ps/100-10/step1

# Steps 2-6: incremental training
for t in 2 3 4 5 6; do
  python train_continual.py --num-gpus ${ngpus} --config-file configs/ade20k/panoptic-segmentation/100-10.yaml \
    CONT.TASK ${t} SOLVER.BASE_LR 0.00005 SOLVER.MAX_ITER 10000 OUTPUT_DIR ./output/ps/100-10/step${t}
done
```

I successfully trained the model on the base classes with DDP, but I got a loss-became-NaN error when training on the incremental classes. The error output is as follows:

ERROR [07/28 16:49:32 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/public/home/experiment/detectron2/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/public/home/experiment/detectron2/detectron2/engine/defaults.py", line 498, in run_step
    self._trainer.run_step()
  File "/public/home/experiment/BalConpas/continual/train_loop.py", line 308, in run_step
    self._write_metrics(loss_dict, data_time)
  File "/public/home/experiment/BalConpas/continual/train_loop.py", line 178, in _write_metrics
    SimpleTrainer.write_metrics(loss_dict, data_time, prefix)
  File "/public/home/experiment/BalConpas/continual/train_loop.py", line 214, in write_metrics
    raise FloatingPointError(
FloatingPointError: Loss became infinite or NaN at iteration=0!
loss_dict = {'loss_ce': 3.95108699798584, 'loss_mask': 1.4035845771431923, 'loss_dice': 0.6953499168157578, 'loss_med_tokens': nan, 'loss_ce_0': 2.8116466999053955, 'loss_mask_0': 1.0946932435035706, 'loss_dice_0': 0.9748239517211914, 'loss_med_tokens_0': nan, 'loss_ce_1': 3.801536440849304, 'loss_mask_1': 1.5072991624474525, 'loss_dice_1': 0.7356392443180084, 'loss_med_tokens_1': nan, 'loss_ce_2': 3.9190744161605835, 'loss_mask_2': 1.437846951186657, 'loss_dice_2': 0.690296858549118, 'loss_med_tokens_2': nan, 'loss_ce_3': 3.9592268466949463, 'loss_mask_3': 1.4620409235358238, 'loss_dice_3': 0.6570306122303009, 'loss_med_tokens_3': nan, 'loss_ce_4': 4.005005478858948, 'loss_mask_4': 1.545292891561985, 'loss_dice_4': 0.6310994476079941, 'loss_med_tokens_4': nan, 'loss_ce_5': 3.9652328491210938, 'loss_mask_5': 1.5718111842870712, 'loss_dice_5': 0.6893632113933563, 'loss_med_tokens_5': nan, 'loss_ce_6': 3.9684624671936035, 'loss_mask_6': 1.485484890639782, 'loss_dice_6': 0.6558998078107834, 'loss_med_tokens_6': nan, 'loss_ce_7': 3.8589271306991577, 'loss_mask_7': 1.4487100467085838, 'loss_dice_7': 0.7139590978622437, 'loss_med_tokens_7': nan, 'loss_ce_8': 3.923025965690613, 'loss_mask_8': 1.4663280546665192, 'loss_dice_8': 0.7021051645278931, 'loss_med_tokens_8': nan}

My conda environment is torch==1.11.0+cu113, torchvision==0.12.0+cu113, torchaudio==0.11.0, python==3.8.17. I am looking forward to your reply!

jinpeng0528 commented 2 months ago

How many GPUs are you using for training?

I recall encountering a similar situation during my previous experiments. It might be because the samples assigned to a single GPU were all replay samples. I’ve only used up to 2 GPUs before, so this situation rarely occurred.

If you are using 4 or more GPUs simultaneously, I guess the probability of this happening might be higher. Could you try training with 2 GPUs to see if it works successfully?
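
To illustrate what I mean, here is a simplified sketch (hypothetical names, not the actual BalConpas code) of how a batch containing only replay samples can turn an averaged term such as loss_med_tokens into NaN:

```python
# Simplified sketch (hypothetical names, not the actual BalConpas code):
# a loss averaged only over new-class tokens becomes 0/0 = NaN when a GPU's
# batch happens to contain replay samples only.
import torch

def masked_mean_loss(per_token_loss: torch.Tensor, new_token_mask: torch.Tensor) -> torch.Tensor:
    # per_token_loss: (N,) per-token losses; new_token_mask: (N,) 1.0 for new-class tokens, else 0.0
    denom = new_token_mask.sum()
    return (per_token_loss * new_token_mask).sum() / denom  # NaN when denom == 0

def safe_masked_mean_loss(per_token_loss: torch.Tensor, new_token_mask: torch.Tensor) -> torch.Tensor:
    # Clamping the denominator (or skipping the term entirely) avoids the NaN.
    denom = new_token_mask.sum().clamp(min=1.0)
    return (per_token_loss * new_token_mask).sum() / denom
```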

ukpkhkk commented 2 months ago

Thanks for your reply! I hit this error when training with 4 GPUs. I have also tried training with 2 GPUs, but I still got the same error.

jinpeng0528 commented 2 months ago

I’ll try running with 4 GPUs on my side to see if I can reproduce this error. Please give me a day or two.

Additionally, I noticed that your PyTorch version is different from mine; I’m using version 1.10.1. Also, I’m not sure what your CUDA version is, but it’s best to use 11.3, as Detectron2 is likely sensitive to this. If you have the time, you might want to adjust this part as well.
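
If it helps, here is a quick check (a generic snippet, nothing specific to this repo) to confirm which builds your environment actually has:

```python
# Quick environment check: print the PyTorch/torchvision versions and the CUDA
# version the wheels were compiled against (the README expects CUDA 11.3).
import torch
import torchvision

print("torch:", torch.__version__)                 # e.g. 1.10.1+cu113
print("torchvision:", torchvision.__version__)
print("CUDA (torch build):", torch.version.cuda)   # e.g. 11.3
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```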

ukpkhkk commented 2 months ago

Hi! Have you been able to reproduce this error? I changed my PyTorch version to 1.10.1, but it didn't help. I can only train the model on the incremental classes without the error when using a single GPU. My CUDA version has always been 11.3, exactly as written in your README.

jinpeng0528 commented 2 months ago

Thank you for your patience. I’ve identified the issue—it was due to the previous code being unable to load parameters for model_old in a multi-GPU setup. I have modified the code (continual/trainer.py), and successfully completed the second step of training on 4 GPUs. You can try running this code to see if it works for you.
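
Roughly, the idea of the fix is that in a multi-GPU run every process has to load the previous-step checkpoint into its own frozen model_old. The sketch below only illustrates the general pattern; it is simplified and not the exact code in continual/trainer.py:

```python
# Simplified illustration of the idea, not the exact fix in continual/trainer.py:
# each DDP process loads the previous-step weights into its own frozen model_old,
# stripping the "module." prefix that DistributedDataParallel checkpoints may add.
import torch

def load_model_old(model_old, ckpt_path, device):
    state = torch.load(ckpt_path, map_location=device)
    state = state.get("model", state)  # detectron2 checkpoints usually keep weights under "model"
    state = {k[len("module."):] if k.startswith("module.") else k: v for k, v in state.items()}
    model_old.load_state_dict(state, strict=False)
    model_old.eval()
    for p in model_old.parameters():
        p.requires_grad_(False)
    return model_old
```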

Thank you once again for your attention to our work.