sony / nnabla-examples

Neural Network Libraries https://nnabla.org/ - Examples

CenterNet mixed-precision training cannot work well with specific cuDNN versions #373

Open hyingho opened 1 year ago

hyingho commented 1 year ago

CenterNet mixed-precision training does not work correctly with specific cuDNN versions: the training loss becomes NaN, or the gradient computation fails with a RuntimeError (see the error messages below).

How to reproduce

python src/main.py ctdet --config_file=cfg/resnet_18_coco_mp.yaml --data_dir path_to_coco_dataset

Error messages

2023-03-02 06:18:26,839 [nnabla][INFO]: Using DataIterator
2023-03-02 06:18:26,865 [nnabla][INFO]: Creating model...
2023-03-02 06:18:26,865 [nnabla][INFO]: {'hm': 80, 'wh': 2, 'reg': 2}
2023-03-02 06:18:26,865 [nnabla][INFO]: batch size per gpu: 24
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00:   0%|  
[Train] epoch:0/140||loss:     nan, hm_loss:     nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00:   0%|          | 6/4929 [00:06<1:33:43,  1.14s/it]^C

or

2023-03-02 05:47:38,953 [nnabla][INFO]: Using DataIterator
2023-03-02 05:47:38,959 [nnabla][INFO]: Creating model...
2023-03-02 05:47:38,959 [nnabla][INFO]: {'hm': 80, 'reg': 2, 'wh': 2}
2023-03-02 05:47:38,964 [nnabla][INFO]: batch size per gpu: 32
  0%|          | 0/3697 [00:00<?, ?it/s]
  0%|          | 0/3697 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "nnabla-examples/object-detection/centernet/src/main.py", line 147, in <module>
    main(opt)
  File "nnabla-examples/object-detection/centernet/src/main.py", line 112, in main
    _ = trainer.update(epoch)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 191, in update
    total_loss, hm_loss, wh_loss, off_loss = self.compute_gradient(
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  [Previous line repeated 7 more times]
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 175, in compute_gradient
    raise RuntimeError(
RuntimeError: Something went wrong with gradient calculations.
--------------------------------------------------------------------------
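
For context, the recursive compute_gradient calls in this traceback are consistent with a dynamic loss-scaling retry loop: when the scaled fp16 gradients overflow or become NaN, the loss scale is reduced and the same batch is retried, and the RuntimeError is raised once the retries are exhausted. A minimal runnable sketch of that shape (all names below are illustrative assumptions, not the actual code in ctdet.py):

import numpy as np

MAX_RETRIES = 10  # assumed retry budget; the traceback shows roughly 10 nested calls

def backward_half_precision(loss, scale):
    # Stand-in for the real backward pass: fp16 gradients overflow to inf
    # once loss * scale exceeds the fp16 range (~65504), and NaN stays NaN.
    return np.array([loss * scale], dtype=np.float16)

def compute_gradient(loss, scale=4.0, attempt=0):
    if attempt >= MAX_RETRIES:
        raise RuntimeError("Something went wrong with gradient calculations.")
    grads = backward_half_precision(loss, scale)
    if not np.all(np.isfinite(grads)):
        # Overflow/NaN detected: halve the loss scale and retry the same batch.
        return compute_gradient(loss, scale / 2.0, attempt + 1)
    return grads / scale  # unscale before the optimizer update

compute_gradient(loss=300.0)         # finite gradients, returns on the first try
compute_gradient(loss=float("nan"))  # NaN never clears, retries exhaust, RuntimeError is raised

If the loss itself is already NaN, as in the log above, lowering the scale can never help, which is why the retry loop eventually gives up.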

How to solve

Using a newer cuDNN version solved this issue.
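
For reference, one way to confirm which cuDNN build is actually loaded at runtime is to call cudnnGetVersion() through ctypes. This is only a sketch; the library name "libcudnn.so" is an assumption and may need adjusting (e.g. to a versioned "libcudnn.so.8") for your installation.

import ctypes

# Query the cuDNN version that the dynamic linker resolves at runtime.
# "libcudnn.so" is an assumption; adjust it to match your install.
libcudnn = ctypes.CDLL("libcudnn.so")
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN version:", libcudnn.cudnnGetVersion())  # e.g. 8600 for cuDNN 8.6.0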