Status: Open. hyingho opened this issue 1 year ago.
CenterNet mixed-precision training does not work with specific cuDNN versions.

How to reproduce
Branch: master
Use nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda110-mpi3.1.6-v1.34.0 as the base image and install the necessary packages (see https://github.com/sony/nnabla-examples/blob/master/object-detection/centernet/requirements.txt).
Versions in this image: cuda=11.0.3, CUDNN_VERSION=8.0.5.39
Command:
python src/main.py ctdet --config_file=cfg/resnet_18_coco_mp.yaml --data_dir path_to_coco_dataset

Error messages
The loss becomes NaN within the first few iterations:
2023-03-02 06:18:26,839 [nnabla][INFO]: Using DataIterator
2023-03-02 06:18:26,865 [nnabla][INFO]: Creating model...
2023-03-02 06:18:26,865 [nnabla][INFO]: {'hm': 80, 'wh': 2, 'reg': 2}
2023-03-02 06:18:26,865 [nnabla][INFO]: batch size per gpu: 24
[Train] epoch:0/140||loss: -0.0000, hm_loss:245.3517, wh_loss: 28.8467, off_loss: 28.8467, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss:299.5544, hm_loss:296.1249, wh_loss: 29.4914, off_loss: 29.4914, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 30.1704, off_loss: 30.1704, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.1151, off_loss: 21.1151, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 24.2714, off_loss: 24.2714, lr:1.00e-04, scale:4.00e+00: 0%|
[Train] epoch:0/140||loss: nan, hm_loss: nan, wh_loss: 21.7357, off_loss: 21.7357, lr:1.00e-04, scale:4.00e+00: 0%| | 6/4929 [00:06<1:33:43, 1.14s/it]^C

or training aborts with a RuntimeError:
2023-03-02 05:47:38,953 [nnabla][INFO]: Using DataIterator
2023-03-02 05:47:38,959 [nnabla][INFO]: Creating model...
2023-03-02 05:47:38,959 [nnabla][INFO]: {'hm': 80, 'reg': 2, 'wh': 2}
2023-03-02 05:47:38,964 [nnabla][INFO]: batch size per gpu: 32
  0%| | 0/3697 [00:00<?, ?it/s]
  0%| | 0/3697 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "nnabla-examples/object-detection/centernet/src/main.py", line 147, in <module>
    main(opt)
  File "nnabla-examples/object-detection/centernet/src/main.py", line 112, in main
    _ = trainer.update(epoch)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 191, in update
    total_loss, hm_loss, wh_loss, off_loss = self.compute_gradient(
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 178, in compute_gradient
    return self.compute_gradient(data)
  [Previous line repeated 7 more times]
  File "nnabla-examples/object-detection/centernet/src/lib/trains/ctdet.py", line 175, in compute_gradient
    raise RuntimeError(
RuntimeError: Something went wrong with gradient calculations.
--------------------------------------------------------------------------
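The traceback shows compute_gradient calling itself repeatedly before raising the RuntimeError, and the progress lines report a loss scale (scale:4.00e+00). That shape is typical of a dynamic loss-scaling retry loop: run the backward pass, and if any gradient is non-finite, lower the scale and try again. The sketch below is only a generic illustration of that pattern in plain Python, not the code in ctdet.py. The names compute_gradient_with_retry, backward, max_retries and factor are hypothetical, and the retry count and scale factor are assumptions.

```python
import math


def compute_gradient_with_retry(backward, scale, max_retries=10, factor=2.0):
    """Retry a backward pass with a smaller loss scale until gradients are finite.

    `backward(scale)` is a hypothetical callable that returns a list of
    gradient values computed from the loss multiplied by `scale`.
    """
    for _ in range(max_retries + 1):
        grads = backward(scale)
        if all(math.isfinite(g) for g in grads):
            # Undo the loss scaling before handing gradients to the optimizer.
            return [g / scale for g in grads], scale
        # Non-finite gradient: assume overflow and shrink the scale.
        scale /= factor
    raise RuntimeError("Something went wrong with gradient calculations.")


def overflow_above_8(scale):
    # Stand-in backward pass that overflows only when the scale is too large.
    return [float("inf")] if scale > 8 else [scale * 0.5]


def always_nan(scale):
    # Stand-in backward pass whose NaN does not depend on the scale at all.
    return [float("nan")]


if __name__ == "__main__":
    print(compute_gradient_with_retry(overflow_above_8, 32.0))
    try:
        compute_gradient_with_retry(always_nan, 4.0)
    except RuntimeError as err:
        print("gave up:", err)
```

When the non-finite values come from a genuine overflow, lowering the scale recovers. When they are produced regardless of the scale, as in always_nan, every retry fails and the loop ends in the RuntimeError, which would be consistent with the second failure reported here.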
How to solve
Using a newer cuDNN version solved this issue.
Use nnabla/nnabla-ext-cuda-multi-gpu:py310-cuda116-mpi3.1.6-v1.34.0 as the base image and install the necessary packages (see https://github.com/sony/nnabla-examples/blob/master/object-detection/centernet/requirements.txt).
Versions in this image: cuda=11.6.0, CUDNN_VERSION=8.4.0.27
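Since the outcome depends on the cuDNN version in the container, it may help to check the version before starting a long mixed-precision run. The following is a minimal sketch that assumes the image exposes the version through a CUDNN_VERSION environment variable, as the values quoted in this issue suggest. The helper names classify_cudnn and check_environment are hypothetical. Only the two versions reported in this issue are classified; every other version is treated as untested, because the issue does not say where the behavior changes.

```python
import os

# The only two data points reported in this issue.
KNOWN_BAD = {"8.0.5.39"}   # NaN loss / RuntimeError with mixed precision
KNOWN_GOOD = {"8.4.0.27"}  # training ran normally


def classify_cudnn(version):
    """Classify a cuDNN version string against the versions from this issue."""
    if version in KNOWN_BAD:
        return "known-bad"
    if version in KNOWN_GOOD:
        return "known-good"
    return "untested"


def check_environment(env=None):
    """Read CUDNN_VERSION (assumed to be set by the base image) and classify it."""
    env = os.environ if env is None else env
    version = env.get("CUDNN_VERSION")
    if version is None:
        return "unknown"
    return classify_cudnn(version)


if __name__ == "__main__":
    print("cuDNN status:", check_environment())
```

Run inside the container, this prints one of known-bad, known-good, untested or unknown. A known-bad result suggests switching to the cuda116 image before training.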