uber-research / UPSNet

UPSNet: A Unified Panoptic Segmentation Network

Getting NaN for loss/accuracy values on 4 GPU config file #141

Closed. rlangefe closed this issue 3 years ago.

rlangefe commented 3 years ago

I was trying to reproduce the COCO results with the 4-GPU configuration. We were able to train on 2 P100s, but when we switched to 4 V100s we got this:

2021-03-17 12:21:19,650 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:19,653 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:23,132 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:23,133 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:29,916 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:29,920 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:33,335 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:33,336 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:36,785 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:36,787 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:00,481 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:00,483 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:03,868 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:03,869 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:14,195 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:14,196 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:17,611 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:20,966 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:20,968 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:24,389 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:24,390 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:28,071 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:28,072 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:31,507 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:31,508 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,329 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,330 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,332 | callback.py | line 40 : Batch [20]    Speed: 0.83 samples/sec Train-rpn_cls_loss=283.769145,  rpn_bbox_loss=nan,  rcnn_accuracy=nan,  cls_loss=nan,   bbox_loss=nan,  mask_loss=24.724522,    fcn_loss=nan,   fcn_roi_loss=1.313301,  panoptic_accuracy=0.267114, panoptic_loss=27.402849,

Does anyone know what might be causing this? We're using the standard COCO dataset and the provided upsnet_resnet50_coco_4gpu.yaml config file.
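One way to narrow this down (a minimal debugging sketch, not code from the UPSNet repo; the loss names in the commented usage are just the ones from the log above) is to enable PyTorch's anomaly detection and check each loss term for NaN/Inf before the backward pass, so the first non-finite term is caught at its source rather than later in the TensorBoard logger (x2num.py):

```python
import torch

# Raise an error as soon as a backward pass produces NaN/Inf (slow, debug only).
torch.autograd.set_detect_anomaly(True)

def check_losses(loss_dict):
    """Fail fast if any individual loss term is non-finite."""
    for name, value in loss_dict.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(f"non-finite value in {name}: {value}")

# Inside the training loop, before calling total_loss.backward(), e.g.:
# check_losses({'rpn_cls_loss': rpn_cls_loss, 'rpn_bbox_loss': rpn_bbox_loss,
#               'cls_loss': cls_loss, 'bbox_loss': bbox_loss, 'fcn_loss': fcn_loss})
```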

rlangefe commented 3 years ago

Just an update on this: training runs fine if I switch to a single V100, but on 4 V100s (or even 2 V100s) it breaks like this, and I eventually run into an invalid-axis error as well.
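Since the failure only appears once more than one GPU is involved, a quick sanity check (a rough sketch of my own, not part of UPSNet) is to verify that tensors copied between the V100s arrive intact; silently corrupted inter-GPU transfers would explain losses that go NaN in multi-GPU runs but never on a single device:

```python
import torch

# Copy a tensor from GPU 0 to every other GPU and compare it after a round trip.
n = torch.cuda.device_count()
x = torch.randn(1000, device='cuda:0')

for dst in range(1, n):
    p2p = torch.cuda.can_device_access_peer(0, dst)   # is peer-to-peer access possible?
    y = x.to(f'cuda:{dst}')                           # device 0 -> device dst
    ok = torch.equal(y.cpu(), x.cpu())                # compare via host memory
    print(f'cuda:0 -> cuda:{dst}  p2p={p2p}  copy_ok={ok}')
```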

rlangefe commented 3 years ago

For anyone who runs into this issue in the future, we did find the solution. It turned out to be related to the kernel and how the machine boots: we had to disable IOMMU passthrough for the PCI bus in our grub.cfg. After doing that, training ran without the issue. It seems the V100s were hitting a problem tied to a platform feature that doesn't affect the P100 nodes; it prevented the GPUs from communicating with each other correctly, which is why training still worked on a single V100.
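For reference, the usual way to change this is through the GRUB defaults rather than editing grub.cfg by hand. This is only a hedged sketch: the exact kernel parameter depends on the platform (Intel vs. AMD) and on whether the IOMMU should be turned off or put into passthrough mode, and the comment above does not say which option was used.

```
# /etc/default/grub  (sketch only; pick the parameter that matches your platform)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"   # or amd_iommu=off, or iommu=pt

# Then regenerate the config and reboot:
#   sudo update-grub
#   sudo reboot
```

NVIDIA's documentation for PCIe peer-to-peer transfers recommends disabling the IOMMU (or using passthrough mode) on bare-metal Linux because it can break direct GPU-to-GPU communication, which is consistent with the multi-GPU-only failure described here.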