rlangefe closed this issue 3 years ago
Just an update on this: it runs fine if I switch to a single V100, but on 4 V100s (or even 2 V100s) it breaks like this and I run into an invalid axis error.
For anyone who runs into this issue in the future, we did find the solution. The problem comes from the kernel and how the machine boots. We had to disable IOMMU passthrough for the PCI bus in our grub.cfg. After doing this, we were able to run without the issue. It seems the V100s were affected by a feature of that architecture that doesn't apply to the P100s. It kept the GPUs from communicating with each other successfully, which is why the run worked on a single V100.
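If you want to see which IOMMU settings your own machine booted with before touching grub.cfg, here is a minimal sketch. It is not part of UPSNet; the helper names `iommu_params` and `read_cmdline` are made up for illustration, and it assumes a Linux host where the boot parameters are exposed at `/proc/cmdline`. It only reports what is currently set. It does not tell you which value is right for your hardware, and the exact parameter we changed is not listed in this thread.

```python
from pathlib import Path

# Kernel parameters that commonly control IOMMU behaviour on Linux.
# Assumption: the setting of interest is one of these three keys.
IOMMU_KEYS = ("iommu", "intel_iommu", "amd_iommu")


def iommu_params(cmdline: str) -> dict:
    """Return the IOMMU-related parameters found in a kernel command line."""
    found = {}
    for token in cmdline.split():
        # Split "key=value"; tokens without "=" yield an empty value.
        key, _, value = token.partition("=")
        if key in IOMMU_KEYS:
            found[key] = value
    return found


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text().strip()
```

Calling `iommu_params(read_cmdline())` on the affected machine returns a dict of whatever IOMMU parameters were passed at boot, or an empty dict if none were set explicitly. Comparing that output between a working and a failing node may help confirm whether the boot configuration differs.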
I was trying to reproduce the COCO results on 4 GPUs. We were able to run on 2 P100s, but when we switched to 4 V100s, we got this:
Does anyone know what might be causing it? We're using the normal COCO dataset and the provided `upsnet_resnet50_coco_4gpu.yaml` config file.