Issue opened by @datumbox
Hmm, can it be that nvjpeg from 11.6 returns garbage?
@malfet I don't think it is related to nvjpeg. The data loaded into the model are random tensors with a fixed seed, so there is no I/O op in this test. I wonder if something in CUDA 11.6 could be affecting one of the kernels in Core?
Can you get a Kineto trace of the kernels run during the test with 11.3 and 11.6? Then we can compare them and see what changes. @malfet is the cuDNN version the same for the 11.3 and 11.6 builds?
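For reference, a minimal sketch of how such a trace could be captured with `torch.profiler` (the model/input setup below is a placeholder, not the actual test code, which uses fixed-seed random inputs):

```python
# Sketch only: capture a Kineto trace of the CUDA kernels for one inference pass.
# The model/input setup is a placeholder standing in for the test's fixed-seed inputs.
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

torch.manual_seed(0)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None).cuda().eval()
inputs = [torch.rand(3, 300, 300, device="cuda")]

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(inputs)

# Export a Chrome trace that can be diffed between the 11.3 and 11.6 runs.
prof.export_chrome_trace("trace_cu116.json")
```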
@ngimel they should be the same; let's see if I can reproduce this failure using nightly builds of torch and vision.
Ok, I did a little bit of digging:
```
(c:\Users\circleci\project\env) C:\Users\circleci>python -c "import torch;print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
1.13.0.dev20220915+cu116 11.6 8302

(c:\Users\circleci\project\env) C:\Users\circleci>python -c "import torch;print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
1.13.0.dev20220914+cu113 11.3 8302
```
But the test itself is skipped when running on CUDA-11.3 Windows, as seen [here](https://app.circleci.com/pipelines/github/pytorch/vision/20135/workflows/4bd14931-962b-4344-a465-f74ffe7b51e8/jobs/1636941/steps):
```
++ echo CUDA_VERSION is 11.3
...
test/test_models.py::test_detection_model[cuda-fasterrcnn_resnet50_fpn] SKIPPED [ 79%]
test/test_models.py::test_detection_model[cuda-fasterrcnn_resnet50_fpn_v2] SKIPPED [ 79%]
```
Interesting, so not a regression then?
OK, I know this can be confusing, but the tests are marked as skipped because we couldn't fully validate their output. If we are only able to partially validate the results (due to the unstable sort in postprocessing), instead of marking the test as "SUCCESS" we mark it as "SKIPPED", because part of it couldn't be executed. If any of the less strict checks fail, we mark it as "FAIL". So the execution was not really skipped on CUDA-11.3, just partially validated. On the other hand, on CUDA-11.6 the run is marked as failed because we can say with certainty that it doesn't match the expected result.
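In other words, the reporting roughly follows the pattern sketched below (simplified; the validation helpers are hypothetical stand-ins for the real checks in `test_models.py`):

```python
# Simplified sketch of the SUCCESS / SKIPPED / FAIL logic described above.
import warnings
import pytest
import torch

def full_validation(output, expected):
    # Hypothetical stand-in for the strict check of the model output.
    torch.testing.assert_close(output["scores"], expected["scores"])

def partial_validation(output, expected):
    # Hypothetical stand-in for the relaxed check (e.g. only the top detections).
    torch.testing.assert_close(output["scores"][:2], expected["scores"][:2])

def validate(output, expected):
    try:
        full_validation(output, expected)
        return True                           # full match -> reported as SUCCESS
    except AssertionError:
        partial_validation(output, expected)  # raises on mismatch -> FAIL
        return False                          # only partially validated

def test_detection_model(output, expected):
    if not validate(output, expected):
        msg = ("The output of test_detection_model could only be "
               "partially validated.")
        warnings.warn(msg, RuntimeWarning)
        pytest.skip(msg)                      # reported as SKIPPED, not SUCCESS
```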
You can verify this by looking further up in the log:
```
test/test_models.py::test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_320_fpn]
test/test_models.py::test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_fpn]
test/test_models.py::test_detection_model[cuda-fasterrcnn_resnet50_fpn]
test/test_models.py::test_detection_model[cuda-fasterrcnn_resnet50_fpn_v2]
test/test_models.py::test_detection_model[cuda-fcos_resnet50_fpn]
test/test_models.py::test_detection_model[cuda-retinanet_resnet50_fpn]
test/test_models.py::test_detection_model[cuda-retinanet_resnet50_fpn_v2]
test/test_models.py::test_detection_model[cuda-ssd300_vgg16]
  C:\Users\circleci\project\test\test_models.py:812: RuntimeWarning: The output of test_detection_model could only be partially validated. This is likely due to unit-test flakiness, but you may want to do additional manual checks if you made significant changes to the codebase.
    warnings.warn(msg, RuntimeWarning)
```
We plan to fix such flakiness issues this half. @YosuaMichael is looking into revamping TorchVision's testing infra to: 1) reduce costs, 2) speed up execution, and 3) reduce flakiness.
OK, did a bit more digging: 11.3 and 11.6 on Windows produce identical results for regular inference, but slightly different ones when autocast is enabled: https://github.com/pytorch/vision/blob/2f32fe82f21017c56ac8c09aa7388d0d87dca574/test/test_models.py#L799-L803
And for CUDA-11.3, `(output[0]["scores"] - expected[0]["scores"]).abs().max()` is 0.0104, while for 11.6 it's 0.0195, i.e. it feels like it could either be declared as flaky, or we can just increase the tolerance for the failure on Windows.
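For context, a rough sketch of that comparison (illustrative only: the model/input setup and the `expected` reference are placeholders for what the test suite actually loads from its stored prediction files):

```python
# Illustrative sketch of the autocast comparison quoted above; the reference
# predictions ("expected") and file path are placeholders, not the real expect files.
import torch
import torchvision

torch.manual_seed(0)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None).cuda().eval()
x = [torch.rand(3, 300, 300, device="cuda")]

with torch.cuda.amp.autocast(), torch.no_grad():
    output = model(x)

expected = torch.load("expected_fasterrcnn_resnet50_fpn.pt")  # placeholder path

# The quantity quoted above: ~0.0104 on CUDA 11.3 vs ~0.0195 on CUDA 11.6.
max_abs_diff = (output[0]["scores"] - expected[0]["scores"]).abs().max()
print(max_abs_diff.item())

# "Increasing the tolerance" would amount to loosening a check like this one:
torch.testing.assert_close(output[0]["scores"], expected[0]["scores"],
                           rtol=0, atol=0.02)
```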
Checking if something like that will do: https://github.com/pytorch/vision/pull/6601
@malfet Thanks, Nikita, for checking the output on Windows hardware. Increasing the tolerance is a reasonable remedy. I wonder if we should escalate this to our NVIDIA colleagues to investigate the underlying issue in the kernels that causes this. There could be an underlying bug affecting mixed-precision ops. cc @ptrblck
🐛 Describe the bug
After the removal of CUDA 11.3 and the setting of 11.6 as the default, the `fasterrcnn_resnet50_fpn` tests started failing across all Python versions. The failure occurs only on Windows; the Linux tests pass fine.
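To reproduce, the affected tests can be run on their own from a torchvision checkout, e.g. via a sketch like the following (assumes pytest and the test dependencies are installed; the `-k` expression simply filters by name):

```python
# Sketch: run only the failing CUDA detection-model tests from a torchvision checkout.
import pytest

pytest.main([
    "test/test_models.py::test_detection_model",
    "-v",
    "-k", "cuda and fasterrcnn_resnet50_fpn",
])
```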
Versions
TorchVision latest main branch
cc @atalman @malfet @ptrblck