pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

`fasterrcnn_resnet50_fpn` Windows GPU tests failing on CUDA 11.6 #6589

Open · datumbox opened this issue 2 years ago

datumbox commented 2 years ago

🐛 Describe the bug

After the removal of CUDA 11.3 and the switch to 11.6 as the default, the tests for `fasterrcnn_resnet50_fpn` started failing across all Python versions:

Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_models.py", line 775, in check_out
    _assert_expected(output, model_name, prec=prec)
  File "C:\Users\circleci\project\test\test_models.py", line 117, in _assert_expected
    torch.testing.assert_close(output, expected, rtol=rtol, atol=atol, check_dtype=False, check_device=False)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1342, in assert_close
    assert_equal(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 61 / 80 (76.2%)
Greatest absolute difference: 182.7935028076172 at index (9, 1) (up to 0.01 allowed)
Greatest relative difference: inf at index (1, 0) (up to 0.01 allowed)

The failure occurred for item [0]['boxes']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_models.py", line 803, in test_detection_model
    full_validation &= check_out(out)
  File "C:\Users\circleci\project\test\test_models.py", line 783, in check_out
    torch.testing.assert_close(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1342, in assert_close
    assert_equal(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 20 (5.0%)
Greatest absolute difference: 0.019545435905456543 at index (19,) (up to 0.01 allowed)
Greatest relative difference: 0.026511636142229424 at index (19,) (up to 0.01 allowed)

The failure occurs only on Windows. The Linux tests pass fine.

Versions

TorchVision latest main branch

cc @atalman @malfet @ptrblck

malfet commented 2 years ago

Hmm, can it be that nvjpeg from 11.6 returns garbage?

datumbox commented 2 years ago

@malfet I don't think it is related to nvjpeg. The data fed to the model are random tensors generated with a fixed seed, so there is no I/O op in this test. I wonder if something in CUDA 11.6 could be affecting one of the kernels in Core?
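
For reference, this setup can be reproduced outside the test harness; a minimal sketch, assuming a CUDA-enabled nightly build of torch and torchvision (the exact seed and input sizes used by the test suite may differ):

```python
import torch
import torchvision

# Fixed seed -> the "random" input tensors are fully deterministic,
# so no image decoding (nvjpeg) is involved at any point.
torch.manual_seed(42)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
model = model.eval().cuda()

# Detection models take a list of 3xHxW float tensors in [0, 1].
images = [torch.rand(3, 300, 300, device="cuda")]

with torch.no_grad():
    out = model(images)

print(out[0]["boxes"].shape, out[0]["scores"].shape)
```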

ngimel commented 2 years ago

Can you get a kineto trace of the kernels run during the test with 11.3 and 11.6? Then we can compare them and see what changes. @malfet is the cudnn version the same for the 11.3 and 11.6 builds?
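
For reference, such a trace can be collected with `torch.profiler` (which uses kineto under the hood); a rough sketch, with placeholder model/input setup and output file name:

```python
import torch
import torchvision
from torch.profiler import ProfilerActivity, profile

torch.manual_seed(42)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None).eval().cuda()
images = [torch.rand(3, 300, 300, device="cuda")]

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(images)

# Dump a chrome://tracing / Perfetto-compatible trace; repeat on the
# 11.3 and 11.6 builds and diff the kernel names and launch counts.
prof.export_chrome_trace("fasterrcnn_trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```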

malfet commented 2 years ago

@ngimel they should be the same; let's see if I can reproduce this failure using nightly builds of torch and vision.

malfet commented 2 years ago

Ok, I did a little bit of digging:

(c:\Users\circleci\project\env) C:\Users\circleci>python -c "import torch;print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
1.13.0.dev20220914+cu113 11.3 8302


But the test itself is skipped when running on CUDA 11.3 on Windows, as seen [here](https://app.circleci.com/pipelines/github/pytorch/vision/20135/workflows/4bd14931-962b-4344-a465-f74ffe7b51e8/jobs/1636941/steps):

++ echo CUDA_VERSION is 11.3
...
test/test_models.py::test_detection_model[cuda-fasterrcnn_resnet50_fpn] SKIPPED [ 79%]
test/test_models.py::test_detection_model[cuda-fasterrcnn_resnet50_fpn_v2] SKIPPED [ 79%]

ngimel commented 2 years ago

Interesting, so not a regression then?

datumbox commented 2 years ago

OK, I know this can be confusing, but the tests are marked as skipped because we couldn't fully validate their output. If we are only able to partially validate the results (due to the unstable sort in postprocessing), instead of marking the test as "SUCCESS" we mark it as "SKIPPED", because part of it couldn't be executed. If any of the less strict checks fail, then we mark it as "FAIL". So the execution was not really skipped on CUDA 11.3, it was just partially validated. On CUDA 11.6, on the other hand, the run is marked as failed because we can say with certainty that it doesn't match the expected result.

You can verify this by looking further up on the log:

test/test_models.py::test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_320_fpn]
test/test_models.py::test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_fpn]
test/test_models.py::test_detection_model[cuda-fasterrcnn_resnet50_fpn]
test/test_models.py::test_detection_model[cuda-fasterrcnn_resnet50_fpn_v2]
test/test_models.py::test_detection_model[cuda-fcos_resnet50_fpn]
test/test_models.py::test_detection_model[cuda-retinanet_resnet50_fpn]
test/test_models.py::test_detection_model[cuda-retinanet_resnet50_fpn_v2]
test/test_models.py::test_detection_model[cuda-ssd300_vgg16]
  C:\Users\circleci\project\test\test_models.py:812: RuntimeWarning: The output of test_detection_model could only be partially validated. This is likely due to unit-test flakiness, but you may want to do additional manual checks if you made significant changes to the codebase.
    warnings.warn(msg, RuntimeWarning)
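
For illustration, the skip-vs-fail flow described above looks roughly like this (a hypothetical sketch, not the actual `test_models.py` code; `prec` and the relaxed comparison are placeholders):

```python
import warnings

import pytest
import torch


def check_detection_output(output, expected, prec=0.01):
    try:
        # Strict check: every field must match within tolerance -> SUCCESS.
        torch.testing.assert_close(output, expected, rtol=prec, atol=prec,
                                   check_dtype=False, check_device=False)
        return
    except AssertionError:
        pass

    # Relaxed fallback: compare only a subset that is stable under the
    # unstable postprocessing sort. If this also fails -> genuine FAIL.
    torch.testing.assert_close(output[0]["scores"], expected[0]["scores"],
                               rtol=prec, atol=prec,
                               check_dtype=False, check_device=False)

    # Passed only the relaxed check -> report SKIPPED, not SUCCESS.
    warnings.warn("The output could only be partially validated.", RuntimeWarning)
    pytest.skip("output partially validated")
```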

We plan to fix such flakiness issues this half. @YosuaMichael is looking into revamping TorchVision's testing infra to: 1) reduce costs, 2) speed up execution, and 3) reduce flakiness.

malfet commented 2 years ago

Ok, did a bit more digging: 11.3 and 11.6 on Windows produce identical results for regular inference, but slightly different ones when autocast is enabled: https://github.com/pytorch/vision/blob/2f32fe82f21017c56ac8c09aa7388d0d87dca574/test/test_models.py#L799-L803

And for CUDA 11.3, `(output[0]["scores"] - expected[0]["scores"]).abs().max()` is 0.0104, while for 11.6 it's 0.0195; i.e., it feels like this could either be declared flaky, or we could just increase the tolerance for this failure on Windows.
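
For reference, the effect of the tolerance can be reproduced standalone with `torch.testing.assert_close`; a small sketch using made-up numbers of the same magnitude as the reported differences:

```python
import torch

expected = torch.tensor([0.737])
observed = expected + 0.0195  # roughly the drift seen on the CUDA 11.6 Windows job

# Fails at the current tolerance of 0.01 ...
try:
    torch.testing.assert_close(observed, expected, rtol=0.01, atol=0.01)
except AssertionError as err:
    print("fails at rtol=atol=0.01:", err)

# ... but passes once the tolerance is relaxed.
torch.testing.assert_close(observed, expected, rtol=0.03, atol=0.03)
```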

malfet commented 2 years ago

Checking if something like that will do: https://github.com/pytorch/vision/pull/6601

datumbox commented 2 years ago

@malfet Thanks Nikita for checking the output on Windows hardware. Increasing the tolerance is a reasonable remedy. I wonder if we should escalate this to our Nvidia colleagues to investigate the underlying kernel issue that causes this; there could be a bug affecting mixed-precision ops. cc @ptrblck
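
As a quick way to narrow this down before escalating, one could compare autocast against full-precision output of the same model on the same build; a rough sketch (illustrative setup, not the actual test harness):

```python
import torch
import torchvision

torch.manual_seed(42)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None).eval().cuda()
images = [torch.rand(3, 300, 300, device="cuda")]

with torch.no_grad():
    out_fp32 = model(images)
    with torch.cuda.amp.autocast():
        out_amp = model(images)

# The number of detections can differ between the two runs, so compare
# only the overlapping prefix of the (sorted) scores.
n = min(out_fp32[0]["scores"].numel(), out_amp[0]["scores"].numel())
diff = (out_amp[0]["scores"][:n].float() - out_fp32[0]["scores"][:n]).abs().max()
print(f"max |autocast - fp32| score difference: {diff.item():.4f}")
```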

datumbox commented 2 years ago

Similar breakage on `maskrcnn_resnet50_fpn_v2`, but on `unittest_linux_gpu_py3.7` this time.