pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Fix compile with nvjpeg on Windows CUDA 12 #8641

Closed atalman closed 2 weeks ago

atalman commented 2 weeks ago

Root cause of the issue

C:\actions-runner\_work\_temp\conda_environment_10772459803\lib\site-packages\torch\cuda\__init__.py:129: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org/ to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10\cuda\CUDAFunctions.cpp:108.)
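For reference, the `found version 11040` in the warning can be read as a CUDA version, assuming it follows CUDA's usual `1000 * major + 10 * minor` integer encoding. The sketch below decodes it and compares major versions only; the helper names are hypothetical, and the major-only comparison is a simplification of the real driver compatibility rules.

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_covers_major(driver_encoded: int, toolkit_major: int) -> bool:
    """Simplified check (assumption): the driver must be at least the toolkit's major version."""
    driver_major, _ = decode_cuda_version(driver_encoded)
    return driver_major >= toolkit_major


driver = 11040  # value reported in the warning above
print(decode_cuda_version(driver))      # (11, 4)
print(driver_covers_major(driver, 12))  # False
```

Under these assumptions, the runner's driver corresponds to CUDA 11.4, which is older than what a CUDA 12 build expects.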

Hence, for CUDA 12+ jobs we are seeing:

torch.cuda.is_available: False

As a result, it's failing the builder checks here: https://github.com/pytorch/builder/actions/runs/10776424717/job/29883192429

torchvision: 0.20.0.dev20240908+cu121
torch.cuda.is_available: True
torch.ops.image._jpeg_version() = 80
Is torchvision usable? True
German shepherd (cpu): 37.6%
Traceback (most recent call last):
  File "C:\actions-runner\_work\builder\builder\pytorch\builder\vision\test\smoke_test.py", line 113, in <module>
    main()
  File "C:\actions-runner\_work\builder\builder\pytorch\builder\vision\test\smoke_test.py", line 101, in main
    smoke_test_torchvision_decode_jpeg("cuda")
  File "C:\actions-runner\_work\builder\builder\pytorch\builder\vision\test\smoke_test.py", line 37, in smoke_test_torchvision_decode_jpeg
    img_jpg = decode_jpeg(img_jpg_data, device=device)
  File "C:\Jenkins\Miniconda3\envs\conda-env-10776424717\lib\site-packages\torchvision\io\image.py", line 223, in decode_jpeg
    return torch.ops.image.decode_jpegs_cuda([input], mode.value, device)[0]
  File "C:\Jenkins\Miniconda3\envs\conda-env-10776424717\lib\site-packages\torch\_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: decode_jpegs_cuda: torchvision not compiled with nvJPEG support

The driver update issue should not prevent us from compiling torchvision with full CUDA support; we can do that even on a CPU instance. Hence, when the FORCE_CUDA flag is set, we should try to include the nvjpeg module.
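The intended build-time decision can be sketched roughly as follows. This is a minimal illustration, not the actual torchvision `setup.py`: the function names and the `nvjpeg_header_found` parameter are hypothetical, and only the `FORCE_CUDA` flag comes from the description above.

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """Treat an environment variable as a boolean flag ("1" means enabled)."""
    return os.environ.get(name, default) == "1"


def should_build_nvjpeg(cuda_available: bool, force_cuda: bool, nvjpeg_header_found: bool) -> bool:
    """Decide whether to compile the nvjpeg module (hypothetical helper).

    A runtime CUDA check alone is too strict: it reports False when the
    driver is too old, even though the toolkit is installed and usable
    for compilation. FORCE_CUDA overrides that check.
    """
    cuda_requested = cuda_available or force_cuda
    return cuda_requested and nvjpeg_header_found


# Build machine with an outdated driver: the runtime check fails, FORCE_CUDA is set.
print(should_build_nvjpeg(cuda_available=False, force_cuda=True, nvjpeg_header_found=True))   # True
# Without FORCE_CUDA, the same machine would skip nvjpeg.
print(should_build_nvjpeg(cuda_available=False, force_cuda=False, nvjpeg_header_found=True))  # False
```

In a real build script, `cuda_available` would presumably come from `torch.cuda.is_available()` and `force_cuda` from something like `env_flag("FORCE_CUDA")`.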

As a follow-up, we should address the driver issue.

pytorch-bot[bot] commented 2 weeks ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8641

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: You can merge normally! (1 Unrelated Failure)

As of commit 1249d0b09db224e052c0c30ee7ff6e87d0209ee7 with merge base 00e7fa164bfdfd302f0b471c18ce2fd6bd1a50bc (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

* [Tests / unittests-windows (3.9, windows.g5.4xlarge.nvidia.gpu, cuda, 11.8) / windows-job](https://hud.pytorch.org/pr/pytorch/vision/8641#30005828227) ([gh](https://github.com/pytorch/vision/actions/runs/10814821079/job/30005828227)) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions[bot] commented 2 weeks ago

Hey @atalman!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py