pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

How to solve this error? RuntimeError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend #3058

Open manmani3 opened 3 years ago

manmani3 commented 3 years ago

❓ Questions and Help

Please note that this issue tracker is not a help form and this issue will be closed.

I'm a beginner in ML, trying to use a PyTorch-based solution called detectron2. Whenever it runs inference on an image, I get the error below.

RuntimeError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. 'torchvision::nms' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].

I've never seen this error before and couldn't find anything about it on Google. Does anybody know how to handle this?

Info: I installed CUDA v11.1 from https://developer.nvidia.com/cuda-downloads. torch version: 1.7.0, torchvision version: 0.8.0.
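A quick way to narrow this down is to check which CUDA version torch and torchvision were actually built against; this error usually means the installed torchvision was built without CUDA support, or against a different CUDA than torch. A minimal diagnostic sketch (not from the original report):

import torch
import torchvision

print(torch.__version__)          # CUDA pip wheels carry a suffix, e.g. "1.7.0+cu110"
print(torchvision.__version__)    # a bare version with no "+cuXXX" suffix often means a CPU-only build
print(torch.version.cuda)         # the CUDA version torch was compiled with (None for CPU-only builds)
print(torch.cuda.is_available())  # whether torch itself can see the GPU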

vfdev-5 commented 3 years ago

@manmani3 Could you please test your code with the latest torchvision, v0.8.1? Also, could you please provide a code snippet that reproduces the issue? Thanks.

AymericFerreira commented 3 years ago

I have the same problem. I'm using a server node, so some functionality is not available (such as internet access). I'm using torch 1.7.1 and torchvision 0.8.1. I thought that creating the environment on a system without a GPU might be causing the error, so, to be safe, I reinstalled both torch and torchvision on the node.

My task is basic yolov5 training; I'm following this tutorial and launching the training with python train.py:

https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data

Here is the .sh script that I submit to the server:

#!/bin/bash
#SBATCH --gres=gpu:v100:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32000M
#SBATCH --time=00:20:00
#SBATCH --output=%N-%j.out

module load python/3.7.7
source yolov5/bin/activate

pip install --force-reinstall torch==1.7.1 --no-index
pip install --force-reinstall torchvision==0.8.1 --no-index

python train.py --img 640 --batch 16 --epochs 5 --data data.yaml --weights yolov5s.pt

Note that reinstalling torch and torchvision is not necessary, but I wanted to be sure I was on version 0.8.1. I also tried with Python 3.8.2, but saw no difference.

Here is the error message:

Traceback (most recent call last):
  File "train.py", line 512, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 345, in train
    log_imgs=opt.log_imgs if wandb else 0)
  File "/project/6005615/ayfer1/yolov5/test.py", line 120, in test
    output = non_max_suppression(inf_out, conf_thres=conf_thres, iou_thres=iou_thres, labels=lb)
  File "/project/6005615/ayfer1/yolov5/utils/general.py", line 337, in non_max_suppression
    i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
  File "/project/6005615/ayfer1/yolov5/yolov5/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 42, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
RuntimeError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. 'torchvision::nms' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].

Here is my full log :

Ignoring pip: markers 'python_version < "3"' don't match your environment
Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo/avx2, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
Collecting torch==1.7.1
Collecting numpy (from torch==1.7.1)
Collecting typing-extensions (from torch==1.7.1)
ERROR: torchaudio 0.6.0 has requirement torch==1.6.0, but you'll have torch 1.7.1 which is incompatible.
Installing collected packages: numpy, typing-extensions, torch
Found existing installation: numpy 1.19.4
Uninstalling numpy-1.19.4:
Successfully uninstalled numpy-1.19.4
Found existing installation: typing-extensions 3.7.4.3
Uninstalling typing-extensions-3.7.4.3:
Successfully uninstalled typing-extensions-3.7.4.3
Found existing installation: torch 1.7.1
Uninstalling torch-1.7.1:
Successfully uninstalled torch-1.7.1
Successfully installed numpy-1.19.4 torch-1.7.1 typing-extensions-3.7.4.3
Ignoring pip: markers 'python_version < "3"' don't match your environment
Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo/avx2, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
Collecting torchvision==0.8.1
Collecting numpy (from torchvision==0.8.1)
Collecting torch (from torchvision==0.8.1)
Collecting pillow-simd>=4.1.1 (from torchvision==0.8.1)
Collecting typing-extensions (from torch->torchvision==0.8.1)
ERROR: torchaudio 0.6.0 has requirement torch==1.6.0, but you'll have torch 1.7.1 which is incompatible.
Installing collected packages: numpy, typing-extensions, torch, pillow-simd, torchvision
Found existing installation: numpy 1.19.4
Uninstalling numpy-1.19.4:
Successfully uninstalled numpy-1.19.4
Found existing installation: typing-extensions 3.7.4.3
Uninstalling typing-extensions-3.7.4.3:
Successfully uninstalled typing-extensions-3.7.4.3
Found existing installation: torch 1.7.1
Uninstalling torch-1.7.1:
Successfully uninstalled torch-1.7.1
Found existing installation: Pillow-SIMD 7.0.0.post3
Uninstalling Pillow-SIMD-7.0.0.post3:
Successfully uninstalled Pillow-SIMD-7.0.0.post3
Found existing installation: torchvision 0.8.1
Uninstalling torchvision-0.8.1:
Successfully uninstalled torchvision-0.8.1
Successfully installed numpy-1.19.4 pillow-simd-7.0.0.post3 torch-1.7.1 torchvision-0.8.1 typing-extensions-3.7.4.3
Using torch 1.7.1 CUDA:0 (Tesla V100-SXM2-32GB, 32510MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data.yaml', device='', epochs=5, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_artifacts=False, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp22', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=2

             from  n    params  module                                  arguments

  0                -1  1      3520  models.common.Focus                     [3, 32, 3]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  1    641792  models.common.BottleneckCSP             [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]
  9                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    378624  models.common.BottleneckCSP             [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     95104  models.common.BottleneckCSP             [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    313088  models.common.BottleneckCSP             [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]
 24      [17, 20, 23]  1     18879  models.yolo.Detect                      [2, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7257791 parameters, 7257791 gradients, 16.8 GFLOPS

Transferred 364/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning 'lol/labels/train.cache' for images and labels... 100 found, 0 missing, 0 empty, 0 corrupted: 100%|██████████| 100/100 [00:00<?, ?it/s]
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp22
Starting training for 5 epochs...

 Epoch   gpu_mem       box       obj       cls     total   targets  img_size

     0/4     5.15G    0.1253   0.07825   0.02924    0.2327        34       640: 100%|██████████| 7/7 [00:02<00:00, 2.37it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   0%|          | 0/7 [00:00<?, ?it/s]
Plotting labels...

Analyzing anchors... anchors/target = 6.52, Best Possible Recall (BPR) = 1.0000
Traceback (most recent call last):
  File "train.py", line 512, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 345, in train
    log_imgs=opt.log_imgs if wandb else 0)
  File "/project/6005615/ayfer1/yolov5/test.py", line 120, in test
    output = non_max_suppression(inf_out, conf_thres=conf_thres, iou_thres=iou_thres, labels=lb)
  File "/project/6005615/ayfer1/yolov5/utils/general.py", line 337, in non_max_suppression
    i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
  File "/project/6005615/ayfer1/yolov5/yolov5/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 42, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
RuntimeError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. 'torchvision::nms' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].

CPU: registered at /home/lemc2220/wheels/torchvision/tmp.26574/python-3.7/vision-0.8.1/torchvision/csrc/vision.cpp:59 [kernel]
BackendSelect: fallthrough registered at /pytorch/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at /pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:35 [backend fallback]
AutogradCPU: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:39 [backend fallback]
AutogradCUDA: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:43 [backend fallback]
AutogradXLA: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:47 [backend fallback]
Tracer: fallthrough registered at /pytorch/torch/csrc/jit/frontend/tracer.cpp:967 [backend fallback]
Autocast: fallthrough registered at /pytorch/aten/src/ATen/autocast_mode.cpp:254 [backend fallback]
Batched: registered at /pytorch/aten/src/ATen/BatchingRegistrations.cpp:511 [backend fallback]
VmapMode: fallthrough registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

Do you have an idea how to correct this problem?

Edit: I tried on Colab with the exact same script and it works. Looking at pip list in both environments, one of the differences is:

torchvision 0.8.1 for my server

torchvision 0.8.1+cu101 for Google Colab

Edit 2: On the node the torchvision version is actually 0.8.1+cu101, so the problem is probably not there. I was able to train my model using the yolov5 Docker image, so I still don't understand what is wrong.

basicskywards commented 3 years ago

I got the same problem with CUDA 11.1 and torch version 1.7.0 when doing inference on RetinaNet, while training ran without any issue.

I spent a lot of time looking for a solution on the internet, but failed.

Eventually, my problem was solved by replacing boxes with torch.from_numpy(boxes.detach().cpu().numpy()), and the same for scores.

It's ugly, but it works.
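The reason this works is that the NumPy round trip moves the tensors to the CPU, so the dispatcher picks the CPU kernel (which is registered) instead of the missing CUDA one. A minimal sketch of the workaround as described, assuming boxes, scores, and iou_thres as in the yolov5 snippet above:

import torch
import torchvision

# Force NMS onto the CPU backend by round-tripping the inputs through NumPy.
# 'boxes', 'scores', and 'iou_thres' are assumed to come from the detection code above.
boxes_cpu = torch.from_numpy(boxes.detach().cpu().numpy())
scores_cpu = torch.from_numpy(scores.detach().cpu().numpy())
i = torchvision.ops.nms(boxes_cpu, scores_cpu, iou_thres)  # the CPU kernel is registered, so this runs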

fmassa commented 3 years ago

@basicskywards do you have a minimum reproducible example that we can try out?

mheriyanto commented 3 years ago

I got the same problem with CUDA 11.1 and torch version 1.7.0 when doing inference on RetinaNet, while training ran without any issue.

I spent a lot of time looking for a solution on the internet, but failed.

Eventually, my problem was solved by replacing boxes with torch.from_numpy(boxes.detach().cpu().numpy()), and the same for scores.

It's ugly, but it works.

@basicskywards It doesn't work for me.

deema1999 commented 3 years ago

Hi, I am facing the same error. Did anybody solve it?

prabhat00155 commented 3 years ago

Hi, I am facing the same error. Did anybody solve it?

Are you building from latest master? It works fine for me when I use the release version, but I see this error when using torch nightly and building torchvision from master.

heyitsguay commented 3 years ago

I'm also facing the same issue, using torch==1.8.1 and torchvision==0.9.1. I guess I'll play around with different, older versions of each to see if that helps.

prabhat00155 commented 3 years ago

The problem is related to a CUDA version mismatch. Check your CUDA version and see if you installed the matching PyTorch version (https://pytorch.org/get-started/locally/).

heyitsguay commented 3 years ago

My CUDA version was correctly matched with torch and torchvision. By downgrading from CUDA 11.1 + torch 1.8.1 + torchvision 0.9.1 to CUDA 11.0 + torch 1.7.1 + torchvision 0.8.2, I was able to resolve the error.

xsacha commented 3 years ago

Getting the same issue here, with self-built pytorch + torchvision. On CUDA 11.3. Any workarounds?

fmassa commented 3 years ago

Hi,

I think the issue might be that PyTorch has dropped support for some versions of CUDA; there might have been a conflict there, and you are not updating to the right torchvision build.

I'd recommend double-checking that you don't have multiple versions of PyTorch / torchvision installed in your environment, and that you are indeed getting the right versions.

If possible, I would recommend creating a new conda environment and running the installation process from scratch.

xsacha commented 3 years ago

I only have a single libtorch and torchvision (nothing from pip or conda on this machine), which I compiled myself from master using the same CUDA version. They are all placed in the same path.

mattpopovich commented 3 years ago

I believe that, by default, building torchvision from source does not build it with CUDA support.

The fix for me was to build torchvision with the -DWITH_CUDA=on flag, as mentioned in the build instructions:

Installation From source:

cd vision
mkdir build && cd build
cmake -DTorch_DIR=/path/to/Torch/ -DWITH_CUDA=on ..
make
make install

Additional information available in these two issues I created: https://github.com/zhiqwang/yolov5-rt-stack/issues/132, https://github.com/pytorch/vision/issues/4175

fmassa commented 3 years ago

@mattpopovich interesting. We should build with CUDA by default when building via python setup.py install. I believe @xsacha was facing this issue in Python?

xsacha commented 3 years ago

I was not using Python at all. I'm using libtorch + torchvision compiled with the same CUDA version. I built torchvision as described by @mattpopovich, since I followed the build instructions.

fmassa commented 3 years ago

Oh OK, so adding the -DWITH_CUDA=on flag should indeed fix the issue.

Can we close the issue then?

xsacha commented 3 years ago

I used that flag when I compiled (as per the build instructions) and watched it build the CUDA modules. Then I still ended up here with this issue.

MadanMl commented 3 years ago

I got the same problem with CUDA 11.1 and torch version 1.7.0 when doing inference on RetinaNet, while training ran without any issue.

I spent a lot of time looking for a solution on the internet, but failed.

Eventually, my problem was solved by replacing boxes with torch.from_numpy(boxes.detach().cpu().numpy()), and the same for scores.

It's ugly, but it works.

That works even for an AMD GPU (torch version: 1.8.0+rocm4.0.1, torchvision version: 0.9.0). Thank you @basicskywards!

xsacha commented 3 years ago

Is that really a proper solution? It just does the work on the CPU instead. You can also call boxes.to(torch.device("cpu")) instead of converting to NumPy and back.
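Following that suggestion, a slightly cleaner sketch of the same workaround: move the inputs with .cpu() (no NumPy round trip), then move the kept indices back to the original device. This is just a stopgap helper, not a torchvision API:

import torch
import torchvision

def nms_cpu_fallback(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    # Run NMS on the CPU backend, then return the kept indices on the input's device.
    keep = torchvision.ops.nms(boxes.cpu(), scores.cpu(), iou_threshold)
    return keep.to(boxes.device)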

mahdiabdollahpour commented 2 years ago

I have the same problem. I'm using a server node, so some functionality is not available (such as internet access). [...]

Did you come to a solution?

mahdiabdollahpour commented 2 years ago

My CUDA version was correctly matched with torch and torchvision. By downgrading from CUDA 11.1 + torch 1.8.1 + torchvision 0.9.1 to CUDA 11.0 + torch 1.7.1 + torchvision 0.8.2, I was able to resolve the error.

Is it possible with CUDA 10.x?

phucpha commented 2 years ago

I solved this issue: install a CUDA version suitable for your PyTorch version, then uninstall pytorch and torchvision and install them again. Sorry, my English is not good. Good luck!

lipond commented 2 years ago

I came across the same problem and found it was because the torchvision I had installed was the CPU-only version. I reinstalled it with pip install torchvision==0.8.0 --force-reinstall and that solved the problem.

thiagocrepaldi commented 2 years ago

I came across the same problem and found it was because the torchvision I had installed was the CPU-only version. I reinstalled it with pip install torchvision==0.8.0 --force-reinstall and that solved the problem.

How do you check whether it was the CPU or CUDA version? I got this problem specifically for torchvision.ops.nms on a Docker image (with the FORCE_CUDA=1 env var) built with forced CUDA support. However, when I tried another torchvision example that also used the CUDA device but didn't call torchvision.ops.nms, it succeeded.
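One way to check (a sketch, assuming a pip-installed build): CUDA wheels carry a "+cuXXX" suffix in their version string, and a tiny NMS call on CUDA tensors works as a smoke test for the compiled kernel:

import torch
import torchvision

print(torchvision.__version__)  # e.g. "0.8.1+cu101" for a CUDA wheel, plain "0.8.1" for CPU-only

boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0]], device="cuda")
scores = torch.tensor([0.9, 0.8], device="cuda")
# Raises the RuntimeError from this issue if torchvision's CUDA NMS kernel was not compiled in.
print(torchvision.ops.nms(boxes, scores, 0.5))

For a from-source build the version suffix may be absent even when CUDA support was compiled in, so the smoke test is the more reliable check.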

szigetif commented 1 year ago

I had the same problem with torch==1.7, torchvision==0.8, and torchaudio==0.7 on CUDA 10.2. Removing them and reinstalling torch==1.7.1, torchvision==0.8.2, and torchaudio==0.7.2 instead with pip solved it for me. For picking the right versions the following link was useful: https://pytorch.org/get-started/previous-versions/ Hope it helps, God bless!

weolix commented 1 year ago

Note that the CUDA version bundled with PyTorch cannot be higher than the highest version supported by your NVIDIA driver; see this entry in the NVIDIA Control Panel. Updating the driver solves this problem.

Haiderahandali commented 1 year ago

The issue for me was torchvision: I had first installed it in my virtual environment using the requirements.txt for YOLOv7. I solved this by uninstalling torchvision first, then reinstalling it with the PyTorch install command that includes the CUDA-specific URL.

geiche735 commented 1 year ago

The issue for me was torchvision: I had first installed it in my virtual environment using the requirements.txt for YOLOv7. I solved this by uninstalling torchvision first, then reinstalling it with the PyTorch install command that includes the CUDA-specific URL.

Solved my problem with YOLOv8. Thanks.

sun0717 commented 1 year ago

I had the same problem with torch==1.7, torchvision==0.8, and torchaudio==0.7 on CUDA 10.2. Removing them and reinstalling torch==1.7.1, torchvision==0.8.2, and torchaudio==0.7.2 instead with pip solved it for me. For picking the right versions the following link was useful: https://pytorch.org/get-started/previous-versions/ Hope it helps, God bless!

This solved my problem with detectron2 0.5 on CUDA 10.2.

igormahall commented 8 months ago

I had the same problem with torch==1.7, torchvision==0.8, and torchaudio==0.7 on CUDA 10.2. Removing them and reinstalling torch==1.7.1, torchvision==0.8.2, and torchaudio==0.7.2 instead with pip solved it for me. For picking the right versions the following link was useful: https://pytorch.org/get-started/previous-versions/ Hope it helps, God bless!

Thanks mate. Your solution worked for me. (I was using YOLOP2 code)