manmani3 opened this issue 3 years ago
@manmani3 Could you please test your code with the latest torchvision v0.8.1? Also, could you please provide a code snippet that reproduces the issue you are seeing? Thanks.
I have the same problem. I'm using a server node, so some functionality is not available (such as internet access). I'm using torch 1.7.1 and torchvision 0.8.1. I thought that creating the environment on a system without a GPU might be causing the error, so to be safe I reinstalled both torch and torchvision on the node.
My task is basic yolov5 training. I'm following this tutorial and starting it with the command python train.py:
https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data
Here is the .sh script that I submit to the server:
#!/bin/bash
#SBATCH --gres=gpu:v100:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32000M
#SBATCH --time=00:20:00
#SBATCH --output=%N-%j.out
module load python/3.7.7
source yolov5/bin/activate
pip install --force-reinstall torch==1.7.1 --no-index
pip install --force-reinstall torchvision==0.8.1 --no-index
python train.py --img 640 --batch 16 --epochs 5 --data data.yaml --weights yolov5s.pt
Note that reinstalling torch and torchvision is not strictly necessary, but I wanted to be sure I was on version 0.8.1. I also tried with Python 3.8.2, but it made no difference.
Here is the error message:
Traceback (most recent call last):
  File "train.py", line 512, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 345, in train
    log_imgs=opt.log_imgs if wandb else 0)
  File "/project/6005615/ayfer1/yolov5/test.py", line 120, in test
    output = non_max_suppression(inf_out, conf_thres=conf_thres, iou_thres=iou_thres, labels=lb)
  File "/project/6005615/ayfer1/yolov5/utils/general.py", line 337, in non_max_suppression
    i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
  File "/project/6005615/ayfer1/yolov5/yolov5/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 42, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
RuntimeError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. 'torchvision::nms' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].
Here is my full log:
Ignoring pip: markers 'python_version < "3"' don't match your environment
Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo/avx2, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
Collecting torch==1.7.1
Collecting numpy (from torch==1.7.1)
Collecting typing-extensions (from torch==1.7.1)
ERROR: torchaudio 0.6.0 has requirement torch==1.6.0, but you'll have torch 1.7.1 which is incompatible.
Installing collected packages: numpy, typing-extensions, torch
Found existing installation: numpy 1.19.4
  Uninstalling numpy-1.19.4: Successfully uninstalled numpy-1.19.4
Found existing installation: typing-extensions 3.7.4.3
  Uninstalling typing-extensions-3.7.4.3: Successfully uninstalled typing-extensions-3.7.4.3
Found existing installation: torch 1.7.1
  Uninstalling torch-1.7.1: Successfully uninstalled torch-1.7.1
Successfully installed numpy-1.19.4 torch-1.7.1 typing-extensions-3.7.4.3
Ignoring pip: markers 'python_version < "3"' don't match your environment
Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo/avx2, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/gentoo/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
Collecting torchvision==0.8.1
Collecting numpy (from torchvision==0.8.1)
Collecting torch (from torchvision==0.8.1)
Collecting pillow-simd>=4.1.1 (from torchvision==0.8.1)
Collecting typing-extensions (from torch->torchvision==0.8.1)
ERROR: torchaudio 0.6.0 has requirement torch==1.6.0, but you'll have torch 1.7.1 which is incompatible.
Installing collected packages: numpy, typing-extensions, torch, pillow-simd, torchvision
Found existing installation: numpy 1.19.4
  Uninstalling numpy-1.19.4: Successfully uninstalled numpy-1.19.4
Found existing installation: typing-extensions 3.7.4.3
  Uninstalling typing-extensions-3.7.4.3: Successfully uninstalled typing-extensions-3.7.4.3
Found existing installation: torch 1.7.1
  Uninstalling torch-1.7.1: Successfully uninstalled torch-1.7.1
Found existing installation: Pillow-SIMD 7.0.0.post3
  Uninstalling Pillow-SIMD-7.0.0.post3: Successfully uninstalled Pillow-SIMD-7.0.0.post3
Found existing installation: torchvision 0.8.1
  Uninstalling torchvision-0.8.1: Successfully uninstalled torchvision-0.8.1
Successfully installed numpy-1.19.4 pillow-simd-7.0.0.post3 torch-1.7.1 torchvision-0.8.1 typing-extensions-3.7.4.3
Using torch 1.7.1 CUDA:0 (Tesla V100-SXM2-32GB, 32510MB)
Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data.yaml', device='', epochs=5, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_artifacts=False, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp22', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=2
                 from  n    params  module                                arguments
  0                -1  1      3520  models.common.Focus                   [3, 32, 3]
  1                -1  1     18560  models.common.Conv                    [32, 64, 3, 2]
  2                -1  1     19904  models.common.BottleneckCSP           [64, 64, 1]
  3                -1  1     73984  models.common.Conv                    [64, 128, 3, 2]
  4                -1  1    161152  models.common.BottleneckCSP           [128, 128, 3]
  5                -1  1    295424  models.common.Conv                    [128, 256, 3, 2]
  6                -1  1    641792  models.common.BottleneckCSP           [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                    [256, 512, 3, 2]
  8                -1  1    656896  models.common.SPP                     [512, 512, [5, 9, 13]]
  9                -1  1   1248768  models.common.BottleneckCSP           [512, 512, 1, False]
 10                -1  1    131584  models.common.Conv                    [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                  [1]
 13                -1  1    378624  models.common.BottleneckCSP           [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                    [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                  [1]
 17                -1  1     95104  models.common.BottleneckCSP           [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                    [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                  [1]
 20                -1  1    313088  models.common.BottleneckCSP           [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                    [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                  [1]
 23                -1  1   1248768  models.common.BottleneckCSP           [512, 512, 1, False]
 24      [17, 20, 23]  1     18879  models.yolo.Detect                    [2, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7257791 parameters, 7257791 gradients, 16.8 GFLOPS
Transferred 364/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning 'lol/labels/train.cache' for images and labels... 100 found, 0 missing, 0 empty, 0 corrupted: 100%|██████████| 100/100 [00:00<?, ?it/s]
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp22
Starting training for 5 epochs...
Epoch gpu_mem box obj cls total targets img_size
     0/4     5.15G    0.1253   0.07825   0.02924    0.2327        34       640: 100%|██████████| 7/7 [00:02<00:00, 2.37it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   0%|          | 0/7 [00:00<?, ?it/s]
Plotting labels...
Analyzing anchors... anchors/target = 6.52, Best Possible Recall (BPR) = 1.0000
Traceback (most recent call last):
  File "train.py", line 512, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 345, in train
    log_imgs=opt.log_imgs if wandb else 0)
  File "/project/6005615/ayfer1/yolov5/test.py", line 120, in test
    output = non_max_suppression(inf_out, conf_thres=conf_thres, iou_thres=iou_thres, labels=lb)
  File "/project/6005615/ayfer1/yolov5/utils/general.py", line 337, in non_max_suppression
    i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
  File "/project/6005615/ayfer1/yolov5/yolov5/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 42, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
RuntimeError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. 'torchvision::nms' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].
CPU: registered at /home/lemc2220/wheels/torchvision/tmp.26574/python-3.7/vision-0.8.1/torchvision/csrc/vision.cpp:59 [kernel]
BackendSelect: fallthrough registered at /pytorch/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at /pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:35 [backend fallback]
AutogradCPU: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:39 [backend fallback]
AutogradCUDA: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:43 [backend fallback]
AutogradXLA: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:47 [backend fallback]
Tracer: fallthrough registered at /pytorch/torch/csrc/jit/frontend/tracer.cpp:967 [backend fallback]
Autocast: fallthrough registered at /pytorch/aten/src/ATen/autocast_mode.cpp:254 [backend fallback]
Batched: registered at /pytorch/aten/src/ATen/BatchingRegistrations.cpp:511 [backend fallback]
VmapMode: fallthrough registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Do you have any idea how to correct this problem?
Edit: I tried on Colab with the exact same script and it works. Looking at pip list in both environments, one of the differences is:
torchvision 0.8.1 for my server
torchvision 0.8.1+cu101 for google colab
Edit 2: On the node the torchvision version is actually 0.8.1+cu101, so the problem is probably not there. I was able to train my model using the yolov5 Docker image, so I still don't understand what is wrong.
I got the same problem with CUDA 11.1 and torch 1.7.0 when doing inference with RetinaNet, while training ran without any issue.
I spent a lot of time searching the internet for a solution, but failed.
Eventually, my problem was solved by torch.from_numpy(boxes.detach().cpu().numpy()), and the same for scores.
It's ugly, but it works.
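For reference, a minimal sketch of that workaround (the wrapper function nms_cpu_fallback and the step that moves the kept indices back to the original device are illustrative additions, not part of the original comment):

import torch
import torchvision

def nms_cpu_fallback(boxes, scores, iou_threshold):
    # Detach and round-trip through numpy so the tensors end up on the CPU,
    # where the torchvision::nms kernel is always registered.
    boxes_cpu = torch.from_numpy(boxes.detach().cpu().numpy())
    scores_cpu = torch.from_numpy(scores.detach().cpu().numpy())
    keep = torchvision.ops.nms(boxes_cpu, scores_cpu, iou_threshold)
    # Move the kept indices back to whatever device the boxes came from.
    return keep.to(boxes.device)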
@basicskywards do you have a minimum reproducible example that we can try out?
@basicskywards It doesn't work for me.
Hi, I am facing the same error. Did anybody solve it?
Are you building from latest master? It works fine for me when I use the release version, but I see this error when using torch nightly and building torchvision from master.
I'm also facing the same issue, using torch==1.8.1 and torchvision==0.9.1. I guess I'll play around with different, older versions of each to see if that helps.
The problem is related to a CUDA version mismatch. Check your CUDA version and make sure you installed the matching PyTorch build (https://pytorch.org/get-started/locally/).
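If it helps, a quick way to check the match from Python (the version strings in the comments are only examples):

import torch
import torchvision

print(torch.__version__)         # e.g. 1.8.1+cu111 -- the suffix shows the CUDA build of the wheel
print(torchvision.__version__)   # a +cpu suffix, or a CUDA suffix different from torch's, is a red flag
print(torch.version.cuda)        # CUDA version the installed torch was built against
print(torch.cuda.is_available())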
My CUDA version was correctly matched with torch and torchvision. By downgrading from CUDA 11.1 + torch 1.8.1 + torchvision 0.9.1 to CUDA 11.0 + torch 1.7.1 + torchvision 0.8.2, I was able to resolve the error.
Getting the same issue here, with self-built pytorch + torchvision. On CUDA 11.3. Any workarounds?
Hi,
I think the issue might be that PyTorch has dropped support for some versions of CUDA, and there might have been a conflict there, so you are not updating to the right torchvision build.
I'd recommend double-checking that you don't have multiple versions of PyTorch / torchvision installed in your environment, and that you are indeed getting the right versions.
If possible, I would recommend creating a new conda environment and running the installation process from scratch.
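One way to confirm which installation actually gets imported (a small illustrative check, in case a stale copy in another site-packages directory is shadowing the one you just installed):

import torch
import torchvision

# Print the versions together with the on-disk locations Python resolved them from.
print(torch.__version__, torch.__file__)
print(torchvision.__version__, torchvision.__file__)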
I only have a single libtorch and torchvision (C++ API, nothing from pip or conda on this machine), which I compiled myself from master using the same CUDA version. They are all placed in the same path.
I believe that, by default, building torchvision from source does not build it with CUDA support.
The fix for me was to build torchvision with the -DWITH_CUDA=on flag, as mentioned in the build instructions.
Installation from source:
cd vision
mkdir build && cd build
cmake -DTorch_DIR=/path/to/Torch/ -DWITH_CUDA=on ..
make
make install
Additional information available in these two issues I created: https://github.com/zhiqwang/yolov5-rt-stack/issues/132, https://github.com/pytorch/vision/issues/4175
@matthewygf interesting. We should build with CUDA by default when building via python setup.py install. I believe @xsacha was facing this issue in Python?
I was not using Python at all. I'm using libtorch + torchvision compiled with the same CUDA version, and I built torchvision as described by @mattpopovich, following the build instructions.
Oh OK, so adding the -DWITH_CUDA=on flag should indeed fix the issue.
Can we close the issue then?
I used that flag when I compiled (as per build instructions) and watched it build the CUDA modules. Then I ended up here with this issue.
That workaround (the CPU fallback posted by @basicskywards above) works even for an AMD GPU (torch 1.8.0+rocm4.0.1, torchvision 0.9.0). Thank you @basicskywards.
Is that really a proper solution? It just does the work on the CPU instead. You can also do boxes.to(torch.device("cpu")) instead of converting to numpy and back.
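In other words, if the CPU fallback is acceptable, something along these lines avoids the numpy round-trip (still only a sketch, and it still runs NMS on the CPU):

import torch
import torchvision

def nms_on_cpu(boxes, scores, iou_threshold):
    # Same CPU fallback, but without converting to numpy and back.
    keep = torchvision.ops.nms(boxes.to(torch.device("cpu")),
                               scores.to(torch.device("cpu")),
                               iou_threshold)
    return keep.to(boxes.device)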
Did you come to a solution for the yolov5 training error described above?
Regarding the downgrade to CUDA 11.0 + torch 1.7.1 + torchvision 0.8.2 mentioned above: is the same possible with CUDA 10.x?
I solved this issue: install a CUDA version suitable for your PyTorch version, then uninstall pytorch and torchvision and install them again. Sorry, my English is not good. Good luck.
I came across the same problem and found it was because the torchvision I had installed was the CPU version. I reinstalled it with "pip install torchvision==0.8.0 --force-reinstall" and that solved the problem.
How do you check whether it was the CPU or CUDA version? I got this problem specifically for torchvision.ops.nms on a Docker image with forced CUDA support (the FORCE_CUDA=1 env var). However, when I tried another torchvision example that also used the CUDA device but didn't use torchvision.ops.nms, it succeeded.
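A quick smoke test for that (assuming a CUDA-capable machine): many torchvision components are plain PyTorch code and will happily run on the GPU, but nms dispatches into the compiled C++ extension, so calling it directly on CUDA tensors tells you whether the extension was built with CUDA kernels. The snippet below is illustrative, not taken from the comment above:

import torch
import torchvision

print(torchvision.__version__)  # a +cpu suffix usually indicates a CPU-only build

# Tiny boxes/scores just to exercise the compiled torchvision::nms CUDA kernel.
boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0]], device="cuda")
scores = torch.tensor([0.9, 0.8], device="cuda")
try:
    print(torchvision.ops.nms(boxes, scores, 0.5))
except RuntimeError as err:
    print("torchvision appears to be built without CUDA support:", err)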
I had the same problem with torch==1.7, torchvision==0.8, and torchaudio==0.7 on CUDA 10.2. Removing them and reinstalling torch==1.7.1, torchvision==0.8.2, and torchaudio==0.7.2 instead with pip solved it for me. For picking the right versions the following link was useful: https://pytorch.org/get-started/previous-versions/ Hope it helps, God bless!
Note that the CUDA version bundled with PyTorch cannot be higher than the highest CUDA version supported by your NVIDIA driver (you can see this in the NVIDIA control panel). Updating the driver solves this problem.
The issue for me was torchvision: I had first installed it in my virtual environment from the yolov7 requirements.txt. I solved this by uninstalling torchvision and then reinstalling it with the PyTorch install command that includes the CUDA-specific URL.
Solved my problem with YOLOv8. Thanks.
The torch 1.7.1 / torchvision 0.8.2 / torchaudio 0.7.2 reinstall above also solved my problem with detectron2 0.5 on CUDA 10.2.
Thanks mate, the same reinstall worked for me too (I was using the YOLOP2 code).
❓ Questions and Help
Please note that this issue tracker is not a help form and this issue will be closed.
I'm a beginner in ML and am trying to use a solution based on PyTorch (called detectron2). Whenever the solution runs inference on an image, I get the error below.
RuntimeError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. 'torchvision::nms' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].
Honestly, I don't understand this error and couldn't find anything about it on Google. Is there anybody who knows how to handle this?
Info: I installed CUDA v11.1 from https://developer.nvidia.com/cuda-downloads
torch version: 1.7.0
torchvision version: 0.8.0
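For anyone hitting this with detectron2 specifically: as a stopgap, and assuming the standard detectron2 config / model-zoo API, forcing inference onto the CPU sidesteps the missing CUDA nms kernel at the cost of speed. The model config below is just an example, not the reporter's setup:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# Example model: a RetinaNet config from the detectron2 model zoo.
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_50_FPN_1x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_50_FPN_1x.yaml")
cfg.MODEL.DEVICE = "cpu"  # run on CPU so nms never dispatches to the missing CUDA kernel
predictor = DefaultPredictor(cfg)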