Open · Milor123 opened this issue 1 year ago
I also encountered this issue when testing llama.
Environment:
Python: 3.10.11
PyTorch: 2.0
torch-directml: 0.2.0.dev230426
I downgraded to PyTorch 1.13 and torch-directml 0.1.13 as documented. The error still occurred.
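For anyone reproducing the downgrade, something along these lines should work (the torchvision pin and the exact older torch-directml dev tag are assumptions on my part; check PyPI for the matching build):

pip install torch==1.13.1 torchvision==0.14.1
pip install "torch-directml==0.1.13.*"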
The same happens when trying to train a YOLOv5 model using DirectML on an AMD GPU (Radeon 6750 XT, 12 GB). Windows 10, torch-directml 0.2.0.dev230426.
Command:
python train.py --data VisDrone.yaml --epochs 100 --cfg yolov5n.yaml --batch-size 32
I can see that it starts to utilize my GPU and completes the first epoch, but when moving from epoch 0 to epoch 1 (i.e. when validation starts) it fails with this error:
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/99 0G 0.1299 0.1296 0.04918 431 640: 100%|██████████| 405/405 [03:24<00:00, 1.98it/s]
Class Images Instances P R mAP50 mAP50-95: 0%| | 0/18 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 649, in <module>
main(opt)
File "train.py", line 538, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 361, in train
results, maps, _ = validate.run(data_dict,
File "C:\ML\yolov5\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\ML\yolov5\val.py", line 210, in run
preds, train_out = model(im) if compute_loss else (model(im, augment=augment), None)
File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ML\yolov5\models\yolo.py", line 210, in forward
return self._forward_once(x, profile, visualize) # single-scale inference, train
File "C:\ML\yolov5\models\yolo.py", line 122, in _forward_once
x = m(x) # run
File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ML\yolov5\models\common.py", line 56, in forward
return self.act(self.bn(self.conv(x)))
File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\modules\batchnorm.py", line 171, in forward
return F.batch_norm(
File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\functional.py", line 2450, in batch_norm
return torch.batch_norm(
RuntimeError: Cannot set version_counter for inference tensor
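Reading the traceback, the failure happens inside validate.run(), which executes under the decorator shown in torch\utils\_contextlib.py; on recent PyTorch this is torch.inference_mode() (via YOLOv5's smart_inference_mode helper). Tensors created in inference mode have no version counter, and the DirectML batch_norm path apparently tries to set one. A possible thing to try, shown as an illustrative sketch below (not a confirmed fix, and run_validation_without_inference_mode is just a made-up helper name), is to run the forward pass under torch.no_grad() instead, so the tensors reaching batch_norm are ordinary tensors:

import torch

# Illustrative sketch only (an assumption, not a confirmed fix): run the
# validation forward pass under torch.no_grad() instead of
# torch.inference_mode(), so the tensors passed to batch_norm are ordinary
# tensors that still have a version counter.
def run_validation_without_inference_mode(model, images):
    model.eval()
    with torch.no_grad():  # instead of torch.inference_mode()
        return model(images)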
Did you manage to solve this error?
Hi all, thank you for submitting this issue. While I can't provide a timeline for resolution at the moment, please know that your feedback is valuable to us. We will follow up once we can review this issue.
I am trying to use https://github.com/suno-ai/bark with DirectML on Windows 11, changing the .to(device) calls to .to(dml) according to the gpu-pytorch-windows docs, in generation.py under the bark folder and under build\lib\bark\, respectively. When I run the project I can see that the GPU starts up correctly, but then I get the error below. I am on Python 3.9.16.
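The device setup I am following from the torch-directml docs looks roughly like this (the Linear model and tensor below are just illustrative, not the actual bark code):

import torch
import torch_directml

# "dml" is the handle passed to .to(dml) in place of the original .to(device).
dml = torch_directml.device()

# Illustrative example of the kind of change made in generation.py:
model = torch.nn.Linear(4, 4).to(dml)  # was: .to(device)
x = torch.randn(2, 4).to(dml)
y = model(x)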
Could you help me solve this bug? I don't know what else I could do.
RuntimeError: Cannot set version_counter for inference tensor
Console Output: