microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.
MIT License

RuntimeError: Cannot set version_counter for inference tensor AMD - w11 #450

Open Milor123 opened 1 year ago

Milor123 commented 1 year ago

I am trying to use https://github.com/suno-ai/bark with DirectML on Windows 11. Following the gpu-pytorch-windows docs, I changed .to(device) to .to(dml) in generation.py, both in the bark folder and in build\lib\bark\. When I run the project, I can see that the GPU starts correctly, but then I get the following error.
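
For reference, the device swap described above might look roughly like the sketch below. `torch_directml.device()` is the call used in the torch-directml documentation. The `pick_device` helper and its CPU fallback are hypothetical additions for illustration and are not part of bark.

```python
import importlib.util


def pick_device():
    """Return the default DirectML device, or "cpu" if torch-directml is missing.

    Hypothetical helper for illustration; bark does not ship this function.
    """
    if importlib.util.find_spec("torch_directml") is None:
        return "cpu"
    import torch_directml  # imported lazily so the script still runs without it
    return torch_directml.device()


dml = pick_device()
print(f"Using device: {dml}")
# In generation.py, calls such as model.to(device) would then become model.to(dml).
```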

I am in Python 3.9.16

Could you help me solve this bug? I don't know what else I can do. RuntimeError: Cannot set version_counter for inference tensor

Console Output:

python .\run.py
No GPU being used. Careful, inference might be very slow!
  0%|                                                                                                                                                | 0/100 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\NoeXVanitasXJunk\bark\run.py", line 13, in <module>
    audio_array = generate_audio(text_prompt)
  File "C:\Users\NoeXVanitasXJunk\bark\bark\api.py", line 107, in generate_audio
    semantic_tokens = text_to_semantic(
  File "C:\Users\NoeXVanitasXJunk\bark\bark\api.py", line 25, in text_to_semantic
    x_semantic = generate_text_semantic(
  File "C:\Users\NoeXVanitasXJunk\bark\bark\generation.py", line 460, in generate_text_semantic
    logits, kv_cache = model(
  File "C:\Users\NoeXVanitasXJunk\miniconda3\envs\tfdml_plugin\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\NoeXVanitasXJunk\bark\bark\model.py", line 208, in forward
    x, kv = block(x, past_kv=past_layer_kv, use_cache=use_cache)
  File "C:\Users\NoeXVanitasXJunk\miniconda3\envs\tfdml_plugin\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\NoeXVanitasXJunk\bark\bark\model.py", line 121, in forward
    attn_output, prev_kvs = self.attn(self.ln_1(x), past_kv=past_kv, use_cache=use_cache)
  File "C:\Users\NoeXVanitasXJunk\miniconda3\envs\tfdml_plugin\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\NoeXVanitasXJunk\bark\bark\model.py", line 50, in forward
    q, k ,v  = self.c_attn(x).split(self.n_embd, dim=2)
  File "C:\Users\NoeXVanitasXJunk\miniconda3\envs\tfdml_plugin\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\NoeXVanitasXJunk\miniconda3\envs\tfdml_plugin\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Cannot set version_counter for inference tensor
  0%|                                                                                                                                                | 0/100 [00:00<?, ?it/s]
foldl commented 1 year ago

I also encounter this issue when testing llama.

Environment:

Python: 3.10.11
PyTorch: 2.0
torch-directml: 0.2.0.dev230426

I downgraded to PyTorch 1.13 and torch-directml 0.1.13 as documented. The error occurred there, too.

marisvigulis commented 1 year ago

The same happens when trying to train a yolov5 model using DirectML on an AMD GPU (Radeon 6750XT 12GB). Windows 10, torch-directml-0.2.0.dev.230426. Command: python train.py --data VisDrone.yaml --epochs 100 --cfg yolov5n.yaml --batch-size 32

I can see that it starts to utilize my GPU and finishes the first epoch, but when switching from epoch 0 to epoch 1 it fails with this error:

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       0/99         0G     0.1299     0.1296    0.04918        431        640: 100%|██████████| 405/405 [03:24<00:00,  1.98it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95:   0%|          | 0/18 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 649, in <module>
    main(opt)
  File "train.py", line 538, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 361, in train
    results, maps, _ = validate.run(data_dict,
  File "C:\ML\yolov5\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\ML\yolov5\val.py", line 210, in run
    preds, train_out = model(im) if compute_loss else (model(im, augment=augment), None)
  File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ML\yolov5\models\yolo.py", line 210, in forward
    return self._forward_once(x, profile, visualize)  # single-scale inference, train
  File "C:\ML\yolov5\models\yolo.py", line 122, in _forward_once
    x = m(x)  # run
  File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ML\yolov5\models\common.py", line 56, in forward
    return self.act(self.bn(self.conv(x)))
  File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\modules\batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "C:\ML\yolov5\venv\lib\site-packages\torch\nn\functional.py", line 2450, in batch_norm
    return torch.batch_norm(
RuntimeError: Cannot set version_counter for inference tensor
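
Both tracebacks above end in the same message, which refers to an "inference tensor". In PyTorch, tensors created inside `torch.inference_mode()` are inference tensors, while tensors created under `torch.no_grad()` are not. If the calling code runs the model under `torch.inference_mode()` (the `decorate_context` frame above is consistent with a decorator of that kind, but that is an assumption), one thing that may be worth trying is swapping it for `torch.no_grad()`. This is an untested suggestion, not a fix confirmed in this thread. The CPU-only sketch below only shows the difference between the two contexts and does not use DirectML.

```python
import torch

# Tensors created under inference_mode are flagged as inference tensors.
with torch.inference_mode():
    a = torch.ones(2)

# Tensors created under no_grad are ordinary tensors with autograd disabled.
with torch.no_grad():
    b = torch.ones(2)

print(a.is_inference())  # True
print(b.is_inference())  # False
print(b.requires_grad)   # False
```
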
reaaer commented 1 year ago

I also encounter this issue when testing llama.

Environment:

Python: 3.10.11
PyTorch: 2.0
torch-directml: 0.2.0.dev230426

I downgraded to PyTorch 1.13 and torch-directml 0.1.13 as documented. The error occurred there, too.

Did you solve this error?

Adele101 commented 1 year ago

Hi all, thank you for submitting this issue. While I can't provide a timeline for resolution at the moment, please know that your feedback is valuable to us. We will follow up once we can review this issue.