microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

NotImplementedError: Could not run 'aten::empty.memory_format' with arguments from the 'SparsePrivateUse1' backend. #414

Open · Dreace opened this issue 1 year ago

Dreace commented 1 year ago

I encountered this error when trying to run Whisper (https://github.com/openai/whisper) using torch-directml. Sample code:

import torch_directml
import whisper

dml = torch_directml.device()
whisper.load_model("tiny.en", dml)

Full error message:

Traceback (most recent call last):
  File "<path>\test.py", line 8, in <module>
    c, m =  whisper.load_model("t.pt", dml)
  File "<path>\venv\lib\site-packages\whisper\__init__.py", line 154, in load_model
    return model.to(device)
  File "<path>\venv\lib\site-packages\torch\nn\modules\module.py", line 990, in to
    return self._apply(convert)
  File "<path>\venv\lib\site-packages\torch\nn\modules\module.py", line 688, in _apply
    self._buffers[key] = fn(buf)
  File "<path>\venv\lib\site-packages\torch\nn\modules\module.py", line 988, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Could not run 'aten::empty.memory_format' with arguments from the 'SparsePrivateUse1' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty.memory_format' is only available for these backends: [CPU, Meta, PrivateUse1, QuantizedCPU, QuantizedMeta, MkldnnCPU, SparseCPU, SparseMeta, SparseCsrCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterCPU.cpp:30798 [kernel]
Meta: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterMeta.cpp:26815 [kernel]
PrivateUse1: registered at D:\a\_work\1\s\pytorch-directml-plugin\torch_directml\csrc\generated\RegisterPrivateUse1.cpp:1758 [kernel]
QuantizedCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterQuantizedCPU.cpp:929 [kernel]
QuantizedMeta: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterQuantizedMeta.cpp:105 [kernel]
MkldnnCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterMkldnnCPU.cpp:492 [kernel]
SparseCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterSparseCPU.cpp:1261 [kernel]
SparseMeta: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterSparseMeta.cpp:249 [kernel]
SparseCsrCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterSparseCsrCPU.cpp:1030 [kernel]
BackendSelect: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterBackendSelect.cpp:726 [kernel]
Python: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:140 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\DynamicLayer.cpp:488 [backend fallback] 
Functionalize: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\FunctionalizeFallbackKernel.cpp:291 [backend fallback]
Named: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\NamedRegistrations.cpp:7 [backend fallback]
Conjugate: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\ConjugateFallback.cpp:22 [kernel]
Negative: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\NegateFallback.cpp:22 [kernel]
ZeroTensor: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\ZeroTensorFallback.cpp:90 [kernel]
ADInplaceOrView: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]        
AutogradCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradCUDA: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradHIP: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradXLA: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradMPS: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradIPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradXPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradHPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradVE: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradLazy: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradMeta: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]
AutogradPrivateUse1: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]  
AutogradPrivateUse2: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]  
AutogradPrivateUse3: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel]  
AutogradNestedTensor: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:16899 [autograd kernel] 
Tracer: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\TraceType_2.cpp:16890 [kernel]
AutocastCPU: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\autocast_mode.cpp:482 [backend fallback]
AutocastCUDA: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\autocast_mode.cpp:324 [backend fallback]
FuncTorchBatched: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\LegacyBatchingRegistrations.cpp:743 [backend fallback]
FuncTorchVmapMode: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\BatchingRegistrations.cpp:1064 [backend fallback]
VmapMode: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\TensorWrapper.cpp:189 [backend fallback]
PythonTLSSnapshot: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:148 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\DynamicLayer.cpp:484 [backend fallback]
PythonDispatcher: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:144 [backend fallback]
smk2007 commented 1 year ago

I was able to load the model this way: [image]

Does that work for you?

Dreace commented 1 year ago

> I was able to load the model this way: [image]
>
> Does that work for you?

Still the same problem. My setup:

OS: Windows 10 22H2 (19045.2604)
GPU: AMD 6900XT

pip packages:

absl-py==1.4.0
accelerate==0.12.0
addict==2.4.0
aenum==3.1.11
aiofiles==23.1.0
aiohttp==3.8.3
aiosignal==1.3.1
altair==4.2.2
antlr4-python3-runtime==4.9.3
anyio==3.6.2
async-timeout==4.0.2
attrs==22.2.0
basicsr==1.4.2
bcrypt==4.0.1
beautifulsoup4==4.11.2
blendmodes==2022
boltons==21.0.0
Brotli==1.0.9
cachetools==5.3.0
certifi==2022.12.7
cffi==1.15.1
chardet==4.0.0
charset-normalizer==2.1.1
clean-fid==0.1.29
click==8.1.3
clip==0.2.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.0.7
cryptography==39.0.2
cycler==0.11.0
deprecation==2.1.0
einops==0.4.1
entrypoints==0.4
facexlib==0.2.5
fastapi==0.94.0
ffmpeg-python==0.2.0
ffmpy==0.3.0
filelock==3.9.0
filterpy==1.4.5
flatbuffers==23.1.21
font-roboto==0.0.1
fonts==0.0.3
fonttools==4.38.0
frozenlist==1.3.3
fsspec==2023.1.0
ftfy==6.1.1
future==0.18.3
gdown==4.6.0
gfpgan==1.3.8
gitdb==4.0.10
GitPython==3.1.30
google-auth==2.16.0
google-auth-oauthlib==0.4.6
gradio==3.13.0
grpcio==1.51.1
h11==0.12.0
httpcore==0.15.0
httpx==0.23.3
huggingface-hub==0.12.0
humanfriendly==10.0
idna==2.10
imageio==2.25.0
importlib-metadata==6.0.0
inflection==0.5.1
invisible-watermark==0.1.5
Jinja2==3.1.2
jsonmerge==1.8.0
jsonschema==4.17.3
kiwisolver==1.4.4
kornia==0.6.7
lark==1.1.2
linkify-it-py==1.0.3
llvmlite==0.39.1
lmdb==1.4.0
lpips==0.1.4
Markdown==3.4.1
markdown-it-py==2.1.0
MarkupSafe==2.1.2
matplotlib==3.6.3
mdit-py-plugins==0.3.3
mdurl==0.1.2
more-itertools==9.1.0
mpmath==1.2.1
multidict==6.0.4
mutagen==1.46.0
networkx==3.0
numba==0.56.4
numpy==1.23.3
oauthlib==3.2.2
omegaconf==2.2.3
onnx==1.13.0
onnxruntime==1.13.1
open-clip-torch @ git+https://github.com/mlfoundations/open_clip.git@bb6e834e9c70d9c27d0dc3ecedeebeaeb1ffad6b
openai-whisper @ git+https://github.com/openai/whisper.git@6dea21fd7f7253bfe450f1e2512a0fe47ee2d258
opencv-contrib-python==4.7.0.68
opencv-python==4.7.0.68
orjson==3.8.6
packaging==23.0
pandas==1.5.3
paramiko==3.1.0
piexif==1.1.3
Pillow==9.4.0
protobuf==3.20.3
psutil==5.9.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pycryptodome==3.17
pycryptodomex==3.17
pydantic==1.10.4
pyDeprecate==0.3.2
pydub==0.25.1
PyNaCl==1.5.0
pyparsing==3.0.9
pyreadline3==3.4.1
pyrsistent==0.19.3
PySocks==1.7.1
python-dateutil==2.8.2
python-multipart==0.0.4
pytorch-lightning==1.7.6
pytz==2022.7.1
PyWavelets==1.4.1
PyYAML==6.0
realesrgan==0.3.0
regex==2022.10.31
requests==2.28.2
requests-oauthlib==1.3.1
resize-right==0.0.2
rfc3986==1.5.0
rsa==4.9
safetensors==0.2.7
scikit-image==0.19.2
scipy==1.10.0
sentencepiece==0.1.97
six==1.16.0
smmap==5.0.0
sniffio==1.3.0
soundfile==0.12.1
soupsieve==2.3.2.post1
starlette==0.26.0.post1
sympy==1.11.1
tb-nightly==2.12.0a20230209
tensorboard==2.12.0
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tifffile==2023.2.3
tiktoken==0.3.1
timm==0.6.7
tokenizers==0.13.2
toolz==0.12.0
torch==1.13.1
torch-directml==0.1.13.1.dev230301
torchaudio==0.13.1
torchdiffeq==0.2.3
torchmetrics==0.11.1
torchsde==0.2.5
torchvision==0.14.1
tqdm==4.64.1
trampoline==0.1.2
transformers==4.25.1
typing_extensions==4.4.0
uc-micro-py==1.0.1
urllib3==1.26.14
uvicorn==0.20.0
wcwidth==0.2.6
websockets==10.4
Werkzeug==2.2.2
yapf==0.32.0
yarl==1.8.2
yt-dlp==2023.3.4
zipp==3.13.0
TechInterMezzo commented 1 year ago

Same problem for me. What is the difference between the PrivateUse1 and SparsePrivateUse1 backends? SparsePrivateUse1 is not in the list of supported backends for aten::empty.memory_format. My graphics card is an AMD Radeon RX 580. Is it missing features?
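(As background on the backend names: PyTorch's dispatcher picks a backend key from both the tensor's device and its layout, so the same device can map to two different backends. The card isn't missing features; torch-directml simply registers kernels only for the dense (strided) key. A minimal sketch of the same split on CPU, where both sides happen to be implemented:)

```python
import torch

a = torch.tensor([[0, 2.], [3, 0]])
print(a.layout)  # torch.strided -> dispatches to the CPU backend

b = a.to_sparse_coo()
print(b.layout)  # torch.sparse_coo -> dispatches to the SparseCPU backend

# On a torch-directml device the analogous pair is PrivateUse1 (dense)
# and SparsePrivateUse1 (sparse); only the dense side is implemented,
# hence the NotImplementedError above.
```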

yestolife commented 1 year ago

I also have the same problem.

chaydenfowler commented 1 year ago

DML+Whisper Workaround

Environment Details

```
OS: Windows 11 Home 22H2 (22621.1555)
GPU: AMD Radeon RX 6900 XT (Driver V: 31.0.14043.7000)
```

Package management done 100% by Pip, no Conda for the purposes of this test.

`pip list`

```
Package            Version
------------------ ---------------
blis               0.7.9
catalogue          1.0.2
certifi            2023.5.7
charset-normalizer 3.1.0
colorama           0.4.6
cymem              2.0.7
de-core-news-sm    2.3.0
dill               0.3.4
en-core-web-sm     2.3.1
ffmpeg-python      0.2.0
filelock           3.12.0
future             0.18.3
idna               3.4
importlib-metadata 6.6.0
iopath             0.1.10
Jinja2             3.1.2
llvmlite           0.40.0
MarkupSafe         2.1.2
more-itertools     9.1.0
mpmath             1.3.0
murmurhash         1.0.9
networkx           3.1
numba              0.57.0
numpy              1.24.3
openai-whisper     20230314
Pillow             9.5.0
pip                23.1.2
plac               1.1.3
portalocker        2.7.0
preshed            3.0.8
pywin32            306
regex              2023.5.5
requests           2.30.0
setuptools         56.0.0
spacy              2.3.5
srsly              1.0.6
sympy              1.11.1
thinc              7.4.6
tiktoken           0.3.1
torch              2.0.0+cpu
torch-directml     0.2.0.dev230426
torchaudio         2.0.1+cpu
torchdata          0.6.1
torchtext          0.14.1
torchvision        0.15.1+cpu
tqdm               4.65.0
typing_extensions  4.5.0
urllib3            2.0.2
wasabi             0.10.1
zipp               3.15.0
```

torch_directml appears to not like sparse matrices.
The simplest workaround is to convert models to dense after loading them onto the CPU.
I couldn't figure out a way to do this at a global level and had to modify torch code directly.

If you have a better, more generalised workaround, please post.

Workaround

In .venv\Lib\site-packages\torch\nn\modules\module.py, modify the inner convert function of Module.to, starting at line 1139:

def convert(t):
    if convert_to_format is not None and t.dim() in (4, 5):
        return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
                    non_blocking, memory_format=convert_to_format)
    # Original line:
    # return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
    # Workaround: densify the tensor before moving it to the device.
    return t.to_dense().to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

This solution probably isn't very efficient, but it gets the job done. Processing time results are further down in this post.
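A possibly less invasive variant of the same idea (a rough, untested sketch, not from the original post) is to densify any sparse parameters or buffers on the CPU copy of the model before calling .to(device), so torch itself stays unmodified. It assumes the offending sparse tensors live in the model's parameters/buffers (in recent whisper builds the sparse tensor appears to be the alignment_heads buffer, but the loop below doesn't rely on that):

```python
import torch
import torch_directml
import whisper

dml = torch_directml.device()
model = whisper.load_model("tiny.en")  # initial load stays on CPU (no CUDA here)

# Walk every submodule and replace sparse buffers/parameters with dense
# copies, so the later .to(dml) only ever moves strided tensors.
for module in model.modules():
    for name, buf in list(module.named_buffers(recurse=False)):
        if buf.is_sparse:
            module._buffers[name] = buf.to_dense()
    for name, param in list(module.named_parameters(recurse=False)):
        if param.is_sparse:
            module._parameters[name] = torch.nn.Parameter(
                param.to_dense(), requires_grad=param.requires_grad
            )

model = model.to(dml)
```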

Analysis

Minimum Reproducible Error

import torch
import torch_directml

dml = torch_directml.device()

# Create a sparse tensor on CPU
a = torch.tensor([[0, 2.], [3, 0]]).to_sparse_coo()
# Try to move it to DML
a.to(dml)

This produces the original error described in this thread: NotImplementedError: Could not run 'aten::empty.memory_format' with arguments from the 'SparsePrivateUse1' backend.
Running .to(dml).to_sparse_coo() produces the same result.
Fixing this error is the key thing I'm hoping gets resolved.

Full Output

```
(.venv) C:\Users\Cabbage\Documents\Projects\Kenku-pip-only\Kenku>c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/.venv/Scripts/python.exe "c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/kenku/junk_test copy.py"
Traceback (most recent call last):
  File "c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/kenku/junk_test copy.py", line 9, in <module>
    a.to(dml)
NotImplementedError: Could not run 'aten::empty.memory_format' with arguments from the 'SparsePrivateUse1' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty.memory_format' is only available for these backends: [CPU, Meta, PrivateUse1, QuantizedCPU, QuantizedMeta, MkldnnCPU, SparseCPU, SparseMeta, SparseCsrCPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterCPU.cpp:31034 [kernel]
Meta: registered at /dev/null:219 [kernel]
PrivateUse1: registered at D:\a\_work\1\s\pytorch-directml-plugin\torch_directml\csrc\generated\RegisterPrivateUse1.cpp:1848 [kernel]
QuantizedCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterQuantizedCPU.cpp:929 [kernel]
QuantizedMeta: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterQuantizedMeta.cpp:105 [kernel]
MkldnnCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterMkldnnCPU.cpp:507 [kernel]
SparseCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterSparseCPU.cpp:1379 [kernel]
SparseMeta: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterSparseMeta.cpp:249 [kernel]
SparseCsrCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterSparseCsrCPU.cpp:1128 [kernel]
BackendSelect: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\build\aten\src\ATen\RegisterBackendSelect.cpp:726 [kernel]
Python: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:144 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\DynamicLayer.cpp:491 [backend fallback]
Functionalize: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\FunctionalizeFallbackKernel.cpp:280 [backend fallback]
Named: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\NamedRegistrations.cpp:7 [backend fallback]
Conjugate: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\ConjugateFallback.cpp:21 [kernel]
Negative: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\NegateFallback.cpp:23 [kernel]
ZeroTensor: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\ZeroTensorFallback.cpp:90 [kernel]
ADInplaceOrView: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:63 [backend fallback]
AutogradOther: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradCPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradCUDA: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradHIP: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradXLA: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradMPS: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradIPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradXPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradHPU: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradVE: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradLazy: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradMeta: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradMTIA: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradPrivateUse1: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradPrivateUse2: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradPrivateUse3: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
AutogradNestedTensor: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\VariableType_2.cpp:17476 [autograd kernel]
Tracer: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\generated\TraceType_2.cpp:16726 [kernel]
AutocastCPU: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\autocast_mode.cpp:487 [backend fallback]
AutocastCUDA: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\autocast_mode.cpp:354 [backend fallback]
FuncTorchBatched: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\LegacyBatchingRegistrations.cpp:815 [backend fallback]
FuncTorchVmapMode: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\LegacyBatchingRegistrations.cpp:1073 [backend fallback]
VmapMode: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\TensorWrapper.cpp:210 [backend fallback]
PythonTLSSnapshot: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:152 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\functorch\DynamicLayer.cpp:487 [backend fallback]
PythonDispatcher: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:148 [backend fallback]
```

Sparse Tensor Tests

Interestingly, if we try to use any of the other sparse tensor conversion methods, we get different results than with .to_sparse_coo().
We do get a mention of The operator 'aten::to_sparse_csr' is not currently supported on the DML backend and will fall back to run on the CPU.
Given that .to_sparse_coo() doesn't fall back, it seems to be treated differently, or is meant to be implemented and isn't? Unsure.
DML isn't the only backend without sparse support right now (Add aten::empty.memory_format for SparseMPS #87886); however, there are plenty of mentions of sparse in the DirectML repo, so maybe it's a WIP.

Test Sparse Tensor Methods Code

```python
import torch
import torch_directml

dml = torch_directml.device()

# Confirm DML working properly
tensor1 = torch.tensor([1]).to(dml)
tensor2 = torch.tensor([2]).to(dml)
dml_algebra = tensor1 + tensor2
print(dml_algebra.item())

a = torch.tensor([[0, 2.], [3, 0]]).to(dml)

for fun in [
    # a.to_sparse_coo,
    a.to_sparse_csr,
    a.to_sparse_csc,
]:
    print(fun.__name__)
    try:
        fun()
        print('success')
    except Exception as e:
        print(e)
    print()

for fun in [
    a.to_sparse_bsr,
    a.to_sparse_bsc,
]:
    print(fun.__name__)
    try:
        fun(blocksize=(2, 2))  # Not proper blocksize, no clue what it should be.
        print('success')
    except Exception as e:
        print(e)
    print()
```

Test Sparse Tensor Methods Output

```
(.venv) C:\Users\Cabbage\Documents\Projects\Kenku-pip-only\Kenku>c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/.venv/Scripts/python.exe c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/kenku/junk_test.py
3
to_sparse_csr
c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/kenku/junk_test.py:20: UserWarning: The operator 'aten::to_sparse_csr' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a\_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)
  fun()
c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/kenku/junk_test.py:20: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\SparseCsrTensorImpl.cpp:56.)
  fun()
Could not run 'new_compressed_tensor' from the 'privateuseone:0' device.)

to_sparse_csc
Could not run 'new_compressed_tensor' from the 'privateuseone:0' device.)

to_sparse_bsr
Could not run 'new_compressed_tensor' from the 'privateuseone:0' device.)

to_sparse_bsc
Could not run 'new_compressed_tensor' from the 'privateuseone:0' device.)
```

Dense Model CPU Fallback

When running whisper with the dense-only tensors, we still get CPU fallback: The operator 'aten::repeat_interleave.Tensor' is not currently supported on the DML backend and will fall back to run on the CPU.
This is fine, but again, what's strange to me is that sparse tensors aren't triggering the CPU fallback.

Performance Comparison

We get good performance!
You might have concerns about the performance cost of converting the model with .to_dense().
As far as CPU times go, the comparison is approximately equal.
I didn't do any memory analysis; with the large-v2 model usually requiring ~10 GB of VRAM, the dense large-v2 may have higher memory requirements.
This is likely not a 100% fair test, but it gets the gist.

Processing times in seconds, run on a ~1 min long dialogue audio file.

| Model | CPU 13600k | CPU Dense Only | DML RX 6900XT Dense Only | CUDA RTX 3080 |
| --- | --- | --- | --- | --- |
| tiny.en | 2.81 | 2.74 | 1.84 | 1.01 |
| base.en | 4.08 | 4.03 | 2.46 | 1.32 |
| small.en | 16.12 | 9.60 | 3.47 | 2.11 |
| medium.en | 22.65 | 22.63 | 7.58 | 3.54 |
| large-v2 | 52.55 | 52.70 | 14.11 | 9.02 |
Comparison Code

```python
import torch
import torch_directml
import whisper
import time
from pprint import pprint

dml = torch_directml.device()
cpu = torch.device('cpu')

# AND rerun code with Whisper dense-only change.
devices = [
    cpu,
    dml,
]
models = [
    "tiny.en",
    "base.en",
    "small.en",
    "medium.en",
    "large-v2",
]

audio_file = r"C:\Users\Cabbage\Documents\Projects\Kenku-pip-only\Kenku\Session_34_mono_tiny.wav"
audio = whisper.load_audio(audio_file)

time_results = {}
for device in devices:
    print("DEVICE: ", device)
    for model_name in models:
        print("MODEL: ", model_name)
        model = whisper.load_model(model_name).to(device)
        t0 = time.time()
        result = model.transcribe(audio, fp16=False)
        t1 = time.time()
        total_time = t1 - t0
        print("RESULT: ", device, model_name, total_time)
        time_results[(device.type, model_name)] = total_time

pprint(time_results)
```

whisper_test2_dense_results.txt

```
(.venv) C:\Users\Cabbage\Documents\Projects\Kenku-pip-only\Kenku>c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/.venv/Scripts/python.exe c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/kenku/whisper_test2.py
c:\Users\Cabbage\Documents\Projects\Kenku-pip-only\Kenku\.venv\lib\site-packages\whisper\timing.py:58: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def backtrace(trace: np.ndarray):
DEVICE:  cpu
MODEL:  tiny.en
RESULT:  cpu tiny.en 2.741666078567505
MODEL:  base.en
RESULT:  cpu base.en 4.02697491645813
MODEL:  small.en
RESULT:  cpu small.en 9.596404552459717
MODEL:  medium.en
RESULT:  cpu medium.en 22.62943172454834
MODEL:  large-v2
RESULT:  cpu large-v2 52.70190501213074
DEVICE:  privateuseone:0
MODEL:  tiny.en
c:\Users\Cabbage\Documents\Projects\Kenku-pip-only\Kenku\.venv\lib\site-packages\whisper\decoding.py:720: UserWarning: The operator 'aten::repeat_interleave.Tensor' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a\_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)
  audio_features = audio_features.repeat_interleave(self.n_group, dim=0)
RESULT:  privateuseone:0 tiny.en 1.8420062065124512
MODEL:  base.en
RESULT:  privateuseone:0 base.en 2.459657669067383
MODEL:  small.en
RESULT:  privateuseone:0 small.en 3.4719090461730957
MODEL:  medium.en
RESULT:  privateuseone:0 medium.en 7.5803542137146
MODEL:  large-v2
RESULT:  privateuseone:0 large-v2 14.107894659042358
{('cpu', 'base.en'): 4.02697491645813,
 ('cpu', 'large-v2'): 52.70190501213074,
 ('cpu', 'medium.en'): 22.62943172454834,
 ('cpu', 'small.en'): 9.596404552459717,
 ('cpu', 'tiny.en'): 2.741666078567505,
 ('privateuseone', 'base.en'): 2.459657669067383,
 ('privateuseone', 'large-v2'): 14.107894659042358,
 ('privateuseone', 'medium.en'): 7.5803542137146,
 ('privateuseone', 'small.en'): 3.4719090461730957,
 ('privateuseone', 'tiny.en'): 1.8420062065124512}
```

whisper_test2_sparse_results.txt

```
(.venv) C:\Users\Cabbage\Documents\Projects\Kenku-pip-only\Kenku>c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/.venv/Scripts/python.exe c:/Users/Cabbage/Documents/Projects/Kenku-pip-only/Kenku/kenku/whisper_test2.py
c:\Users\Cabbage\Documents\Projects\Kenku-pip-only\Kenku\.venv\lib\site-packages\whisper\timing.py:58: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def backtrace(trace: np.ndarray):
DEVICE:  cpu
MODEL:  tiny.en
RESULT:  cpu tiny.en 2.8103208541870117
MODEL:  base.en
RESULT:  cpu base.en 4.076110363006592
MODEL:  small.en
RESULT:  cpu small.en 16.11950922012329
MODEL:  medium.en
RESULT:  cpu medium.en 22.653390884399414
MODEL:  large-v2
RESULT:  cpu large-v2 52.545135736465454
{('cpu', 'base.en'): 4.076110363006592,
 ('cpu', 'large-v2'): 52.545135736465454,
 ('cpu', 'medium.en'): 22.653390884399414,
 ('cpu', 'small.en'): 16.11950922012329,
 ('cpu', 'tiny.en'): 2.8103208541870117}
```

cuda.log

```
DEVICE:  cpu
MODEL:  tiny.en
RESULT:  cpu tiny.en 4.1929521560668945
MODEL:  base.en
RESULT:  cpu base.en 6.866551637649536
MODEL:  small.en
RESULT:  cpu small.en 26.07543396949768
MODEL:  medium.en
RESULT:  cpu medium.en 45.59091901779175
MODEL:  large-v2
RESULT:  cpu large-v2 114.16571044921875
DEVICE:  cuda:0
MODEL:  tiny.en
RESULT:  cuda:0 tiny.en 1.0142285823822021
MODEL:  base.en
RESULT:  cuda:0 base.en 1.3242998123168945
MODEL:  small.en
RESULT:  cuda:0 small.en 2.1132583618164062
MODEL:  medium.en
RESULT:  cuda:0 medium.en 3.5388004779815674
MODEL:  large-v2
RESULT:  cuda:0 large-v2 9.024125337600708
```