pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

new backend privateuseone with "to" op #95207

Open heidongxianhua opened 1 year ago

heidongxianhua commented 1 year ago

🐛 Describe the bug

When I use a new backend with the PrivateUse1 key and implement the "to" op for that backend as shown below, it fails. I think it is an error with the "AutogradPrivateUse1" dispatch key, so I did some tests for the backend.

test_code

I added a to_dtype function based on test/cpp_extensions/open_registration_extension.cpp; see this commit: https://github.com/heidongxianhua/pytorch/commit/fdb57dac418ec849dfd7900b1b69e815840b06b5
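Roughly, the registration looks like the sketch below (a minimal sketch with a placeholder kernel named custom_to_dtype; the real code is in the linked commit):

#include <torch/extension.h>

// Hypothetical kernel matching the aten::to.dtype schema:
// to.dtype(Tensor(a) self, ScalarType dtype, bool non_blocking=False,
//          bool copy=False, MemoryFormat? memory_format=None) -> Tensor(a)
at::Tensor custom_to_dtype(const at::Tensor& self,
                           at::ScalarType dtype,
                           bool non_blocking,
                           bool copy,
                           c10::optional<at::MemoryFormat> memory_format) {
  // Placeholder body: a real backend would convert the tensor on its own
  // device; the actual implementation is in the linked commit.
  TORCH_CHECK(false, "not implemented in this sketch");
  return at::Tensor();  // unreachable; keeps the compiler happy
}

// Register the kernel for the PrivateUse1 backend key only.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("to.dtype", &custom_to_dtype);
}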

And when I run the test with python3 test_cpp_extensions_open_device_registration.py, it does not work; the error message is below. I checked that the to_dtype function has been registered for the PrivateUse1 backend.

Fail to import hypothesis in common_utils, tests are not derandomized
Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cpu/custom_device_extension...
Emitting ninja build file /root/.cache/torch_extensions/py38_cpu/custom_device_extension/build.ninja...
Building extension module custom_device_extension...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF open_registration_extension.o.d -DTORCH_EXTENSION_NAME=custom_device_extension -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/shibo/device_type/pytorch_shibo/test/cpp_extensions -isystem /root/anaconda3/envs/shibo2/lib/python3.8/site-packages/torch/include -isystem /root/anaconda3/envs/shibo2/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/shibo2/lib/python3.8/site-packages/torch/include/TH -isystem /root/anaconda3/envs/shibo2/lib/python3.8/site-packages/torch/include/THC -isystem /root/anaconda3/envs/shibo2/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -g -c /home/shibo/device_type/pytorch_shibo/test/cpp_extensions/open_registration_extension.cpp -o open_registration_extension.o
[2/2] c++ open_registration_extension.o -shared -L/root/anaconda3/envs/shibo2/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o custom_device_extension.so
Loading extension module custom_device_extension...
E
======================================================================
ERROR: test_open_device_registration (__main__.TestCppExtensionOpenRgistration)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_cpp_extensions_open_device_registration.py", line 78, in test_open_device_registration
    y_int32 = y.to(torch.int32)
NotImplementedError: Could not run 'aten::to.dtype' with arguments from the 'AutogradPrivateUse1' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::to.dtype' is only available for these backends: [CPU, CUDA, HIP, XLA, MPS, IPU, XPU, HPU, VE, Lazy, Meta, MTIA, PrivateUse1, PrivateUse2, PrivateUse3, FPGA, ORT, Vulkan, Metal, QuantizedCPU, QuantizedCUDA, QuantizedHIP, QuantizedXLA, QuantizedMPS, QuantizedIPU, QuantizedXPU, QuantizedHPU, QuantizedVE, QuantizedLazy, QuantizedMeta, QuantizedMTIA, QuantizedPrivateUse1, QuantizedPrivateUse2, QuantizedPrivateUse3, CustomRNGKeyId, MkldnnCPU, SparseCPU, SparseCUDA, SparseHIP, SparseXLA, SparseMPS, SparseIPU, SparseXPU, SparseHPU, SparseVE, SparseLazy, SparseMeta, SparseMTIA, SparsePrivateUse1, SparsePrivateUse2, SparsePrivateUse3, SparseCsrCPU, SparseCsrCUDA, NestedTensorCPU, NestedTensorCUDA, NestedTensorHIP, NestedTensorXLA, NestedTensorMPS, NestedTensorIPU, NestedTensorXPU, NestedTensorHPU, NestedTensorVE, NestedTensorLazy, NestedTensorMeta, NestedTensorMTIA, NestedTensorPrivateUse1, NestedTensorPrivateUse2, NestedTensorPrivateUse3, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher]

Undefined: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
CPU: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
CUDA: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
HIP: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
XLA: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
MPS: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
IPU: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
XPU: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
HPU: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
VE: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
Lazy: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
Meta: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
MTIA: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
PrivateUse1: registered at /home/shibo/device_type/pytorch_shibo/test/cpp_extensions/open_registration_extension.cpp:90 [kernel]
PrivateUse2: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
PrivateUse3: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
FPGA: registered at /home/shibo/device_type/pytorch_test/build/aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:7140 [math kernel]
............

Versions

Collecting environment information...
 PyTorch version: 2.0.0a0+git900db22
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 8.0.1  (based on LLVM 8.0.1)
CMake version: version 3.24.1
Libc version: glibc-2.27

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-76-generic-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

Versions of relevant libraries:
[pip3] numpy==1.24.2
[pip3] torch==2.0.0a0+git2f0b0c5
[pip3] torch-npu==2.0.0
[conda] numpy                     1.24.2                   pypi_0    pypi
[conda] torch                     2.0.0a0+git900db22          pypi_0    pypi
[conda] torch-npu                 2.0.0                    pypi_0    pypi
albanD commented 1 year ago

cc @bdhirsh

heidongxianhua commented 1 year ago

When I register to_dtype to the AutogradPrivateUse1 key instead, it works well; the code is here: https://github.com/pytorch/pytorch/compare/master...heidongxianhua:pytorch:dispatch?expand=1 But I don't think that is the expected solution. For the PrivateUse1 backend, we cannot know which ops should be registered to PrivateUse1 and which to AutogradPrivateUse1. Waiting for your reply.
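The workaround amounts to something like this sketch (again a minimal sketch; custom_to_dtype is the same placeholder kernel as above, and the real diff is in the linked comparison):

#include <torch/extension.h>

// Same hypothetical kernel as in the sketch above.
at::Tensor custom_to_dtype(const at::Tensor& self,
                           at::ScalarType dtype,
                           bool non_blocking,
                           bool copy,
                           c10::optional<at::MemoryFormat> memory_format);

// Workaround: registering the kernel under the autograd key for the custom
// backend lets dispatch find it, but it also bypasses autograd handling for
// the op entirely.
TORCH_LIBRARY_IMPL(aten, AutogradPrivateUse1, m) {
  m.impl("to.dtype", &custom_to_dtype);
}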

bdhirsh commented 1 year ago

Hey @heidongxianhua. This interaction is a bit annoying, but the tldr: instead of registering a kernel directly to aten::to(), can you register a kernel to _to_copy()?

Context: .to() is a bit special: x.to(...) will sometimes return a fresh tensor, and sometimes directly return x if the metadata matches and no copy is needed. Because of this "sometimes returns a copy" behavior, we don't have an autograd formula for .to(). What happens instead is that .to() decomposes into _to_copy() "above" autograd, and we have a derivative formula directly for _to_copy().

The way you can tell that this is happening is that you'll see that aten::to() is registered to the CompositeImplicitAutograd dispatch key in native_functions.yaml (or by running this code, and seeing that there is indeed an entry registered to the CompositeImplicitAutograd dispatch key: `python -c 'import torch; print(torch._C._dispatch_dump("aten::to.dtype"))'`).

As a backend implementor, if you aren't interested in writing your own autograd formulas, then in general you don't want or need to write kernels for operators with CompositeImplicitAutograd registrations in core. Instead, you can see what operators they desugar into (here's the decomp for .to()), and write implementations for those.
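A minimal sketch of what that would look like for a PrivateUse1 backend (kernel name and body here are placeholders, not the exact code from the test extension):

#include <torch/extension.h>

// Hypothetical kernel matching the aten::_to_copy schema:
// _to_copy(Tensor self, *, ScalarType? dtype=None, Layout? layout=None,
//          Device? device=None, bool? pin_memory=None, bool non_blocking=False,
//          MemoryFormat? memory_format=None) -> Tensor
at::Tensor custom_to_copy(const at::Tensor& self,
                          c10::optional<at::ScalarType> dtype,
                          c10::optional<at::Layout> layout,
                          c10::optional<at::Device> device,
                          c10::optional<bool> pin_memory,
                          bool non_blocking,
                          c10::optional<at::MemoryFormat> memory_format) {
  // Placeholder: a real backend would allocate the output on its device and
  // copy/convert the data there.
  TORCH_CHECK(false, "not implemented in this sketch");
  return at::Tensor();  // unreachable
}

// Only the backend key needs a kernel: aten::to decomposes into _to_copy
// "above" autograd, and core already has the derivative formula for _to_copy.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("_to_copy", &custom_to_copy);
}

With something like that registered, y.to(torch.int32) should route through the decomposition of aten::to and end up in the PrivateUse1 _to_copy kernel.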

heidongxianhua commented 1 year ago

Oh, thank you, that looks good. I will give it a try.

heidongxianhua commented 1 year ago

@bdhirsh Hi, I have tried what you said, and it works OK, thank you.

But there is a new question, haha. I tested the DDP module (distributed training): when I register the allgather_ op to the PrivateUse1 backend, I get a similar error. When I register it to the AutogradPrivateUse1 backend instead, it also works well, but again that is not the expected result.

TORCH_LIBRARY_IMPL(c10d, PrivateUse1, m) {
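  // Register the custom allgather_ kernel for the PrivateUse1 backend key
  // under the c10d namespace (no AutogradPrivateUse1 entry).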
  m.impl("allgather_", allgather_PrivateUse1_);
}

And the error message is:

  File "/home/torch2/ResNet50_for_PyTorch/DistributedResnet50/main_apex_d76_npu.py", line 533, in main_worker
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], broadcast_buffers=False)
  File "/root/anaconda3/envs/torch2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/anaconda3/envs/torch2/lib/python3.8/site-packages/torch/distributed/utils.py", line 206, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
NotImplementedError: Could not run 'c10d::allgather_' with arguments from the 'Autogradnpu' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'c10d::allgather_' is only available for these backends: [CPU, CUDA, PrivateUse1, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at ../torch/csrc/distributed/c10d/Ops.cpp:644 [kernel]
CUDA: registered at ../torch/csrc/distributed/c10d/Ops.cpp:648 [kernel]
PrivateUse1: registered at /home/djh/ascend-pytorch/djh_pytorch_v2.0.0/torch_npu/csrc/dist_for_hccl/Ops.cpp:36 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:491 [backend fallback]
bdhirsh commented 1 year ago

@heidongxianhua hmmm...

Can you try adding this line in your registration code?

TORCH_LIBRARY_IMPL(_, AutogradPrivateUse1, m) {
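  // Fallthrough: autograd becomes a silent no-op for ops that only have a
  // PrivateUse1 backend kernel (the same behavior the in-tree backends get).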
  m.fallback(torch::CppFunction::makeFallthrough());
}

Some more context: that's coming from this file, where we register "fallthrough" autograd kernels to all of the in-tree autograd keys.

For backends that have the autograd fallthrough kernel, the idea is that if an op only has a kernel registered to the backend key, autograd will silently be a no-op. For backends without the fallthrough, trying to run such an op makes autograd error out instead.

The original idea according to the comments in that file seems to be that external backends might not want that "default ignore autograd" behavior, for custom ops / ops that don't have autograd implementations.

I think there are a few options:

(1) Add that fallback directly in your code. This is probably the simplest path.

(2) Have the distributed ops register that autograd fallback directly (this would probably require checking with the distributed folks - it seems like the right behavior, if that op is not supposed to participate in autograd). That would mean that you wouldn't need to add the fallback yourself.

(3) Update the file I linked so that it also registers the autograd fallthrough to the PrivateUse keys. This would be helpful in your case, but I'm not sure how BC-breaking it would be - there could be other users of the PrivateUse1 dispatch key relying on the existing behavior of not having a fallthrough.