heidongxianhua opened 1 year ago
cc @bdhirsh
I registered `to_dtype` to the AutogradPrivateUse1 backend instead, and it works well; the code is here: https://github.com/pytorch/pytorch/compare/master...heidongxianhua:pytorch:dispatch?expand=1. But I don't think that is the expected result: as a PrivateUse1 backend implementor, we cannot know which ops should be registered to PrivateUse1 and which to AutogradPrivateUse1. Waiting for your reply.
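For reference, the workaround described here amounts to something like the following sketch (the kernel name `to_dtype_impl` and the exact overload are assumptions based on the description, not copied from the linked diff):

```cpp
#include <torch/library.h>
#include <ATen/ATen.h>

// Hypothetical custom kernel matching the aten::to.dtype schema.
at::Tensor to_dtype_impl(const at::Tensor& self, at::ScalarType dtype,
                         bool non_blocking, bool copy,
                         c10::optional<at::MemoryFormat> memory_format) {
  // ... device-specific conversion would go here ...
  return self;  // placeholder
}

// The workaround: registering the kernel to the *autograd* key for the
// custom backend, rather than to the backend key itself.
TORCH_LIBRARY_IMPL(aten, AutogradPrivateUse1, m) {
  m.impl("to.dtype", to_dtype_impl);
}
```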
Hey @heidongxianhua. This interaction is a bit annoying, but the tl;dr: instead of registering a kernel directly to `aten::to()`, can you register a kernel to `_to_copy()`?

Context: `.to()` is a bit special: `x.to(...)` will sometimes return a fresh tensor, and sometimes directly return `x` if the metadata matches and no copy is needed. Because of this "sometimes returns a copy" behavior, we don't have an autograd formula for `.to()`. What happens instead is that `.to()` decomposes into `_to_copy()` "above" autograd, and we have a derivative formula directly for `_to_copy()`.

The way you can tell that this is happening is that you'll see that `aten::to()` is registered to the `CompositeImplicitAutograd` dispatch key in native_functions.yaml (or by running this code and seeing that there is indeed an entry registered to the `CompositeImplicitAutograd` dispatch key):

```shell
python -c 'import torch; print(torch._C._dispatch_dump("aten::to.dtype"))'
```

As a backend implementor, if you aren't interested in writing your own autograd formulas, then in general you don't want or need to write kernels for operators with `CompositeImplicitAutograd` registrations in core. Instead, you can see what operators they desugar into (here's the decomp for `.to()`), and write implementations for those.
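A minimal sketch of what that could look like for an out-of-tree backend (the kernel name and body are hypothetical; the `_to_copy` parameter list below follows the schema in recent PyTorch releases, but check native_functions.yaml for your version):

```cpp
#include <torch/library.h>
#include <ATen/ATen.h>

// Hypothetical backend kernel for aten::_to_copy. A real implementation
// would allocate the output on the custom device and perform the cast/copy.
at::Tensor my_to_copy(
    const at::Tensor& self,
    c10::optional<at::ScalarType> dtype,
    c10::optional<at::Layout> layout,
    c10::optional<at::Device> device,
    c10::optional<bool> pin_memory,
    bool non_blocking,
    c10::optional<at::MemoryFormat> memory_format) {
  // ... device-specific copy/cast logic goes here ...
  return self;  // placeholder
}

// Register to the backend key only; autograd for _to_copy is handled by
// the in-tree derivative formula, so no AutogradPrivateUse1 entry is needed.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("_to_copy", my_to_copy);
}
```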
Oh, thank you, that looks good. I will give it a try.
@bdhirsh Hi, I have tried what you said and it works, thank you.
But there are new questions, haha. When I test the DDP module (distributed training) and register the op `allgather_` to the PrivateUse1 backend, I get a similar error. And when I register it to the AutogradPrivateUse1 backend, it also works, but that is not the expected result.
```cpp
TORCH_LIBRARY_IMPL(c10d, PrivateUse1, m) {
  m.impl("allgather_", allgather_PrivateUse1_);
}
```
And the error message is:

```
File "/home/torch2/ResNet50_for_PyTorch/DistributedResnet50/main_apex_d76_npu.py", line 533, in main_worker
  model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], broadcast_buffers=False)
File "/root/anaconda3/envs/torch2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
  _verify_param_shape_across_processes(self.process_group, parameters)
File "/root/anaconda3/envs/torch2/lib/python3.8/site-packages/torch/distributed/utils.py", line 206, in _verify_param_shape_across_processes
  return dist._verify_params_across_processes(process_group, tensors, logger)
NotImplementedError: Could not run 'c10d::allgather_' with arguments from the 'Autogradnpu' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'c10d::allgather_' is only available for these backends: [CPU, CUDA, PrivateUse1, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at ../torch/csrc/distributed/c10d/Ops.cpp:644 [kernel]
CUDA: registered at ../torch/csrc/distributed/c10d/Ops.cpp:648 [kernel]
PrivateUse1: registered at /home/djh/ascend-pytorch/djh_pytorch_v2.0.0/torch_npu/csrc/dist_for_hccl/Ops.cpp:36 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:491 [backend fallback]
```
@heidongxianhua hmmm...
Can you try adding these lines in your registration code?

```cpp
TORCH_LIBRARY_IMPL(_, AutogradPrivateUse1, m) {
  m.fallback(torch::CppFunction::makeFallthrough());
}
```
Some more context: that error is coming from this file, where we register "fallthrough" autograd kernels for all of the in-tree autograd keys.
For backends that have the autograd fallthrough kernel, the idea is that if an op only has a kernel registered to the backend key, autograd will silently be a no-op. Otherwise, if you try to run the op, autograd will error.
According to the comments in that file, the original idea seems to be that external backends might not want that "silently ignore autograd" default for custom ops / ops that don't have autograd implementations.
I think there are a few options:
(1) Add that fallback directly in your code. This is probably the simplest path.
(2) Have the distributed ops register that autograd fallback directly (this would probably require checking with the distributed folks; it seems like the right behavior if the op is not supposed to participate in autograd). That way you wouldn't need to add the fallback yourself.
(3) Update the file I linked to also register the autograd fallthrough to the PrivateUse keys. This would be helpful in your case, but I'm not sure how BC-breaking it would be; there could be other users of the PrivateUse1 dispatch key relying on the existing behavior of not having a fallthrough.
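Putting option (1) together with the registration shown earlier in the thread, the backend-side code would look roughly like this sketch (`allgather_PrivateUse1_` stands in for the backend's real collective kernel, declared elsewhere in the extension):

```cpp
#include <torch/library.h>

// Backend kernel for the collective op, as registered earlier in the thread.
// allgather_PrivateUse1_ is the backend's kernel, declared elsewhere.
TORCH_LIBRARY_IMPL(c10d, PrivateUse1, m) {
  m.impl("allgather_", allgather_PrivateUse1_);
}

// Option (1): register a global fallthrough for AutogradPrivateUse1, so ops
// without autograd formulas dispatch straight through to the PrivateUse1
// kernels instead of erroring at the autograd key.
TORCH_LIBRARY_IMPL(_, AutogradPrivateUse1, m) {
  m.fallback(torch::CppFunction::makeFallthrough());
}
```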
🐛 Describe the bug
When I use a new backend with the PrivateUse1 key, and we have an implementation of the op `to` for that backend like this, I think there is an error with the AutogradPrivateUse1 dispatch key, so I did some tests for the backend.
test_code
I added a `to_dtype` func based on the test cpp_extensions/open_registration_extension.cpp, link: https://github.com/heidongxianhua/pytorch/commit/fdb57dac418ec849dfd7900b1b69e815840b06b5
And when I run the test with `python3 test_cpp_extensions_open_device_registration.py`, it does not work and the error message is here. I checked that the `to_dtype` func has been registered for the PrivateUse1 backend.
Versions