pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

ATen operator API versioning #38973

Open fengyuan14 opened 4 years ago

fengyuan14 commented 4 years ago

🚀 Feature

While implementing a new out-of-source ATen backend extension for PyTorch, we have found that ATen operator APIs are incompatible from version to version (even across minor versions, e.g., within v1.5.x).

We would like ATen operator API versioning to be provided to improve the user experience when an out-of-source extension does not match the PyTorch (ATen operator API) it is loaded into.

Motivation

End users may get a runtime error about an ATen operator API mismatch when they combine a given Intel Extension for PyTorch with a different PyTorch minor version. For example, extension v0.1 is based on PyTorch v1.5.0; an end user who runs extension v0.1 on PyTorch v1.5.3+ may get an ATen runtime error due to ATen operator API changes.

In addition, different workloads may hit different ATen runtime errors (different operator API changes). Such errors are informative enough for extension developers, but they are not friendly to end users.

So, intuitively, we want to raise a single warning up front at runtime if any ATen operator APIs have changed. This is friendlier to users and should not introduce any risk.

Pitch

We would like ATen operator API versioning that can be checked at runtime, so that a warning is raised at extension loading time if the PyTorch ATen operator API version is not supported by the extension.

P.S. We considered checking only the PyTorch version, but it would take a huge effort to investigate ATen operator API changes across all PyTorch versions (including all minor versions).
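
To make the pitch concrete, here is a minimal sketch of what the check could look like from the extension side, assuming PyTorch exposed such a version number. The binding name torch._C._aten_op_api_version and the version value below are invented for illustration; no such API exists today.

```python
# Sketch only: _ATEN_OP_API_VERSION and torch._C._aten_op_api_version() are
# hypothetical names used to illustrate the proposal, not existing PyTorch APIs.
import warnings

import torch

# The ATen operator API version this extension build was generated against.
_ATEN_OP_API_VERSION = 5

def _check_aten_op_api_version():
    runtime_version = torch._C._aten_op_api_version()  # hypothetical binding
    if runtime_version != _ATEN_OP_API_VERSION:
        warnings.warn(
            "This extension was built against ATen operator API version {}, "
            "but the installed PyTorch ({}) provides version {}; ATen operators "
            "may fail to register or dispatch.".format(
                _ATEN_OP_API_VERSION, torch.__version__, runtime_version
            )
        )

_check_aten_op_api_version()
```

The goal is a single, readable warning at import time instead of a per-operator registration or dispatch error later in the run.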

ezyang commented 4 years ago

A few thoughts:

Also cc @ailzhang for some perspective from XLA.

fengyuan14 commented 4 years ago

Thanks, @ezyang,

Let me clarify my understanding. Actually, we have two issues:

  1. how to be compatible (co-work) with multiple PyTorch versions;
  2. how to provide a graceful exit to the end user if they are not compatible.

Regarding issue 1, we understand it is hard to provide a fully compatible solution to end users: either we would have to spend a lot of effort maintaining several extension versions, or PyTorch would have to provide a compatible API. In my mind, issue 1 is a long-term discussion.

So we hope that, at the current stage, issue 2 can be solved. We think there should be one clear, high-level warning shown to the end user, rather than varied, detailed warnings for each ATen op.

A detailed warning is good enough for developers in debug mode, but it is not clear to end users in release mode.

ailzhang commented 4 years ago

Yeah, I agree that it's probably too much to push for compatibility across multiple PyTorch versions for now (although it's a good long-term goal); a compatibility check/warning might be good enough and feasible.

Although, from XLA's experience, C++ API-level changes mostly show up as compile errors (e.g. function signature changes). @arthuryuan1987, can you provide a few examples of runtime errors due to API changes as well? Thanks!

ezyang commented 4 years ago

The error message trick in https://github.com/pytorch/pytorch/pull/38739 might be relevant here.

fengyuan14 commented 4 years ago

We consider the ATen operator API to include two parts: the operator signature and the operator dispatch strategy. Let me show the error logs for each separately.

  1. Operator signature mismatch (at::_embedding_bag): bool include_last_offset=False was added in PyTorch v1.5. If we blindly use an extension built against PyTorch v1.4, we get the following when loading the extension:

    ImportError: Tried to register multiple operators with the same name and the same overload name but different schemas: aten::_embedding_bag(Tensor weight, Tensor indices, Tensor offsets, bool scale_grad_by_freq=False, int mode=0, bool sparse=False, Tensor? per_sample_weights=None) -> (Tensor, Tensor, Tensor, Tensor) vs aten::_embedding_bag(Tensor weight, Tensor indices, Tensor offsets, bool scale_grad_by_freq=False, int mode=0, bool sparse=False, Tensor? per_sample_weights=None, bool include_last_offset=False) -> (Tensor, Tensor, Tensor, Tensor) (findOrRegisterSchema_ at /home/fengyuan/workspace/pytorch/pytorch-extension/aten/src/ATen/core/dispatch/Dispatcher.cpp:64)
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7fb4fe19c06c in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libc10.so)
    frame #1: c10::Dispatcher::findOrRegisterSchema_(c10::FunctionSchema&&) + 0x1a7 (0x7fb4f908ac77 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #2: c10::Dispatcher::registerSchema(c10::FunctionSchema) + 0x9e (0x7fb4f908bcee in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #3: <unknown function> + 0x7fff27 (0x7fb4f90b9f27 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #4: c10::RegisterOperators::registerSchemaAndKernel_(c10::FunctionSchema, c10::RegisterOperators::Options::KernelRegistrationConfig&&) + 0xe3 (0x7fb4f90b1ad3 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #5: c10::RegisterOperators::registerOp_(c10::RegisterOperators::Options&&) + 0xaf6 (0x7fb4f90b2926 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #6: c10::RegisterOperators::checkSchemaAndRegisterOp_(c10::RegisterOperators::Options&&) + 0x97d (0x7fb4f90b53ed in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
    frame #7: at::RegisterAtenTypeFunctions() + 0x9d6 (0x7fb4d111aee6 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch_ipex-0.1-py3.7-linux-x86_64.egg/_torch_ipex.so)
    frame #8: PyInit__torch_ipex + 0x11e (0x7fb4d10eea6e in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch_ipex-0.1-py3.7-linux-x86_64.egg/_torch_ipex.so)
    <omitting python frames>
    frame #55: __libc_start_main + 0xe7 (0x7fb501b15b97 in /lib/x86_64-linux-gnu/libc.so.6)

    Of course, if we rebase the extension to support PyTorch v1.5, the registration code is regenerated automatically and we get a compilation error from the mismatch between the generated code and our native implementation. That is a separate discussion (development or debugging mode). What we want to discuss here is that the error log above is confusing for end users (release mode).

  2. Operator dispatch strategy change (at::tanh):

    RuntimeError: Could not run 'aten::tanh' with arguments from the 'XXXTensorId' backend. 'aten::tanh' is only available for these backends: [CPUTensorId, QuantizedCPUTensorId, VariableTensorId].

    We think an API version warning would be clearer to the end user; a sketch of such a load-time check is shown after this list.
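
Until such a versioning API exists, one interim way to turn these per-operator failures into a single high-level warning is to compare, at extension load time, the schema strings the extension was code-generated against with the schemas the installed PyTorch actually exposes. This is a rough sketch, assuming the internal torch._C._jit_get_schemas_for_operator binding is available (it is not a stable public API) and using a hypothetical _EXPECTED_SCHEMAS table emitted by the extension's code generator; the expected schema string below is the v1.4-era one from the error log above:

```python
# Sketch: compare the schemas the extension was generated against with the
# schemas registered in the running PyTorch, and emit one clear warning.
# _EXPECTED_SCHEMAS is hypothetical, and torch._C._jit_get_schemas_for_operator
# is an internal binding, so treat its availability as an assumption.
import warnings

import torch

_EXPECTED_SCHEMAS = {
    "aten::_embedding_bag": (
        "aten::_embedding_bag(Tensor weight, Tensor indices, Tensor offsets, "
        "bool scale_grad_by_freq=False, int mode=0, bool sparse=False, "
        "Tensor? per_sample_weights=None) -> (Tensor, Tensor, Tensor, Tensor)"
    ),
}

def _find_mismatched_ops():
    mismatched = []
    for name, expected in _EXPECTED_SCHEMAS.items():
        # Compare against the printed form of every schema registered under
        # this operator name in the installed PyTorch.
        found = [str(s) for s in torch._C._jit_get_schemas_for_operator(name)]
        if expected not in found:
            mismatched.append(name)
    return mismatched

_mismatched = _find_mismatched_ops()
if _mismatched:
    warnings.warn(
        "The installed PyTorch ({}) uses different schemas for: {}. Please "
        "install an extension build that matches this PyTorch version.".format(
            torch.__version__, ", ".join(_mismatched)
        )
    )
```

This only covers the signature-mismatch case; the dispatch-strategy case in 2) would still surface as a runtime error, which is why a PyTorch-provided ATen operator API version would be cleaner.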

ailzhang commented 4 years ago

@arthuryuan1987 I see. For 1), this is indeed one reason why we release torch_xla packages together with a corresponding torch package (on Colab and in Docker images). For 2), XLA's workaround is to have a fallback implementation of every op and register all ops to the backend, so we never hit "no aten::tanh is available for XX backend". This fallback is also auto-generated from RegistrationDeclarations.h and sits in the generated torch_xla/csrc/aten_xla_type_default.cpp. Your case might be slightly different from XLA's; I just want to provide some context in case it's helpful.

ezyang commented 4 years ago

@arthuryuan1987 I imagine there are some simple rewordings of these error messages that could make things clearer for users. Do you want to submit a PR doing this? Add me as a reviewer.

fengyuan14 commented 4 years ago

@ailzhang, for 1), releasing our extension packages together with a corresponding torch package might be too heavy for us. I wonder, do you release a torch_xla package only when ATen operator APIs change? Suppose ATen operator API changes land in PyTorch versions 1.5.1, 1.5.2, 1.5.4, and 1.5.8; would you then release torch_xla for each of these minor versions? If so, I think it is too heavy for us. For 2), I think being compatible is a good idea.

@ezyang, as you can see, the call stack (in 1) above) may be wordy for end users. In addition, end users will get different call stacks if there are several ATen operator API changes. Yes, I can submit a PR.

ezyang commented 4 years ago

Being able to release a single package for multiple minor versions of ATen is going to be a hard path to go down. We have historically made ZERO ABI compatibility guarantees, even across minor versions, and infrastructurally speaking we're not set up to do this in the future. If it makes you feel better, we don't release minor versions that often, so it essentially amounts to just doing major version releases.

fengyuan14 commented 4 years ago

Agreed with what you say about compatibility among minor versions. If ATen API compatibility only breaks on major version releases, that will be fine for backend extensions. We already release extension packages separately for each major version (PyTorch v1.4, v1.5, v1.6).