pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

[RFC] XPU device for PyTorch #48246

Open chengjunlu opened 3 years ago

chengjunlu commented 3 years ago

🚀 Feature

This RFC proposes to add a new user visible 'XPU' device type and the corresponding Python device runtime API to PyTorch.

XPU is a device abstraction for Intel heterogeneous computation architectures, which can be mapped to CPU, GPU, FPGA, and accelerator.

For PyTorch users, XPU works as a normal torch device, just like CPU or CUDA device, but could run with Intel CPU optimization, GPU implementation, or a mixture of heterogeneous devices according to the device mapping.

The enabling of XPU contains two parts:

  1. adding a new 'XPU' device type and Python device runtime API to PyTorch, which is the focus of this RFC;
  2. implementation of PyTorch ops and the XPU runtime, which will be supported via a PyTorch extension, i.e. Intel Extension for PyTorch (IPEX). When imported by PyTorch users, IPEX registers the XPU implementation through the C10 and C10D registration APIs to enable the full functionality of XPU, similar to what other PyTorch extensions do. IPEX also provides additional Python APIs and implementations to configure the device implementation, for example, the device mapping.

Motivation

Intel is promoting oneAPI as the industry standard and delivers Intel oneAPI products (an implementation of the oneAPI standard) for high-performance heterogeneous computing.

oneAPI brings a unified programming model that gives developers a common experience across XPU, which stands for heterogeneous hardware architectures including CPUs, GPUs, FPGAs, and other accelerators.

Adding XPU as a PyTorch device is a fundamental step to allow PyTorch users to get the best performance of DL workloads on oneAPI-powered heterogeneous hardware architectures.

Pitch

For this RFC in particular, we propose the following changes:

Additional context

Work with XPU

PyTorch users work with the XPU device as a normal device, but need to import IPEX for full functionality:

```python
import torch
import ipex  # Intel Extension for PyTorch

xpu = torch.device('xpu')

input = torch.randn([100]).to(xpu)

model.to(xpu)  # `model` is an existing torch.nn.Module

output = model(input)
```

Python API for Device Runtime

IPEX will register its runtime module into PyTorch at the first import. Therefore, a new API, torch.register_runtime(name, module), will be added to the PyTorch frontend; it will be introduced in a separate RFC soon.

Model code:

```python
import torch
# XPU runtime will be registered at the first import
import ipex

# Refer to the XPU runtime API
torch.xpu.current_device()
torch.xpu.current_stream()
...
```

IPEX inside:

```python
# Register XPU runtime via the new PyTorch API
import sys
import torch

current_module = sys.modules[__name__]

# New PyTorch API for runtime module registration
torch.register_runtime('xpu', current_module)
...
```
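
For illustration only, here is a minimal sketch of what the proposed torch.register_runtime hook could do on the PyTorch side. The function body below is an assumption; the real design is deferred to the separate RFC mentioned above.

```python
import types

def register_runtime(name: str, module: types.ModuleType) -> None:
    """Hypothetical sketch: expose an extension's runtime module as torch.<name> (e.g. torch.xpu)."""
    import torch
    if hasattr(torch, name):
        raise RuntimeError(f"a runtime named '{name}' is already registered")
    # Make the extension's runtime API reachable as torch.<name>.*
    setattr(torch, name, module)
```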

XPU Device Mapping

The 'xpu' device may map to different physical devices (the xpu mapping). An additional API in IPEX, ipex.set_xpu_mapping(mapping), is recommended so that developers can configure the mapping.

```python
import ipex

# XPU == Explicit mapping
# The explicit mapping exposes all physical devices as xpu,
# each with a different ordinal.
ipex.set_xpu_mapping(mapping="explicit")
```

```python
import ipex

# XPU == Extended CPU optimization
# The CPU mapping provides more specific optimizations
# on CPU devices only.
ipex.set_xpu_mapping(mapping="CPU")
```

```python
import ipex

# XPU == GPU implementation
# The GPU mapping maps XPU to GPU devices only.
ipex.set_xpu_mapping(mapping="GPU")
```

```python
import ipex

# XPU == AUTO implementation
# The AUTO mapping automatically schedules the computation on
# heterogeneous devices, e.g. CPU/GPU combined.
ipex.set_xpu_mapping(mapping="AUTO")
```

In line with the above, if the xpu mapping is set to CPU, the xpu device index is always limited to 0; that is, torch.device("xpu", 1) will be invalid.
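
A short sketch of this constraint in user code, assuming ipex and ipex.set_xpu_mapping behave as proposed above:

```python
import torch
import ipex  # proposed Intel Extension for PyTorch API, per this RFC

ipex.set_xpu_mapping(mapping="CPU")

dev0 = torch.device("xpu", 0)    # valid: the only ordinal under the CPU mapping
# dev1 = torch.device("xpu", 1)  # invalid under the CPU mapping, per the proposal
```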

ezyang commented 3 years ago

Is the code backing this public yet? It would be very helpful if we could take a look.

Jianhui-Li commented 3 years ago

@ezyang PR #48247 has been submitted to support this RFC: https://github.com/pytorch/pytorch/pull/48247 @gottbrath

ezyang commented 3 years ago

I'm referring more to the underlying code that would actually make use of this PR.

Jianhui-Li commented 3 years ago

@ezyang, we are in the process of making the "xpu" support available in Intel Extension for PyTorch (https://github.com/intel/intel-extension-for-pytorch). It will register the "xpu" device and runtime with PyTorch, so the code will be included there. @gujinghui PyTorch users can use the 'xpu' device like 'cpu' or 'cuda'. The runtime registration will be in a separate RFC, and any new device can use a similar runtime registration API. @ailzhang @VitalyFedyunin

ezyang commented 3 years ago

Thanks, that's the link I was looking for.

ezyang commented 3 years ago

I took a quick look at your TensorImpl at https://github.com/intel/intel-extension-for-pytorch/blob/master/torch_ipex/csrc/ipex_tensor_impl.h and I'm a bit confused. You don't seem to be adding any extra fields to the TensorImpl. So why do you need a new TensorImpl at all? Wouldn't it be simpler to make use of the basic TensorImpl; or even better, a CPU TensorImpl (if the data you're referring to lives on CPU, which it seems to be)?

jgong5 commented 3 years ago

You don't seem to be adding any extra fields to the TensorImpl. So why do you need a new TensorImpl at all? Wouldn't it be simpler to make use of the basic TensorImpl

@ezyang Thanks for the feedback. You are right - we should be able to use the default TensorImpl. Let me explain why we introduced a new IPEXTensorImpl. We intended to reuse existing PyTorch native CPU kernels from IPEX via re-dispatch. In that scenario, we need to do shallow copies of all metadata from the IPEX tensor to the CPU tensor (and vice versa), except for the device and storage, since the IPEX tensor and the CPU tensor use different device types and storage. The method TensorImpl::copy_tensor_metadata can't serve that purpose, and there are protected fields that we can't access from outside. So we created IPEXTensorImpl with the method IPEXTensorImpl::copy_meta_info for that.

After a double-check on the latest PyTorch code, it seems we can construct such a shallow copy ourselves by calling a sequence of public methods of TensorImpl. We will consider removing IPEXTensorImpl in the future. Copy @EikanWang in case he has more inputs.

if the data you're referring to lives on CPU, which it seems to be

It is true that the data lives on the CPU, but the storage design is different: the IPEX tensor adds an extra abstraction to hide the blocked layout via a custom context in its storage (or its DataPtr).

gujinghui commented 3 years ago

Thanks, that's the link I was looking for.

@ezyang Sorry for the slow response. We are working on submitting the next RFC for runtime registration, and will @ you as soon as it's ready.

smessmer commented 3 years ago

Can torch.register_runtime be kept outside of torch and be ipex.register_runtime() instead? It would be nice if that kind of thing could be done out of tree. If not, what would we have to change in-tree at the minimum to make an out of tree implementation possible?

chengjunlu commented 3 years ago

@smessmer We are working on the RFC for the runtime registration. And your suggestion will be taken into consideration.

Jianhui-Li commented 3 years ago

@ezyang @gottbrath Can we assign an owner for the RFC and PR? It is critical for us to enable Intel GPU and other new devices.

gujinghui commented 3 years ago

@smessmer We are working on the RFC for the runtime registration. And your suggestion will be taken into consideration.

@smessmer Thanks for the questions. The motivations to keep register_runtime in the tree are:

  1. To gracefully support out-of-tree devices in PyTorch internal components, for example, DistributedDataParallel. Otherwise, extension implementations would have to provide their own modified copies of DistributedDataParallel, which hurts the PyTorch user experience.

  2. To keep the PyTorch runtime API under the control of the PyTorch tree. Otherwise there would be a number of different API changes provided by various extension vendors, which would hurt the PyTorch ecosystem and user experience. We plan to unify some general runtime APIs for all extensions by checking in torch.register_runtime().

We are preparing another PR for this topic; perhaps we can discuss the details there.

Thanks.

ezyang commented 3 years ago

I owe y'all a first review. Unfortunately I have to look over the extension repo first so it has been taking some time.

Jianhui-Li commented 3 years ago

I owe y'all a first review. Unfortunately I have to look over the extension repo first so it has been taking some time.

Thanks @ezyang. If needed, we can set up a meeting to go through the code with you so that we can explain the motivation better. Please let us know. @gottbrath

ezyang commented 3 years ago

OK, I have reviewed the design and also taken a look at the repository for some background context. I have two primary concerns. I'll lay them out below, and also give some suggestions for how we might address them. However, I want to be careful to say that these are not directives: I'm open to hearing about other approaches.

Concern 1: XPU as a name is too generic for what the Intel integration would actually implement. In your opening post, you say "XPU is a device abstraction for Intel heterogeneous computation architectures, which can be mapped to CPU, GPU, FPGA, and accelerator." But I don't see why Intel is involved here at all. You could easily imagine XPU being an abstraction for heterogeneous computation on your normal deep learning machine, which typically has both a CPU and a GPU on it. The feature would look like this:

Now, I know this is not Intel's intention with oneAPI: Intel's idea is that you write a single DPC++ kernel, and then can reuse it on CPU and CUDA, and the point of the integration work you are doing here is to tap into this new ecosystem of kernels (and avoid using PyTorch's preexisting CPU and CUDA kernels when there is an Intel one available). And I'm guessing you also want some fancy heterogeneous programming smarts, where you can dynamically decide device placement (although I'm not really sure how much you can really do in an eager API). There's nothing wrong with this, but it's better to tell it like it is, and name the device type so it accurately reflects what it is targeting. (Not sure if dpcpp or oneapi is a better name.)

One thing that makes me wonder if you want to solve the more general problem, is your statement "Decouple the PyTorch frontend codes hardcoded for CUDA runtime in long term." This is a legitimate problem, as evidenced by the work over at our fine friends at AMD, where AMD GPUs masquerade as CUDA in a custom build of PyTorch. But actually solving this problem once and for all is legitimately difficult, and I don't want a half-way solution to suck up the oxygen from the air when someone actually decides they want to solve this.

Concern 2: Interoperability with existing CPU tensors. Let's say that, whatever we name it, we have some sort of dpcpp tensor, and when you query its device, it reports that it's a dpcpp tensor. This accurately reflects the current state of the intel-extension-for-pytorch repository. Now we have a second problem, which is that when the dpcpp tensor actually lives on CPU, you want to be able to transparently interconvert them with normal PyTorch CPU tensors in O(1) without actually having to do a copy. And as you've seen while implementing things in your repo, this is pretty annoying to do, because storages in PyTorch are fixed to one device, the Tensor device is supposed to match the storage device, and your life is bad. (See also https://github.com/intel/intel-extension-for-pytorch/issues/144).

In fact, we've already had this problem with MKLDNN tensor (cc @mingfeima) and after a number of iterations, MKLDNN fixed this problem by making sure they used normal CPU tensors whenever layout exactly matched the conventional PyTorch layout convention. So there is a to_mkldnn() function, but it is only used in cases where you actually do need to do an O(n) conversion, because the MKLDNN layout is actually different.
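
For context, a minimal example of that explicit conversion on a CPU build of PyTorch with MKL-DNN (oneDNN) enabled:

```python
import torch

x = torch.randn(2, 3)   # ordinary strided CPU tensor
y = x.to_mkldnn()       # O(n) conversion into the opaque MKL-DNN layout
print(y.layout)         # torch._mkldnn
z = y.to_dense()        # O(n) conversion back to the strided CPU layout
```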

So this gets at a fundamental difference in PyTorch between a "device" and a "set of kernels that operate on it". A device talks about the fundamental nature of the data in question, i.e. is it in CPU memory or in GPU memory, how exactly is the data laid out. Kernels, on the other hand, operate on this data, and there may be many kernels that all work on one particular layout. You can use a device to indirectly control what the set of kernels you want are, but at the end of the day, this just isn't going to work all that well, because it's an abuse of notation. I don't think I need to beat this horse too much, since I got an admission in https://github.com/pytorch/pytorch/issues/48246#issuecomment-730806285 that IpexTensor isn't really necessary.

But you still have this problem, which is you've got some cool kernels in oneAPI and you want PyTorch to use them, but maybe not by default? Here's where I get a bit fuzzy about long term strategy. If the Intel kernel for some operator is better than our normal kernel for conventional strided layout, then we should use the Intel kernel, full stop. I.e., if I have two good old fashioned CPU tensors and I call add on them I should use the Intel kernel in this case. And if the Intel kernels are worse, but you just need to integrate them with PyTorch to do some testing, maybe some sort of flag that switches kernels over to Intel will do the trick. I'd love to know more about your use case to be able to say why this sort of approach isn't appropriate.

cc @ailzhang

jgong5 commented 3 years ago

@ezyang Thank you so much for the review and comments. I am summarizing what we synced offline below.

On concern 1, we agreed to use "XPU" as the device name. oneAPI/DPC++ is an industry standard on which any party can base an implementation to support their HW devices. XPU is a brand name for Intel heterogeneous HW based on the oneAPI programming paradigm and optimized with Intel implementations. Having XPU as the device name, instead of oneAPI or DPCPP, makes it clearer to users what HW and SW implementations they are using.

On concern 2, we agreed that it is a better choice to have the IPEX CPU optimizations as a PyTorch CPU acceleration path registered on the native CPU device. IPEX CPU has the following motivations:

  1. Ease-of-use: provide auto layout conversion (to address the bad UX of the explicit "to_mkldnn" API, explained here) and auto mixed precision (AMP) computation (for BF16 and INT8).
  2. Optimizations for the latest CPU AI features which target upstreaming but can't be upstreamed in time.
  3. Support for custom ops, e.g. fusion ops.

The current IPEX CPU optimization is implemented under the XPU device (arch diagram below) so that IPEX CPU is able to intercept all ATen ops and do auto layout conversion and auto mixed precision before invoking the actual implementations. For the ATen ops that it does not cover, IPEX CPU falls back to the native CPU implementation via redispatch. Storage is shared on conversions between XPU tensors and CPU tensors to avoid memory copy overhead. This is where the concern arises, and Edward raised it here.

[architecture diagram: IPEX CPU optimizations implemented under the XPU device]

With further discussion with Edward and other Facebook folks, we agreed that registering IPEX CPU optimizations on the native CPU device might also meet the motivations mentioned earlier (arch diagram below):

  1. For ease-of-use: we can probably extend the torch.autocast mechanism to support auto layout conversion and auto mixed precision. But this requires a proof-of-concept implementation to show that everything works out well, and also a new design to make the PyTorch autocast module extensible. Edward suggested implementing an autocast module inside IPEX CPU by registering against the "Autocast" dispatch key as a first step, to evaluate this idea. If it works out well, we can consider how to contribute the module back to PyTorch.
  2. For holding optimizations before they are upstreamed, IPEX CPU needs a way to override in-tree CPU kernels for existing ATen ops. Currently, registering a kernel on an existing pair of ATen op and dispatch key causes a warning message from PyTorch. It was agreed that such overriding behavior is reasonable and PyTorch will make it an official feature in the future (a sketch of the idea follows after the diagram below).
  3. There were also discussions on supporting view semantics for opaque-layout tensors so that mkldnn tensors can work with code that relies on those semantics, e.g. bucket storage sharing in DDP. No conclusions were reached here; we can have separate discussions later.

[architecture diagram: IPEX CPU optimizations registered on the native CPU device]
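
As an illustration of the kernel-override idea in item 2 above, here is a minimal sketch using the torch.library Python API that exists in current PyTorch (it postdates this discussion); fast_relu is a hypothetical stand-in for a vendor-optimized kernel.

```python
import torch

# Override the CPU kernel of an existing ATen op, as an out-of-tree extension might.
# PyTorch warns when a previously registered kernel is overridden.
lib = torch.library.Library("aten", "IMPL")

def fast_relu(x):
    # Hypothetical stand-in for an optimized vendor kernel.
    return torch.clamp(x, min=0)

lib.impl("relu", fast_relu, "CPU")

print(torch.relu(torch.tensor([-1.0, 2.0])))  # now dispatches to fast_relu on CPU
```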

Below is the user code with the IPEX CPU optimizations registered on the native CPU device. When IPEX is imported, IPEX-optimized kernels are registered on the CPU device, and torch.autocast is extended with auto layout conversion and BF16 auto mixed precision.

```python
import torch
import torch.nn as nn
import ipex

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self._conv2d = nn.Conv2d(3, 5, 5)
        self._softmax = nn.Softmax(dim=1)

    def forward(self, input):
        conv_res = self._conv2d(input)
        res = self._softmax(conv_res)
        return res

model = Model()
input = torch.randn(5, 3, 9, 9)
# The `layout` argument is the proposed autocast extension described above.
with torch.autocast(dtype=torch.bfloat16,
                    layout=torch.mkldnn):
    res = model(input)
```

As the next step, we will do proof-of-concept implementations for IPEX CPU on the native CPU device. And if things work out well, we will port the existing optimizations and also contribute things back to PyTorch.

louie-tsai commented 3 months ago

@aice-support