pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

[RFC] Intel GPU Upstreaming #114723

Open EikanWang opened 8 months ago

EikanWang commented 8 months ago

TL;DR

This RFC document aims to propose and discuss the upstreaming of Intel GPU support in PyTorch. Our focus is on leveraging Intel's advancements in GPU technology to enhance PyTorch's performance and versatility. This initiative begins with the torch.compile integration as a primary step and marks a significant stride towards incorporating the Intel GPU as a robust computational backend in PyTorch. The RFC outlines key components and a high-level design strategy for this integration. By aligning with PyTorch 2.5 release goals, we aim to provide Intel GPU as a Beta feature to benefit a wide range of users and applications.

Motivation

Intel GPUs significantly enhance workload performance, demonstrating strong processing efficiency. We have observed promising performance with Intel® Extension for PyTorch (IPEX). Therefore, we plan to upstream the features and optimizations matured in IPEX to stock PyTorch. This will provide an out-of-the-box experience for users on Intel GPU platforms and benefit the PyTorch community.

Approach

Eventually, we will fully support Intel GPU in PyTorch for both torch.compile mode and eager mode. From an execution perspective, we will reach this goal gradually, starting with torch.compile to align with the PyTorch 2.5 release as a Beta feature. Functionality and performance maturity will be driven by the Dynamo benchmarks: HuggingFace, TIMM, and TorchBench. Regarding data types, we will support FP32, TF32, BF16, and FP16 first. Other data types such as INT8 and FP8 are not within the scope of PyTorch 2.5; we will support them gradually.
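As a rough illustration of the intended user experience, the sketch below runs a small function through torch.compile with BF16 autocast, one of the data types in the 2.5 scope. This is our own minimal example, not code from the RFC: the CPU fallback and the eager fallback (for hosts without an Inductor toolchain) are there only so the sketch runs anywhere.

```python
import torch

def mlp_block(x, w):
    return torch.nn.functional.gelu(x @ w)

# Use the "xpu" device when an Intel GPU build is available; otherwise fall
# back to CPU so the sketch runs anywhere (the fallback is ours, not the RFC's).
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

x = torch.randn(8, 16, device=device)
w = torch.randn(16, 16, device=device)

compiled = torch.compile(mlp_block)
try:
    # autocast to bfloat16, one of the data types targeted for the 2.5 scope
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        out = compiled(x, w)
except Exception:
    # Inductor needs a working compiler toolchain; fall back to eager if absent.
    out = mlp_block(x, w)
```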

In addition, we have added a dedicated dispatch key and device name to PyTorch for Intel GPU, which can be found in the PyTorch GitHub repository. The components and features that we will upstream to stock PyTorch for Intel GPU will be based on the "XPU" device tag.
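The "XPU" dispatch key surfaces to users as the `"xpu"` device string. A minimal sketch of device selection, with a CPU fallback of our own so it runs on hosts without an Intel GPU (or on a PyTorch build without `torch.xpu`):

```python
import torch

# "xpu" is the device tag for Intel GPU; fall back to CPU when it is absent.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")
else:
    device = torch.device("cpu")

t = torch.ones(2, 3, device=device)
print(t.device.type)
```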

In summary, the scope of the PyTorch 2.5 release for Intel GPU is as follows:

Components

Since we are taking torch.compile as the initial step to align with the PyTorch 2.5 release, we have identified the Minimum Viable Product (MVP) set. It contains five crucial components:

- Intel GPU runtime
- Necessary native ATen operation support
- oneDNN library integration
- Intel GPU backend for Inductor
- CI/CD for Intel GPU

Besides the five above crucial components, we will rely on the Intel GPU driver and SYCL to implement the Intel GPU runtime and necessary native aten operations.

Design

In this section, we present a high-level design for each component. Regarding the detailed design, please refer to the dedicated RFC for each component for more information.

For a more comprehensive and detailed understanding of each component's design, we highly encourage you to explore the respective RFCs linked above. These documents provide in-depth insights and technical specifics that are crucial for a complete grasp of the proposed implementations and integrations.

Tasks

A more detailed task list is WIP.

### Intel GPU Runtime
- [x] oneAPI BaseToolkit Integration
- [x] `Device` for Intel GPU
- [x] `Stream` for Intel GPU
- [x] `Event` for Intel GPU
- [x] `Allocator` for Intel GPU
- [x] `Guard` for Intel GPU
- [x] Random Generator
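The runtime pieces above map onto familiar PyTorch APIs. As an illustration of our own (not code from the RFC), the sketch below exercises the device-bound random generator, and guards the `Stream` usage since those objects only exist on an actual XPU runtime:

```python
import torch

use_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
device = "xpu" if use_xpu else "cpu"

# Random Generator task: a torch.Generator bound to the device gives
# reproducible draws.
g = torch.Generator(device=device)
g.manual_seed(42)
a = torch.randn(3, device=device, generator=g)
g.manual_seed(42)
b = torch.randn(3, device=device, generator=g)

# Stream/Event/Guard exist only on the GPU runtime itself, so guard them.
if use_xpu:
    s = torch.xpu.Stream()
    with torch.xpu.stream(s):
        c = a + b
    s.synchronize()
```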
### Necessary Native Aten Operation Support
- [x] Integrate XPU OPs as a third-party component
- [x] SYCL Compiler Host/Device Separate Compilation
- [x] ATen Operations (Incremental): Elementwise
- [x] ATen Operations (Incremental): Reduction
- [x] ATen Operations (Incremental): Concat, Sort, Arange and Indexing
- [x] Dynamo HuggingFace Benchmark
- [x] Dynamo TIMM Benchmark
- [x] Dynamo TorchBench Benchmark
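The incremental ATen op families above (elementwise, reduction, concat, sort, arange, indexing) can be exercised with a few lines of ordinary tensor code. This is a sketch of ours to show the op coverage, with a CPU fallback for hosts without an Intel GPU:

```python
import torch

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

x = torch.arange(6, dtype=torch.float32, device=device)   # Arange
e = torch.sigmoid(x) * 2.0                                # Elementwise
r = e.sum()                                               # Reduction
c = torch.cat([x, e])                                     # Concat
s, idx = torch.sort(c, descending=True)                   # Sort
top3 = c[idx[:3]]                                         # Indexing
```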
### oneDNN Library Integration
- [x] oneDNN Library for Intel GPU Integration
- [x] ATen Operations: Conv
- [x] ATen Operations: GEMM
- [ ] ATen Operations: GEMM-Fused Operations
- [ ] ATen Operations: Conv-Fused Operations
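Conv and GEMM are the two op families routed through oneDNN above. A minimal sketch of ours (not the upstreamed kernels themselves) showing the user-facing calls that would hit those paths on an XPU device, with a CPU fallback so it runs anywhere:

```python
import torch

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
x = torch.randn(1, 3, 16, 16, device=device)
feat = conv(x)                       # Conv -> oneDNN convolution kernel on XPU
w = torch.randn(8, 4, device=device)
logits = feat.mean(dim=(2, 3)) @ w   # GEMM -> oneDNN matmul kernel on XPU
```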
### Intel GPU Backend for Inductor
- [x] Python Wrapper Code Generation for Intel GPU
- [x] Intel GPU Backend on Top of Triton for Kernel Code Generation
### CI/CD for Intel GPU
- [x] Self-hosted Runner Hosted in Intel Developer Cloud to Be Available in PyTorch
- [x] AWS-Docker-Based CI/CD Build Task Available for Intel GPU
- [x] CI/CD Test Task Available for Intel GPU

Additional context

This RFC primarily concentrates on enabling Intel GPU support for torch.compile. Additionally, we are evaluating the possibility of extending this support to eager mode through torch.compile as well. Please refer to #115545.

cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

cpuhrsch commented 8 months ago

Adding this for triage review so we can discuss whether we want a new module tag for this work etc.

ezyang commented 7 months ago

What was the conclusion of the triage review discussion?

EikanWang commented 7 months ago

@ezyang we've proceeded by proposing detailed designs for each component individually. These proposals will be illustrated through pull requests (PRs), allowing us to demonstrate our ideas effectively, and we can refine the PRs directly if reviewers have comments. Our approach primarily focuses on maximizing the reuse of existing PyTorch code and designs.

From the execution perspective, the Intel GPU runtime is the prerequisite for the other components, so we would appreciate your help reviewing the Intel GPU runtime PRs first. Once the Intel GPU runtime PRs have landed, we will prioritize landing the PRs for the other components. In the meantime, we will submit the PRs for the other components for review.

Additionally, we have developed a comprehensive roadmap aimed at aligning our efforts with the PyTorch 2.5 release timeline, positioning these features as experimental in this version. This roadmap has been reviewed and discussed with Nikita and Chris to ensure a cohesive understanding and approach.

I'll be sharing this roadmap with you on Slack for your reference and further input. If there are any aspects of your inquiry that I may have missed or if you need further clarification on any point, please feel free to let me know.

EikanWang commented 7 months ago

> Adding this for triage review so we can discuss whether we want a new module tag for this work etc.

@cpuhrsch , may I know if we can add a new module tag now to triage review and on-call?

louie-tsai commented 3 months ago

@aice-support