[RFC] Intel GPU integration

Zantares commented 10 months ago

This RFC is from Intel, following XLA GPU roadmap to integrate Intel GPU as in-tree device in OpenXLA. Hope to get your suggestion for this proposal, thanks!

Zantares commented 10 months ago

Hi all, thanks for all of your discussion, I'm glad to see so many suggestions!

Here are some updates/plans for future work after simple talked with Google:

We agree to eliminate the #ifdef in code and upstream the common changes first, since it's very simple and can help to parameterize pass API to support more pluggable device in future
Still need more discussion for runtime part. We can continue to discuss it in this RFC, or wait to see if the maintainer can give more info
We have set public CI demo in public repo Intel® Extension for OpenXLA*, and plan to hook it to OpenXLA CI for any possible testing. Welcome to check it!

penpornk commented 9 months ago

Hi all,

Sorry for the delayed response! I somehow missed that the RFC was opened until now. Below is what I understood from the discussion:

Intel's motivations for upstreaming some parts of the Intel GPU PJRT plug-in:

Integrating at PJRT level has no easy way to reuse XLA HLO optimization passes.
- Intel has to hard-copy 90% of the code into the plug-in's repo and periodically resync. The actual changes to the passes are minor and are not worth this much maintenance effort.
Some code isn't specific to Intel GPU. Upstreaming them would benefit more devices, i.e., SYCL and SPIR-V devices.
Out of sight, out of mind. According to Intel:
1. Users may be more aware of Intel GPU support if it is in-tree.
  - I think this is orthogonal. We could reach similar user-awareness if Intel GPU support is listed in XLA's main page/documentations/blogs, etc.
2. In-tree developers may be more likely to factor in Intel GPU programming model in their new feature development/design if the code is in-tree.
  - This seems addressable with a good shared interface, which is orthogonal to where the code lives.
3. On the other hand, it's also easy for plug-in code to develop things that don't align to in-tree code in a long-term in favor of faster development velocity.

So far, we seem to all agree that:

#ifdefs interspersed in code is bad and should be avoided as much as possible.
- As @stellaraccident pointed out, it's also detrimental to motivation 3.2, as it makes the code diverge and has nothing to enforce developers to think of the shared infrastructure. Having a good shared abstraction/interface is beneficial to motivation 3.2.
XLA:GPU needs to be more modularized and have more integration points.
It makes sense to upstream device-generic code.

There are also different levels of in-tree-ness/modularity that should be distinguished:

T1: In-tree, code interspersed with #ifdefs: A resounding no from everyone.
T2: In-tree, mostly modularized backend code:
- Separate files implementing/extending a shared interface, with the module registration happening when build with some Bazel config flag.
T3: Out-of-tree as a separate repo, still a single binary:
- XLA depends on this repo as a Bazel http_repository.
- Builds as a single XLA binary with some Bazel config flag.
T4: Out-of-tree as a separate repo, separate plug-in binary to be installed alongside XLA:
- This is what the current Intel GPU PJRT plug-in does.

If I understand correctly, Intel proposes upstreaming 3 separate components T2-style, and will have fast-moving, Intel-specific features (not mentioned in this RFC) remain in the plug-in (T4). @joker-eph initially proposed the 3 components should be done T3-style, but later on seemed to be okay with T2 as well, did I get this right?

Let's decide on each component separately?

LLVM IR changes
1. Integrate SPIR-V translator: I think this should be upstreamed since it benefits any SPIR-V-supported devices.
  - @jpienaar do you have a requirement for it to be a pluggable pass at the end?
2. The rest of the changes seem specific to Intel GPU and we'll likely need to look at the code to see how we can make it more parameterized/modularized.
Lib call: The lib call rewrites happen in HLO optimization passes.
- As mentioned in our CPU Strategy, we are defining a formal integration point for 3rd-party HLO/MLIR-based graph rewrite/optmization passes. The same work applies to XLA:GPU. We would also like to learn what improvements we can make to existing passes to make them more reusable, and if we need to add more extension points (and where).
- @Zantares Could you please give some code pointers to the changes you made to the existing HLO passes in your plug-in repo? We could schedule a public meeting call to walk through the code as well if it's easier.
XLA:GPU Runtime: For calling SYCL-supported devices.
- Based on the device-generic principle, it seems like it should be upstreamed, so other devices can build upon this interface.
- 3,000 lines of code is nontrivial, but if it is in a separate backend folder that should be modular enough.
- One concern is about maintenance, as Google doesn't have the expertise to maintain it. TensorFlow used to have in-tree SYCL support but we deleted it [issue, commit] because we weren't aware of users, and there was no activities on that code in a while.
- I don't have a preference between out-of-tree (T3/T4) vs in-tree (T2) and delete if things are inactive -- I'm assuming Intel will help maintain the backend. What do people think?

joker-eph commented 9 months ago

Thanks for the great summary @penpornk!

@joker-eph initially proposed the 3 components should be done T3-style, but later on seemed to be okay with T2 as well, did I get this right?

I think there is a nuance :) To me all the shared components / library probably makes sense in-tree: that is T2 (even though: things could also be modular enough to be their own repo as well: imagine the SPIRV runtime in its own repo, with clean enough API that any SPIRV platform could just reuse it directly? The drawback is that a constellation of small components can become harder to manage/integrate/evolve).

On the other hand for the "non shared components", it's not clear to me why T2 is the best, it seems worth evaluating more closely if every platform could have its own repo for the non-shared code (without duplicating code: that would defeat the point of course), that is T4 I think? (which is how the TPU compiler is setup somehow).

Now: it is possible that 90% of what Intel has here is in the "T2" category and the T4 part is very small, or it is possible that it's the opposite, or anywhere in the middle. I haven't tried to make an assessment of this. What seemed valuable was to start upstreaming the things that are clearly "T2 component" cleanly, and then see where we are?

stellaraccident commented 9 months ago

I don't have a preference between out-of-tree (T3/T4) vs in-tree (T2) and delete if things are inactive -- I'm assuming Intel will help maintain the backend. What do people think?

I think that for non NVIDIA GPU targets, it is a very non abstract (and primarily, non technically driven) desire to be in tree: in my experience, core developers for this kind of system -- which has grown in a non modular fashion against one target, benefit from a tangible itch to scratch. Putting the burden on those who come second or third to take the step to live outside of the peripheral vision of the primary authors is unfair and will not produce a good result with the style of development that the core authors do on this component. Case in point: it was three weeks before someone affiliated with primary stewardship of the project commented (and thank you for doing so, Penporn -- much appreciated). With that kind of dynamic, code locality is the only lever available to bias towards equal access to the platform, and the option to develop in tree should be retained.

I think that the GPU vendors should be colocated at this point: if one is in tree, then they all should be as a matter of policy. That can be reevaluated if the primary authors publish a more detailed plan and level of support track record for this kind of situation. It would be different if this were the fourth or fifth implementation and/or there was a strong track record of multi targeting for the GPU component specifically. But in effect, this is the second. The only way to evolve the situation from this point is colocation, imo.

Zantares commented 9 months ago

Thanks for the summary @penpornk ! We will have some internal discussion first and give the feedback later.

penpornk commented 9 months ago

Thank you all for the quick replies! :) The tally so far:

Proposal	Shared code (multi-vendor)	Vendor-specific code
Intel	T2	T2
@joker-eph	T2 (T3 if/when makes sense)	T4
@stellaraccident	T2	T2

I also got feedback from JAX/PJRT. The code structure underneath doesn't matter to them, they only require that the device support for JAX still be built and packaged as a separate PJRT plug-in (even if the code lives in the same XLA tree).

Our team will look more into the Intel GPU plug-in code before making per-component suggestions too. The goal is to still keep XLA in good health with reasonable work scopes that fits each party's high-level goals and timelines.

Re: @joker-eph:

Now: it is possible that 90% of what Intel has here is in the "T2" category and the T4 part is very small, or it is possible that it's the opposite, or anywhere in the middle. I haven't tried to make an assessment of this. What seemed valuable was to start upstreaming the things that are clearly "T2 component" cleanly, and then see where we are?

Great question. @Zantares, do you have a quick answer here? Starting with clearly-T2 things makes sense to me. Maybe we can start with the LLVM-SPIRV translator PR?

@joker-eph Do you consider the SYCL runtime support "clearly T2"?

Re: @stellaraccident:

Putting the burden on those who come second or third to take the step to live outside of the peripheral vision of the primary authors is unfair and will not produce a good result with the style of development that the core authors do on this component.

I agree. Device vendors may help with core code however they can out of goodwill / necessity to make things happen, but it's not their responsibility to do so. Core maintainers need to be actively involved.

Also +1 to @Jianhui-Li's point:

From the PluggableDevice experience, I actually think that we need a modular maintainer's guidance on the technical direction. It is not practical to propose any structural changes if the modular maintainer is planning something else.

While Intel contributed most of the PluggableDevice design and implementation, it was done in close collaboration with TF core maintainers (Google). Some examples:

Google suggested targeting classic TF runtime instead of TFRT because of time constraints and code maturity.
Google decided not to rush a new, generic device C API and proposed the StreamExecutor C API RFC as a short-term solution.
Google implemented the StreamExecutor C API [1, 2, 3, etc.] which enabled Intel to start prototyping PluggableDevice.
Google consistently made design suggestions for Intel's PluggableDevice and Grappler C API RFCs.
Google refactored sharable accelerator code from //tensorflow/core/common_runtime/gpu to a generic //tensorflow/core/common_runtime/device [1, 2, 3, 4, etc] for the new pluggable device type to use.
etc.

We apologize for the delay in responding to this RFC and aim to have better response times from now on.

Re: @stellaraccident:

With that kind of dynamic, code locality is the only lever available to bias towards equal access to the platform, and the option to develop in tree should be retained.

I think that the GPU vendors should be colocated at this point: if one is in tree, then they all should be as a matter of policy. That can be reevaluated if the primary authors publish a more detailed plan and level of support track record for this kind of situation. It would be different if this were the fourth or fifth implementation and/or there was a strong track record of multi targeting for the GPU component specifically. But in effect, this is the second. The only way to evolve the situation from this point is colocation, imo.

Great point here! I'll bring this up to the team.

ezhulenev commented 9 months ago

Regarding Intel GPU support in StreamExecutor, I'd love it to be in tree as stream_executor/sycl (in addition to cuda and rocm) with minimal #ifdefs for Intel specific parts. Today we don't run any tests for rocm platform, and rely on AMD to run them and send patches, we can use the same approach for sycl (until we figure out how to do CI as a part of OpenXLA checks)

Also StreamExecutor today is really a hardware abstraction layer for XLA, so we are considering renaming the namespace to xla::hal and moving to hal folder, and do long overdue clean up. We don't have plans to replace it, we'll continue our investment in it.

joker-eph commented 9 months ago

@joker-eph Do you consider the SYCL runtime support "clearly T2"?

Seems like something that other SYCL vendors would want instead of being a specific component of the IntelGPU backend right? If so I would say yes here.

While Intel contributed most of the PluggableDevice design and implementation, it was done in close collaboration with TF core maintainers (Google).

Absolutely, and actually I alluded to this work (without the links you have!) before in my one of my comments as an example of a "success story". It has to be collaborative, but it still implies "work" from the "new platforms" to help build their extension points.

Zantares commented 9 months ago

@penpornk @joker-eph Quick answer for your question:

Now: it is possible that 90% of what Intel has here is in the "T2" category and the T4 part is very small, or it is possible that it's the opposite, or anywhere in the middle. I haven't tried to make an assessment of this. What seemed valuable was to start upstreaming the things that are clearly "T2 component" cleanly, and then see where we are?

It belongs to the 1st situation for SPIRV target part. We only have a few code changes here (~250 LoC).

LLVM IR changes i. Integrate SPIR-V translator: I think this should be upstreamed since it benefits any SPIR-V-supported devices. @jpienaar do you have a requirement for it to be a pluggable pass at the end? ii. The rest of the changes seem specific to Intel GPU and we'll likely need to look at the code to see how we can make it more parameterized/modularized.

In addition, there's chance to be parameterized for the rest changes in my thought:

Use generic SPIRV instrinsic instead
Add device check (or new PJRT API) to distinguish address index for different device instead of hard code if possible

We can discuss it after preparing the code example, thanks!

cheshire commented 9 months ago

@Zantares There is a number of backend-specific contributions Intel could provide, is it possible to elaborate on planned changes?

There are the following areas which might require changes:

Outlining sections of HLO instructions into Intel library calls, and plumbing Intel library calls into SE
Modifying pre-Triton codegen to support Intel intrinsics
Modifying Triton itself/Triton tiling logic to support Intel GPUs
Modifying set of fusion heuristics, to perform different logic for Intel if necessary.

Is the set of planned changes mainly based around (1)? Is it possible to see the preview of features-to-land?

Zantares commented 9 months ago

Lib call: The lib call rewrites happen in HLO optimization passes.

As mentioned in our CPU Strategy, we are defining a formal integration point for 3rd-party HLO/MLIR-based graph rewrite/optmization passes. The same work applies to XLA:GPU. We would also like to learn what improvements we can make to existing passes to make them more reusable, and if we need to add more extension points (and where).

@Zantares Could you please give some code pointers to the changes you made to the existing HLO passes in your plug-in repo? We could schedule a public meeting call to walk through the code as well if it's easier.

To @penpornk , sorry I have missed this point before. Please check below link to see how we use oneDNN lib call in plug-in. We also started to check CPU RFC to see how to reuse the 3rd-party passes to complete this work.

Modify Thunk to run oneDNN Conv: https://github.com/intel/intel-extension-for-openxla/blob/0.2.0/third_party/openxla.patch#L376-L450
Modify Rewriter to support specific fusion: https://github.com/intel/intel-extension-for-openxla/blob/0.2.0/third_party/openxla.patch#L451-L496
Custom compiler to call Rewriter: https://github.com/intel/intel-extension-for-openxla/blob/0.2.0/xla/service/gpu/spir_compiler.cc#L111-L112

There are the following areas which might require changes:

Outlining sections of HLO instructions into Intel library calls, and plumbing Intel library calls into SE

Modifying pre-Triton codegen to support Intel intrinsics

Modifying Triton itself/Triton tiling logic to support Intel GPUs

Modifying set of fusion heuristics, to perform different logic for Intel if necessary.

Is the set of planned changes mainly based around (1)? Is it possible to see the preview of features-to-land?

To @cheshire , I assume Q1 is already answered above. For Q2 & Q3, we have already made some experiments to support Triton in our plug-in, it still needs lots of work. For Q4, right now in plug-in we reused most of public code and use macro to add specific fusion, in future we want to combine this work with new proposed Global Cost feature in OpenXLA. In general, Q1 & Q4 are based around (1), you can check plug-in first to see the preview: https://github.com/intel/intel-extension-for-openxla. The Triton related part is another story which is not covered in this RFC, but I can say we really have plan to support Triton for Intel GPU.

Zantares commented 9 months ago

Regarding Intel GPU support in StreamExecutor, I'd love it to be in tree as stream_executor/sycl (in addition to cuda and rocm) with minimal #ifdefs for Intel specific parts. Today we don't run any tests for rocm platform, and rely on AMD to run them and send patches, we can use the same approach for sycl (until we figure out how to do CI as a part of OpenXLA checks)

Hi @ezhulenev , does it mean that OpenXLA repo will send the PR patch to AMD CI and get its feedback? Is this detail available in public? We also want to provide similar check here.

ezhulenev commented 9 months ago

No, currently it works as a post-submit, CI jobs runs periodically with latest commit and then we accept patches. @ddunl might know if we have any plans for pre-submit CI jobs for AMD, and do we have plans for supporting other CI jobs. I'd love to have an options to include backend-specific CI jobs optionally for a PR. Not sure about enabling it by default, as even with current set of jobs it takes a long time to get back results.

Zantares commented 9 months ago

No, currently it works as a post-submit, CI jobs runs periodically with latest commit and then we accept patches. @ddunl might know if we have any plans for pre-submit CI jobs for AMD, and do we have plans for supporting other CI jobs. I'd love to have an options to include backend-specific CI jobs optionally for a PR. Not sure about enabling it by default, as even with current set of jobs it takes a long time to get back results.

Thanks for the feedback! It's a good idea that take backend-specific CI as post-submit to avoid possible blocking situations. We'll start from this. And we agree that not enabling backend-specific CI jobs by default. Maybe a better way is to trigger it by GitHub label/specific comment on demand.

ddunl commented 9 months ago

I think for now we don't have specific plans for supporting external CI in general, just that we know we want to do it. Definitely backend-specific CI with the ability to trigger on demand is great though, and would be very helpful

Zantares commented 9 months ago

Hi all, the 1st PR of SPRI-V target is submitted: https://github.com/openxla/xla/pull/7940.

Zantares commented 7 months ago

The SYCL runtime is submitted: https://github.com/openxla/xla/pull/9042

Note this big PR is for preview and CI test, we will continue to split it into small PRs and welcome your suggestions for this PR. Please refer to comment https://github.com/openxla/xla/pull/9042#issuecomment-1920813852 for more details

openxla / community

[RFC] Intel GPU integration #99