Closed by @Zantares 8 months ago
Hi all, thanks for all of your discussion, I'm glad to see so many suggestions!
Here are some updates/plans for future work after a quick talk with Google:
- Remove `#ifdef`s in the code and upstream the common changes first, since this is simple and can help parameterize the pass API to support more pluggable devices in the future.

Hi all, sorry for the delayed response! I somehow missed that the RFC was opened until now. Below is what I understood from the discussion:
Intel's motivations for upstreaming some parts of the Intel GPU PJRT plug-in:
So far, we seem to all agree that:
`#ifdef`s interspersed in code are bad and should be avoided as much as possible.
There are also different levels of in-tree-ness/modularity that should be distinguished:
`http_repository`. If I understand correctly, Intel proposes upstreaming 3 separate components T2-style, and will have fast-moving, Intel-specific features (not mentioned in this RFC) remain in the plug-in (T4). @joker-eph initially proposed the 3 components should be done T3-style, but later on seemed to be okay with T2 as well, did I get this right?
Let's decide on each component separately?
Thanks for the great summary @penpornk!
> @joker-eph initially proposed the 3 components should be done T3-style, but later on seemed to be okay with T2 as well, did I get this right?
I think there is a nuance :) To me, all the shared components/libraries probably make sense in-tree: that is T2 (even though things could also be modular enough to live in their own repos: imagine the SPIRV runtime in its own repo, with a clean enough API that any SPIRV platform could just reuse it directly? The drawback is that a constellation of small components can become harder to manage/integrate/evolve).
On the other hand, for the "non-shared components", it's not clear to me why T2 is best; it seems worth evaluating more closely whether every platform could have its own repo for the non-shared code (without duplicating code: that would defeat the point, of course). That is T4, I think? (Which is how the TPU compiler is set up, somehow.)
Now: it is possible that 90% of what Intel has here is in the "T2" category and the T4 part is very small, or it is possible that it's the opposite, or anywhere in the middle. I haven't tried to make an assessment of this. What seemed valuable was to start upstreaming the things that are clearly "T2 component" cleanly, and then see where we are?
- I don't have a preference between out-of-tree (T3/T4) and in-tree (T2) with deletion if things go inactive -- I'm assuming Intel will help maintain the backend. What do people think?
I think that for non-NVIDIA GPU targets, the desire to be in tree is very non-abstract (and primarily not technically driven): in my experience, the core developers of this kind of system -- which has grown in a non-modular fashion against one target -- benefit from a tangible itch to scratch. Putting the burden on those who come second or third to take the step to live outside the peripheral vision of the primary authors is unfair and will not produce a good result with the style of development the core authors practice on this component. Case in point: it was three weeks before someone affiliated with primary stewardship of the project commented (and thank you for doing so, Penporn -- much appreciated). With that kind of dynamic, code locality is the only lever available to bias towards equal access to the platform, and the option to develop in tree should be retained.
I think that the GPU vendors should be colocated at this point: if one is in tree, then they all should be as a matter of policy. That can be reevaluated if the primary authors publish a more detailed plan and level of support track record for this kind of situation. It would be different if this were the fourth or fifth implementation and/or there was a strong track record of multi targeting for the GPU component specifically. But in effect, this is the second. The only way to evolve the situation from this point is colocation, imo.
Thanks for the summary @penpornk! We will have some internal discussion first and give feedback later.
Thank you all for the quick replies! :) The tally so far:
| Proposal | Shared code (multi-vendor) | Vendor-specific code |
|---|---|---|
| Intel | T2 | T2 |
| @joker-eph | T2 (T3 if/when it makes sense) | T4 |
| @stellaraccident | T2 | T2 |
I also got feedback from JAX/PJRT. The code structure underneath doesn't matter to them, they only require that the device support for JAX still be built and packaged as a separate PJRT plug-in (even if the code lives in the same XLA tree).
Our team will also look more into the Intel GPU plug-in code before making per-component suggestions. The goal is still to keep XLA in good health, with reasonable work scopes that fit each party's high-level goals and timelines.
Re: @joker-eph:
> Now: it is possible that 90% of what Intel has here is in the "T2" category and the T4 part is very small, or it is possible that it's the opposite, or anywhere in the middle. I haven't tried to make an assessment of this. What seemed valuable was to start upstreaming the things that are clearly "T2 component" cleanly, and then see where we are?
Great question. @Zantares, do you have a quick answer here? Starting with clearly-T2 things makes sense to me. Maybe we can start with the LLVM-SPIRV translator PR?
@joker-eph Do you consider the SYCL runtime support "clearly T2"?
Re: @stellaraccident:
> Putting the burden on those who come second or third to take the step to live outside of the peripheral vision of the primary authors is unfair and will not produce a good result with the style of development that the core authors do on this component.
I agree. Device vendors may help with core code however they can out of goodwill / necessity to make things happen, but it's not their responsibility to do so. Core maintainers need to be actively involved.
Also +1 to @Jianhui-Li's point:
> From the PluggableDevice experience, I actually think that we need a modular maintainer's guidance on the technical direction. It is not practical to propose any structural changes if the modular maintainer is planning something else.
While Intel contributed most of the PluggableDevice design and implementation, it was done in close collaboration with TF core maintainers (Google). Some examples: generalizing `//tensorflow/core/common_runtime/gpu` to a generic `//tensorflow/core/common_runtime/device` [1, 2, 3, 4, etc.] for the new pluggable device type to use.

We apologize for the delay in responding to this RFC and aim to have better response times from now on.
Re: @stellaraccident:
> With that kind of dynamic, code locality is the only lever available to bias towards equal access to the platform, and the option to develop in tree should be retained.
>
> I think that the GPU vendors should be colocated at this point: if one is in tree, then they all should be as a matter of policy. That can be reevaluated if the primary authors publish a more detailed plan and level of support track record for this kind of situation. It would be different if this were the fourth or fifth implementation and/or there was a strong track record of multi targeting for the GPU component specifically. But in effect, this is the second. The only way to evolve the situation from this point is colocation, imo.
Great point here! I'll bring this up to the team.
Regarding Intel GPU support in StreamExecutor, I'd love for it to be in tree as `stream_executor/sycl` (in addition to `cuda` and `rocm`), with minimal `#ifdef`s for Intel-specific parts. Today we don't run any tests for the rocm platform; we rely on AMD to run them and send patches. We can use the same approach for sycl (until we figure out how to do CI as part of the OpenXLA checks).
Also, StreamExecutor today is really a hardware abstraction layer for XLA, so we are considering renaming the namespace to `xla::hal`, moving it to a `hal` folder, and doing some long-overdue cleanup. We don't have plans to replace it; we'll continue our investment in it.
> @joker-eph Do you consider the SYCL runtime support "clearly T2"?
Seems like something that other SYCL vendors would want, rather than a component specific to the Intel GPU backend, right? If so, I would say yes here.
> While Intel contributed most of the PluggableDevice design and implementation, it was done in close collaboration with TF core maintainers (Google).
Absolutely, and actually I alluded to this work (without the links you have!) in one of my earlier comments as an example of a "success story". It has to be collaborative, but it still implies "work" from the "new platforms" to help build their extension points.
@penpornk @joker-eph Quick answer for your question:
> Now: it is possible that 90% of what Intel has here is in the "T2" category and the T4 part is very small, or it is possible that it's the opposite, or anywhere in the middle. I haven't tried to make an assessment of this. What seemed valuable was to start upstreaming the things that are clearly "T2 component" cleanly, and then see where we are?
It belongs to the first situation for the SPIR-V target part. We only have a few code changes here (~250 LoC).

LLVM IR changes:
i. Integrate the SPIR-V translator: I think this should be upstreamed since it benefits any SPIR-V-supported device. @jpienaar, do you have a requirement for it to be a pluggable pass in the end?
ii. The rest of the changes seem specific to Intel GPU, and we'll likely need to look at the code to see how we can make them more parameterized/modularized.
In addition, I think there's a chance to parameterize the rest of the changes (e.g., use a SPIR-V intrinsic instead). We can discuss it after preparing the code example, thanks!
@Zantares There are a number of backend-specific contributions Intel could provide; is it possible to elaborate on the planned changes?
There are the following areas which might require changes:
1. Outlining sections of HLO instructions into Intel library calls, and plumbing Intel library calls into SE
2. Modifying pre-Triton codegen to support Intel intrinsics
3. Modifying Triton itself/Triton tiling logic to support Intel GPUs
4. Modifying the set of fusion heuristics, to perform different logic for Intel if necessary.

Is the set of planned changes mainly based around (1)? Is it possible to see a preview of the features-to-land?
Lib call: The lib call rewrites happen in HLO optimization passes.
- As mentioned in our CPU Strategy, we are defining a formal integration point for 3rd-party HLO/MLIR-based graph rewrite/optimization passes. The same work applies to XLA:GPU. We would also like to learn what improvements we can make to existing passes to make them more reusable, and whether we need to add more extension points (and where).
- @Zantares Could you please give some code pointers to the changes you made to the existing HLO passes in your plug-in repo? We could schedule a public call to walk through the code as well if that's easier.
To @penpornk: sorry, I missed this point before. Please check the link below to see how we use the oneDNN lib call in the plug-in. We have also started checking the CPU RFC to see how to reuse 3rd-party passes to complete this work.
> There are the following areas which might require changes:
> - Outlining sections of HLO instructions into Intel library calls, and plumbing Intel library calls into SE
> - Modifying pre-Triton codegen to support Intel intrinsics
> - Modifying Triton itself/Triton tiling logic to support Intel GPUs
> - Modifying set of fusion heuristics, to perform different logic for Intel if necessary.
>
> Is the set of planned changes mainly based around (1)? Is it possible to see the preview of features-to-land?
To @cheshire: I assume Q1 is already answered above. For Q2 & Q3, we have already run some experiments to support Triton in our plug-in; it still needs lots of work. For Q4, right now the plug-in reuses most of the public code and uses macros to add specific fusions; in the future we want to combine this work with the newly proposed Global Cost feature in OpenXLA. In general, Q1 & Q4 are based around (1); you can check the plug-in first to see a preview: https://github.com/intel/intel-extension-for-openxla. The Triton-related part is another story not covered in this RFC, but I can say we do have a plan to support Triton for Intel GPU.
> Regarding Intel GPU support in StreamExecutor, I'd love it to be in tree as `stream_executor/sycl` (in addition to cuda and rocm) with minimal #ifdefs for Intel specific parts. Today we don't run any tests for rocm platform, and rely on AMD to run them and send patches, we can use the same approach for sycl (until we figure out how to do CI as a part of OpenXLA checks)
Hi @ezhulenev, does this mean that the OpenXLA repo will send PR patches to AMD CI and get its feedback? Are the details of this available publicly? We would like to provide a similar check here.
No, currently it works as post-submit: CI jobs run periodically against the latest commit and then we accept patches. @ddunl might know if we have any plans for pre-submit CI jobs for AMD, and whether we have plans for supporting other CI jobs. I'd love to have an option to include backend-specific CI jobs optionally for a PR. I'm not sure about enabling it by default, as even with the current set of jobs it takes a long time to get results back.
> No, currently it works as a post-submit, CI jobs runs periodically with latest commit and then we accept patches. @ddunl might know if we have any plans for pre-submit CI jobs for AMD, and do we have plans for supporting other CI jobs. I'd love to have an options to include backend-specific CI jobs optionally for a PR. Not sure about enabling it by default, as even with current set of jobs it takes a long time to get back results.
Thanks for the feedback! Running backend-specific CI as post-submit to avoid possible blocking situations is a good idea; we'll start from this. And we agree on not enabling backend-specific CI jobs by default. Maybe a better way is to trigger them on demand via a GitHub label or a specific comment.
I think for now we don't have specific plans for supporting external CI in general, just that we know we want to do it. Definitely backend-specific CI with the ability to trigger on demand is great though, and would be very helpful
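One way to wire up the label-triggered, on-demand backend CI discussed above is a GitHub Actions workflow gated on a PR label. This is a hypothetical fragment: the workflow name, the `ci:sycl` label, and the build script are placeholders, not anything OpenXLA has committed to.

```yaml
# Hypothetical workflow fragment: run the SYCL backend job only when a
# maintainer adds the "ci:sycl" label to the pull request.
name: sycl-backend-ci
on:
  pull_request:
    types: [labeled]
jobs:
  sycl-tests:
    if: contains(github.event.pull_request.labels.*.name, 'ci:sycl')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./build_and_test_sycl.sh  # placeholder build/test script
```

Gating on the `labeled` event keeps the job entirely opt-in, so it never adds latency to the default pre-submit checks.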
Hi all, the 1st PR of the SPIR-V target is submitted: https://github.com/openxla/xla/pull/7940.
The SYCL runtime PR is submitted: https://github.com/openxla/xla/pull/9042
Note this big PR is for preview and CI testing; we will continue to split it into smaller PRs and welcome your suggestions on it. Please refer to the comment https://github.com/openxla/xla/pull/9042#issuecomment-1920813852 for more details.
This RFC is from Intel, following the XLA GPU roadmap, to integrate Intel GPU as an in-tree device in OpenXLA. We hope to get your suggestions on this proposal, thanks!