openxla / community

Stores documents and resources used by the OpenXLA developer community

RFC - Creating an openxla-nvgpu project #71

Closed stellaraccident closed 1 year ago

stellaraccident commented 1 year ago

Now that RFC - Proposal to Build IREE Compiler Plugin Mechanism has been implemented and is minimally working with both an in-tree and an early-adopter out-of-tree implementation, we are ready to propose the official creation of a new openxla-nvgpu project to house:

Since this will be the first plugin project of note, we expect that, to a certain extent, it will co-evolve with the plugin mechanisms.

Proposed Directory Structure

openxla-nvgpu will have a similar directory layout to IREE, upon which it depends:

compiler/
  src/
    openxla_nvgpu/
      Dialects/
runtime/
  src/
    openxla_nvgpu/
build_tools/

The build setup will use bazel_to_cmake for consistency and interop with the upstream IREE project (now that https://github.com/openxla/iree/pull/12765 has taken steps to allow it to be used out of tree). The corresponding IREE build macros will be extended as needed.

Dependencies

This project depends directly on:

It transitively depends on:

Releases

In the very short term, this project will not produce independent releases and will aim to be usable by project developers who are willing to build it from HEAD.

In the longer term, we would like to introduce a new top-level openxla-compiler releaser project responsible for packaging the platform (IREE) with stable versions of all supported vendor compiler plugins and making them available as a standalone binary release. Such a project would eventually depend on this one and would effectively be the plugin-aggregated release of the present-day iree-compiler packages (which will continue to be released as the "vanilla" platform without out-of-kind platform-specific dependencies).

Also in the longer term, as the PJRT plugin support evolves, we anticipate releasing openxla-nvgpu-pjrt binary packages that can be used to interface NVIDIA GPUs to supported ML frameworks via pip install.

Versioning and CI

Adding this top-level project pushes us firmly into a "many-repo" layout for OpenXLA projects. This will be further reinforced as IREE's dependencies and build tools are disaggregated over time and the top-level releaser projects are established.

As part of this, we will introduce a side-by-side workspace layout where dependencies are found relative to each other under a common parent directory. Example:

iree/
xla/
openxla-pjrt-plugin/
openxla-nvgpu/
openxla-compiler-releaser/

Such a layout will be called an "OpenXLA workspace", and we will provide a sync script and CI tooling to help manage it. Each project will pin to green release tags in its parent (or another form of stable commit tracking) by maintaining a local metadata file of its OpenXLA dependencies. The sync script will be both a simple way for developers to track known-good sync points for the workspace and a mechanism for CIs to advance them. A CI bot will advance dependent projects to the next stable sync points automatically. We expect that for projects that already have a strong release cadence, like IREE, this will update pins to new nightly releases; others will cascade from at-head commits.

This process of versioning will be developed over time with an eye towards being the one way to manage OpenXLA project dependencies. It will likely be somewhat manual to start with. It will be derived in spirit from the PJRT plugin sync script and enhanced to provide better release tracking and version-bump capabilities.

Benchmarking

Benchmarks of the NVIDIA toolchain will be largely inherited from the IREE project but run independently, so as to provide a continuous view of the performance characteristics of, and deltas between, the platform-independent upstream and the vendor-specific downstream.

Next steps

As an immediate next step, the openxla-nvgpu project will be bootstrapped in the iree-samples repository. It will be relocated, with history, to a new git repository once this RFC has matriculated.

Project Ownership

The project will be set up as a collaboration between Google and NVIDIA, and per OpenXLA governance, will share maintainer responsibility between contributors from both companies with the goal of NVIDIA engineers taking on core maintainer responsibility as the project bootstraps and evolves.

pjannaty commented 1 year ago

Fantastic! Looking forward to the collaboration.

pjannaty commented 1 year ago

Will this sit directly under https://github.com/openxla as in openxla/openxla-nvgpu?

stellaraccident commented 1 year ago

> Will this sit directly under https://github.com/openxla as in openxla/openxla-nvgpu?

Yes, that is what I'm proposing.

pjannaty commented 1 year ago

cc @nluehr

mjsML commented 1 year ago

Does that translate to “write access”?

stellaraccident commented 1 year ago

> Does that translate to “write access”?

Yes (which we already do on these repos for other contributors), but I tried to phrase it more generally. Being a "component maintainer" in OpenXLA parlance grants some other privileges in terms of overall OpenXLA direction/management.

ezhulenev commented 1 year ago

What will be the C++ namespace for all the new code in this project? ::xla::...?

stellaraccident commented 1 year ago

> What will be the C++ namespace for all the new code in this project? ::xla::...?

Not sure I would speak to "all of the new code", but historically, dialects and components of IREE take a namespace like ::mlir::iree_compiler::IREE::<COMPONENT>, which nests well by letting everything resolve components via IREE::. For the record, I don't love that it is all rooted under mlir, but it has been that way for a long time and could be cleaned up.

If not extending that namespacing scheme, I'd encourage at least something similar: ::openxla_compiler::XLA. I've deviated from this a couple of times in the past, and I've always regretted it and come back to normalize.
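To make the comparison concrete, here is a minimal sketch of the two schemes; the component and class names are hypothetical placeholders, not decisions:

```cpp
// Hypothetical illustration only: "CUDNN" and "ConvOp" are placeholder names.

// IREE-style scheme: rooted under ::mlir, with components resolvable via IREE::.
namespace mlir::iree_compiler::IREE::CUDNN {
class ConvOp;
}  // namespace mlir::iree_compiler::IREE::CUDNN

// A similar project-rooted alternative, as suggested above.
namespace openxla_compiler::XLA {
class ConvOp;
}  // namespace openxla_compiler::XLA

// With the IREE-style scheme, downstream code typically does:
//   using namespace mlir::iree_compiler;
//   IREE::CUDNN::ConvOp op = ...;
```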

joker-eph commented 1 year ago

> Proposed Directory Structure
>
> openxla-nvgpu will have a similar directory layout to IREE, upon which it depends:
>
> compiler/
>   src/
>     openxla_nvgpu/
>       Dialects/
> runtime/
>   src/
>     openxla_nvgpu/
> build_tools/

I'm curious why the sub-directory names repeat "openxla". That is, why compiler/src/openxla_nvgpu instead of compiler/src/nvgpu? Even further, if the repository itself is all about nvgpu, why repeat it at all? What else will be inside compiler/src in this repo?

stellaraccident commented 1 year ago

It is just following the IREE convention of the include directory being rooted at src/ and wanting to arrive at fully qualified include paths (i.e. `#include "openxla_nvgpu/..."`).

I've more or less resigned myself to the fact that the least bad thing is to repeat yourself exactly once in service of having globally unique include paths.
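For illustration, a header under compiler/src/ would then be included with its project-qualified path; the specific dialect header below is a made-up example:

```cpp
// The include root is compiler/src/, so every include carries the globally
// unique "openxla_nvgpu/" prefix. The exact header name is hypothetical.
#include "openxla_nvgpu/Dialects/SomeDialect/IR/SomeDialectOps.h"
```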

ezhulenev commented 1 year ago

I'm not a big fan of the mlir top-level namespace, or of iree_compiler, since nvgpu is really just a project on top of the IREE compiler; I don't like the openxla_compiler namespace either, and I don't have any suggestions :)

What if I want to write a custom VM module under runtime/src/openxla_nvgpu? In IREE it's mostly under the iree and iree::vm namespaces; should it be ::openxla or regular ::xla here?

joker-eph commented 1 year ago

Oh, I missed the point about the include directory convention; makes sense! (I believe this convention is in place in mlir-hlo as well; I remember a rationale doc for the path structure, I think @burmako originated it, but I don't know if it is public.)

stellaraccident commented 1 year ago

> I'm not a big fan of the mlir top-level namespace, or of iree_compiler, since nvgpu is really just a project on top of the IREE compiler; I don't like the openxla_compiler namespace either, and I don't have any suggestions :)
>
> What if I want to write a custom VM module under runtime/src/openxla_nvgpu? In IREE it's mostly under the iree and iree::vm namespaces; should it be ::openxla or regular ::xla here?

Just throwing things out there... ::openxla::nvgpu? We're going to regret whatever we choose. Might as well at least not start with a namespace that is already used.

ezhulenev commented 1 year ago

I'd go with ::openxla, ::openxla::compiler, ::openxla::runtime, etc., and not mention nvgpu at all. Do we foresee a single project depending on multiple "openxla compilers", e.g. openxla-nvgpu and openxla-amdgpu? Or linking them into a single binary?

stellaraccident commented 1 year ago

I'm game to try it. Like I said, I'm pretty sure that whatever we pick, we'll come back in a few months with some code written and apply a bit of sed, but I do think we want something globally unique: there are many use cases where these will be linked together, for both recommended and unrecommended reasons. Let's not set ourselves up for accidental name collisions.

Also, with C++17, the cost of namespaces (in terms of keystrokes) is a lot lower.
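For example, with a C++17 nested namespace definition a fully qualified project namespace is a single line to open and close; the names below are placeholders, not a decision:

```cpp
// C++17 nested namespace definition: one declaration instead of two nested blocks.
// "openxla::nvgpu" and the class name are hypothetical placeholders.
namespace openxla::nvgpu {

class CudnnRuntimeModule {
  // ...
};

}  // namespace openxla::nvgpu
```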

joker-eph commented 1 year ago

I would be concerned about plugins and target-specific components redefining symbols in a shared namespace: I'm not sure what eliding the target from the namespace buys us in practice. (Nesting compilers under ::mlir (as in ::mlir::mhlo::, for example) is nice in that you get an implicit using namespace mlir; (maybe just a workaround for the Google style banning using namespace?), but we're still heavily using a specific sub-namespace to compartmentalize.)

stellaraccident commented 1 year ago

> maybe just a workaround for the Google style banning using namespace

(ftr - we're not exporting such legacy rules to our new OSS projects)

ezhulenev commented 1 year ago

Another directory naming question: why compiler/src/openxla_nvgpu and not compiler/src/openxla/nvgpu?

Will we also have some kind of openxla-base (util, ...) for libraries shared between different "backends", or is the plan to keep shared code in the iree repo?

Will we depend on absl/tsl? E.g. for logging, Status, StatusOr inside the nvgpu compiler/runtime? Or will the compiler use LLVM logging (almost non-existent) and the runtime use IREE logging (no idea what its status is)?

ezhulenev commented 1 year ago

What will the license be? Copyright 2023 The OpenXLA Authors ... Licensed under the Apache License v2.0 with LLVM Exceptions?

stellaraccident commented 1 year ago

> Will we also have some kind of openxla-base (util, ...) for libraries shared between different "backends", or is the plan to keep shared code in the iree repo?

That is being discussed, but we are biasing towards getting moving at the moment over busting out dependencies. We have some work to do to get the dev/versioning workflow going for the level of things we have, and I'd rather get some more mileage on that before we go too crazy with a lot of repositories.

> Will we depend on absl/tsl? E.g. for logging, Status, StatusOr inside the nvgpu compiler/runtime? Or will the compiler use LLVM logging (almost non-existent) and the runtime use IREE logging (no idea what its status is)?

For the more "core-ward" parts, we have no plan to depend on absl/tsl, and I would be somewhat resistant to doing so because both have proven problematic (so much so that we excised them after thinking "how bad could it be?").

Concrete thoughts...

I don't think we should be mixing universes in the compiler code; we need to "build up" from LLVM rather than grafting on other base libraries.

The runtime code for nvgpu has some more give to it from a dependency standpoint, but for the level of things expected to be in there, I would like to avoid the complexity that comes from taking complicated deps if possible.

Some of this stuff is preference and some has proven to be more trouble than it is worth in the past... The hard line that we can't cross in this repo is that dependencies must be thin and must have a well-supported CMake build.

stellaraccident commented 1 year ago

> Another directory naming question: why compiler/src/openxla_nvgpu and not compiler/src/openxla/nvgpu?

I don't have a preference.

sherhut commented 1 year ago

Never waste a good opportunity to bike-shed :smile:

Logging, as brought up by @ezhulenev, caught my eye. For the compiler, I would suggest we try to use the diagnostic handler infrastructure as much as possible and log warnings/errors there. That will force us to provide messages with good context. For "trace what this does" debugging use cases, I agree that LLVM_DEBUG is not great but will work for now. We will need to replace this later with something that has severity levels and other bells and whistles, but that can be done. I am used to the VLOG interface, and we could have a shim that maps to LLVM_DEBUG if we decide we already care. Maybe IREE wants to provide this akin to what tsl does, so that different integrations can swap in what fits their needs. Depending on tsl just for that seems a bit heavy.
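As a rough sketch of that split, assuming standard MLIR/LLVM facilities (the debug tag, function, and messages below are made up for illustration):

```cpp
#include "llvm/Support/Debug.h"
#include "mlir/IR/Operation.h"
#include "mlir/Support/LogicalResult.h"

#define DEBUG_TYPE "openxla-nvgpu-example"  // hypothetical -debug-only tag

// User-facing warnings/errors go through MLIR's diagnostic infrastructure,
// which attaches source locations and flows through the installed handler.
static mlir::LogicalResult verifyExample(mlir::Operation *op) {
  if (op->getNumResults() == 0)
    return op->emitError() << "expected the op to produce at least one result";

  // Developer-facing tracing uses LLVM_DEBUG; it is compiled out of release
  // builds and enabled with `-debug-only=openxla-nvgpu-example` otherwise.
  LLVM_DEBUG(llvm::dbgs() << "verified " << op->getName() << "\n");
  return mlir::success();
}
```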

Regarding namespaces, I agree with @stellaraccident: let's bias towards progress rather than the perfect choice for now. Having said that, I personally would use openxla::compiler::nvgpu for the CUDA compiler. We will have more compilers sooner rather than later, and having them in different namespaces helps my mental model of where code belongs. Might be xla bias, and I have no strong opinion. Just some :bike: :hut: :paintbrush:

Excited to see this project spin up!

stellaraccident commented 1 year ago

This has been a heavy overhead week for me but I should finally get some coding time this afternoon, and since I can probably bootstrap the project somewhat efficiently, I'll take a stab at that. As noted, I'll stage it in an iree-samples directory first and will then hand off to someone Google-side to create the repo (which requires a bit of red tape).

stellaraccident commented 1 year ago

All right... the above two commits seem to get me most of the way there. Things build, etc. Was a bit of a slog.

ezhulenev commented 1 year ago

And an initial cuDNN custom module that does nothing related to cuDNN yet: https://github.com/iree-org/iree-samples/pull/123/files

stellaraccident commented 1 year ago

Both of the initial commits have landed.

@theadactyl Can we get someone with admin rights on the org to create the openxla-nvgpu repo? It should be populated with https://github.com/iree-org/iree-samples/tree/main/openxla-nvgpu

GMNGeoffrey commented 1 year ago

Coming in late, and I think just agreeing with what's already been decided, but I have strong positions on: not taking an absl or tsl dep in the compiler, and scoping the [sub]namespace to the project (so not just "iree" or "openxla" or "compiler").

I do think that the Google (Titus) advice on not having deeply nested namespaces is pretty good and not just a weird Google thing: https://abseil.io/tips/130. Every level of nesting gives us a fun new place for collisions. So I would vote for something like openxla_nvgpu as a top-level namespace. Nesting it under mlir would probably be fine, since we shouldn't define things that conflict with MLIR anyway.

I think there was at some point a limitation of ODS that meant you had to define things in the mlir namespace, but I also suspect that this is a relic of the Google C++ style guide ban on using namespace mlir, which, while not completely misguided IMO, seems excessively strict. using namespace mlir seems better than sneakily getting the same effect by defining everything nested under that namespace. I guess the difference is that you can only do the latter for one namespace. I think we can probably limit ourselves to using namespace mlir and using namespace llvm and then not have to nest our namespaces, unless there's some other reason to do so.

ezhulenev commented 1 year ago

I don't like the openxla_nvgpu namespace, because presumably we'll have some kind of openxla-base shared by all OpenXLA compilers, and it's nice to share a namespace to skip qualifying all imports.

+HUGE to using namespace, but only for a small number of things (mlir and llvm).