
[RFC] Bootstrapping the TCP Dialect in Torch-MLIR #1366

Open · sjarus opened this issue 1 year ago

sjarus commented 1 year ago

We are announcing the effort to bootstrap the early development and technological demonstration of the TCP dialect (public specification, technical discussion) under llvm-external-projects within Torch-MLIR.

The TCP dialect intends to offer strong backend compilation connectivity to Linalg, Tensor, and other dialects, along with compiler-friendly support for dynamic shapes, broadcasting, and other capabilities. The Torch-MLIR pathway offers an opportunity to demonstrate these dialect capabilities and augment the existing backend paths.

The initial effort will focus on connectivity both into and out of the TCP dialect, emphasizing the technical value proposition, and we welcome input into the mechanics of the op construct and connectivity.

@navahgar @sanjoy @sjain-stanford @silvasean @jpienaar @stellaraccident

silvasean commented 1 year ago

Overall I'm supportive, but I think the MLIR community needs to make a final decision here. I would recommend that we elaborate a bit on the details of the bringup plan, expected timelines, "unknowns", scoping, etc. and make a new RFC thread on MLIR discourse. I think that having a demo branch with Mlp2LayerModule_basic running e2e as prep work for the RFC would give folks some concrete code to rally around and really accelerate things.

Especially Chris's point about "what's not part of TCP" is really useful to define (at least as a draft to rally around).

stellaraccident commented 1 year ago

Hey Sean, I obviously don't want to go against the community consensus here, but I do wonder if we are holding a bit of a different standard vs the rest of development that happens in torch-mlir (for which, it seems like this group is "the community"). For example, discussions about ONNX, MHLO connectivity, etc all happened here vs via LLVM side RFC.

But it also sounds like you would like to see some more definition, and I don't want to discount that and assume anything. I'm running behind from vacation and think I missed some back and forth. Would you mind deep linking to the question you would like to see answered?

(my personal opinion is that we have struggled a lot with the "mkdir" problem on this, and I'm somewhat interested in seeing motivated engineers be able to make contributions vs being homeless)

stellaraccident commented 1 year ago

How about to unblock progress, you all just create a tcp branch in the torch-mlir repo. Work there, and review each patch per usual guidelines as if it is going into mainline. Then we rebase/merge the whole thing once we are able to discuss.

There shouldn't be much difference in development flow between working on the tcp branch and main. That is pretty much in line with normal development flows in the incubators.

sjain-stanford commented 1 year ago

How about to unblock progress, you all just create a tcp branch in the torch-mlir repo. Work there, and review each patch per usual guidelines as if it is going into mainline. Then we rebase/merge the whole thing once we are able to discuss.

This is a good suggestion. I just created an mlir-tcp branch in this repo. We will change the base branch for all TCP PRs to use mlir-tcp from now on. Each PR can be reviewed as usual, and at a later point in time we can do a single mlir-tcp -> main rebase/merge PR that would not need a re-review, just the CI to pass.

stellaraccident commented 1 year ago

sg. Just keep the history clean and reviewed. Then, assuming we decide to go forward with it at a future checkpoint, we accept it in full (applying any additional feedback/patches per usual).

jpienaar commented 1 year ago

I'm very excited about this starting. There are definitely a lot of questions to answer and refine. I bias toward being able to see things and evaluate how they fit and the value they bring. Having an open venue where this initial bootstrapping can happen (while ensuring the community has visibility and/or is able to participate in design discussions) SGTM. I think there is a lot to consider, and these won't be cut whole but rather refined iteratively. Answering the questions from the community, showing the connectivity, testing, interop, etc. is part of the goal of this phase, and an incubator is a good avenue. [Care of course that this be self-contained and not "infected" by PyTorch ( 🙂 ) to avoid some of the concerns from the other thread.]

Branch SGTM as start. And I know folks here have plans to ensure transparency and involvement.

silvasean commented 1 year ago

Hey Sean, I obviously don't want to go against the community consensus here, but I do wonder if we are holding a bit of a different standard vs the rest of development that happens in torch-mlir (for which, it seems like this group is "the community"). For example, discussions about ONNX, MHLO connectivity, etc all happened here vs via LLVM side RFC.

For those examples, the stakeholders were either just Torch-MLIR or "not broadly the MLIR upstream community". This case is different, because, as far as I know, the key stakeholders are the broader upstream MLIR community. The goal is for TCP to be "for all of the MLIR ecosystem and all frontends", not just Torch-MLIR + non-upstream stakeholders.

If this were pitched as a way that did not interact with the broader community (for example, by saying "TCP + related lowerings is initially just a way for Torch-MLIR to greatly simplify its backend story, but we are designing it with an eye towards broader applicability") then it would be a different discussion. If we want to pivot that direction I'm open to that.

Overall I am fine to have the development on a branch to solve the mkdir problem in the short term, as long as we have at least a vague plan for what's going to happen. I want to avoid a situation where we say "oh, it will only be a month" and then every month extend it another month for ~forever. If we see this branch as just a PoC for a future (2022Q4) more concerted RFC I think that is fine.

But it also sounds like you would like to see some more definition, and I don't want to discount that and assume anything. I'm running behind from vacation and think I missed some back and forth. Would you mind deep linking to the question you would like to see answered?

https://discourse.llvm.org/t/rfc-proposal-for-a-high-level-ml-dialect-in-mlir/64249/163?u=_sean_silva

Especially "What are the principles that guide its design, and what are you saying “no” to?"

Personally, I think that designing something that connects well with Linalg, TOSA, and MHLO/StableHLO is probably a strong enough regularizer on the design that I'm not worried, assuming that the dialect is brought up in a concerted fashion with that goal (and tested as such). I get more worried if we are just sort of abstractly designing a new dialect and not actually doing the work to demonstrate value-add by replacing the existing components with a better solution.

Concretely, what I would like to see from Torch-MLIR's value-add perspective is that as a new TCP op is added, MHLO -> TCP -> {TOSA,Linalg} lowerings are atomically implemented and replace the independent lowerings we have now. I want to converge on a design like this where most lowerings go through MHLO (eventually StableHLO) and we use transitive links to reduce the number of conversion patterns that have to be written outbound from the torch dialect:

[Diagram: lowering paths with the torch dialect going to MHLO, then transitively through TCP to TOSA and Linalg.]
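A rough back-of-the-envelope sketch of the pattern-count argument (the op count and the dialect list below are hypothetical illustrations, not measurements of torch-mlir):

```python
# Back-of-the-envelope illustration of the "transitive links" argument.
# The dialect names mirror the diagram above; the numbers are hypothetical
# and are not counts of actual torch-mlir conversion patterns.

TORCH_BACKENDS = ["linalg", "tosa", "mhlo"]  # independent outbound lowerings today

def patterns_outbound_from_torch(num_ops: int, route_through_mhlo: bool) -> int:
    """Conversion patterns the torch dialect itself has to write and maintain."""
    if route_through_mhlo:
        # One Torch -> MHLO pattern per op; the shared MHLO -> TCP and
        # TCP -> {Linalg, TOSA} legs are written once and reused by every
        # MHLO-producing frontend, so they don't grow with the torch op count.
        return num_ops
    # Separate Torch -> Linalg, Torch -> TOSA, and Torch -> MHLO patterns per op.
    return num_ops * len(TORCH_BACKENDS)

# For a hypothetical 100 supported torch ops:
print(patterns_outbound_from_torch(100, route_through_mhlo=False))  # 300
print(patterns_outbound_from_torch(100, route_through_mhlo=True))   # 100
```

The shared MHLO -> TCP -> backend legs still have to be written and maintained somewhere, but only once for the ecosystem rather than once per frontend.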

And I'm hoping that the dynamic shape design for TCP will be "right" and when we switch MHLO to StableHLO we adopt that to make the whole system way more robust. (@burmako has already expressed strong interest in using the dynamic shape design from TCP to inform StableHLO on a longer time frame).

We might also want to swap TCP and MHLO in that diagram, since MHLO -> {Linalg,TOSA} lowerings are already existing or in development for independent reasons (in that case TCP really would only have torch inbound and MHLO outbound, and so perhaps be less connected -- what is the value-add vs. going to MHLO directly?).

sjarus commented 1 year ago

Updating this issue at year-end:

  1. As previously agreed, we have a branch mlir-tcp (https://github.com/llvm/torch-mlir/tree/mlir-tcp) here. All work to date has gone into this branch: https://github.com/llvm/torch-mlir/pull/1375, https://github.com/llvm/torch-mlir/pull/1595, https://github.com/llvm/torch-mlir/pull/1695, https://github.com/llvm/torch-mlir/pull/1726. All of these were reviewed to the same standard as the mainline content in torch-mlir, as this was our understanding.
  2. The above work contains the dialect boilerplate, the basic constructs initially posted for review on Discourse, as well as end-to-end tests. All of the dialect contents are in externals/llvm-external-projects/torch-mlir-dialects. The only contents outside of that are some of the conversion pieces - TorchToTcp, for example.
  3. Currently the build rules are configured to keep TCP compilation off the CI critical path; it is only built when enabled via a config flag (TORCH_MLIR_DIALECTS_ENABLE_TCP).

What we'd like to do next: to support both Cruise-internal development and further public alignment and contributions, we would like to merge mlir-tcp into main.

The goal remains to keep this in the same externals/llvm-external-projects path, with the same conditional compilation pathway and the same rigor, but to do away with the branch. In a timebound manner, we would then like to work out how to upstream this work out of torch-mlir so it is broadly accessible to other MLIR projects, while retaining the PyTorch-related legalizations in Torch-MLIR.

Please let us know your thoughts.

@sanjoy @sjain-stanford @navahgar @asaadaldien @stellaraccident @silvasean @jpienaar @burmako

rengolin commented 1 year ago

After having some of our changes go through IREE (the pack/unpack tensor ops), I am skeptical of this process, but I'm not against it. There may not be a better one, it seems.

The design has changed considerably from our internal prototype (we only had one op, relayout; we used affine maps; and we operated on both tensors and memrefs, so needed less lowering).

While going through IREE, we split the ops (which was a good thing), changed from affine maps to attributes (which is a matter of opinion), and we lost the memref support, which broke any hope of using the upstream ops straight away, as we had no implementation further than tensors.

This has led to us having three implementations:

With more lowerings making their way upstream, we're finally moving to those ops, but it's not a trivial exercise. IREE will have to do the same in their own repo, which multiplies the effort. And this is just for two new ops in an existing dialect that does not replace, but adds, semantics. TCP is a whole different beast.

I'm not sure the alternative (doing it fully upstream) would be much easier. I can't measure the churn that pack/unpack would have had if it had been done upstream, before we converged on a working form. I don't have enough information to say whether it would have been easier for an entire dialect, so I just leave my own experiences here, without judgement.

I think the silver lining is that you're including MHLO in the process from the start, so that it would at least be "good enough" for the two main (and reasonably different) formats (HLO, Torch).

From our side (TPP-MLIR), we'd only be able to start testing it once it's on LLVM main. I imagine any small team like ours will be the same. It's also hard to review code in torch-mlir because of the differences in style and expectations.

I imagine this will first be a round between the torch and xla folks until it is upstreamed to LLVM; then there will be another round with people not involved initially (like us), and a "second wave" of reviews and change requests. I'd like to request flexibility on this second wave if the people who didn't review the first round find important changes, rather than being stuck on the design that the original group came up with.

silvasean commented 1 year ago

I don't see the current dialect as having reached a level of design maturity to merge into main.

And purely from a timeline perspective, it has taken 3+ months to hash out the elementwise+broadcast design, which if I guesstimate correctly would put the first ResNet e2e run over a year out. We are aiming to use Torch-MLIR here as a temporary staging ground and year+ timelines do not fit that.

To give a guide on the thinking here, I would like to contrast this with the addition of MHLO support to main (RFC here). In that RFC:

bhack commented 1 year ago

I suppose that MHLO will be under the OpenXLA GitHub org governance, right?

https://github.com/openxla/xla/tree/main/xla/mlir_hlo/mhlo/IR

https://github.com/openxla/community/pull/28

sjain-stanford commented 10 months ago

Hi everyone. To close the loop on this, we have found a new home for TCP (https://github.com/cruise-automation/mlir-tcp) and are in the last few stages of completing the internal integrations. We have now paused contributions to the mlir-tcp branch of Torch-MLIR, and we should be good to archive this branch once the integrations are complete (hopefully in a week or two). Truly appreciate the support all along in bootstrapping TCP under Torch-MLIR, it definitely made bringup a ton easier :)

cc: @stellaraccident, @ramiro050, @silvasean, @sjarus, @zezhang, @navahgar, @sanjoy

silvasean commented 10 months ago

Awesome!!

sjain-stanford commented 10 months ago

We have now paused contributions to the mlir-tcp branch of Torch-MLIR, and we should be good to archive this branch once the integrations are complete (hopefully in a week or two).

I can confirm our downstream integrations are now complete. I have deleted the mlir-tcp branch (after tagging it as archive/mlir-tcp for the purpose of archiving) to prevent future commits to the branch.