[RFC] Contributing an ONNX to TOSA legalization pass to onnx-mlir

sjarus commented 2 years ago

Hello, I work in the ML Technology group at Arm and am the principal developer of the TOSA dialect in MLIR. The TOSA spec is here. We also contributed legalizations from TensorFlow / TensorFlow Lite and more recently PyTorch to TOSA. There also exists legalizations from TOSA to LinAlg, Standard and SCF, enabling full end to end runs – we recently ran the ResNet18 network from torchvision through the Torch-MLIR e2e stack in this manner, in under two months from start of work. The TensorFlow/TensorFlow Lite legalizations enable dozens of real world networks to run, and are used at Arm and elsewhere, e.g. in the Google IREE compiler.

We would like to propose adding legalizations for ONNX to TOSA similar to the above. A general picture of this proposal is:

It would be a pass sitting within the existing onnx-mlir codebase, e.g. src/Conversion/ONNXToTOSA/ (there is a working prototype of this)
It would be community developed, and we welcome interest from the onnx-mlir community in being part of this effort as well.
The development is intended to leverage common TOSA construction APIs that are used by both the TensorFlow and PyTorch paths. It enables a common pathway for legalizations and supporting artifacts like shape inference.

Benefits to MLIR and ONNX-MLIR community:

Integration between ONNX and ongoing e2e MLIR development efforts.
Accessibility of ONNX content to multiple MLIR based projects, in addition to existing access to TensorFlow and PyTorch content.

Potential concerns:

Versioning between ONNX and TOSA, both of which are versioned op sets. There's ongoing conversation related to versioning in MLIR that might help address this.
Testability. Beyond lit testing, we have internal random test generators to validate legalizations from TensorFlow/TensorFlow Lite as well as PyTorch. They could potentially be extended to compare bit-level output from an ONNX runtime against TOSA reference model.

Copying some principal participants in the conversation on this so far: @stephenneuendorffer, @stellaraccident, @AlexandreEichenberger

stephenneuendorffer commented 2 years ago

I'm really excited to see this. This is something that Xilinx/AMD should be able to help build out.

stellaraccident commented 2 years ago

I'm really excited to see this as well. Better connections here have long been on my wish list and seemed like gaps we could solve and make things better connect for everyone.

rsuderman commented 2 years ago

It would be awesome to see this land. ONNX has done an admirable amount of work in defragmenting the ML space and having it target a hardware / codegen motivated IR would be an asset to the entire community.

AlexandreEichenberger commented 2 years ago

Hi @sjarus

Thanks for starting this discussion.

We have thought of ways to interface with MLIR in different ways, right now we do that at the Affine/SCF/Vector... level. An approach build on TOSA may indeed provide another path to optimized code. We will need a bit of time to familiarize ourselves with TOSA and learn more about your thoughts on how to use ONNX-MLIR.

At a high level, such an integration can be pretty smooth. I believe a pass that moves ONNX ops to TOSA ops would work well. We already have something like this internally, where ops that are suitable for a different optimization path can be plucked out from ONNX and then follow their separate optimization path.

I am generally familiar with TOSA and there are definite overlaps between ONNX and TOSA. One thing with ONNX is that operations are at time purposefully defined in a very generic way. To give you an example, AveragePooling and Conv are defined as n-dimensional operations. Others have quirks, for example AveragePool output side can be computed using ceiling or floor functions. There are also many operations supporting broadcasting. So it would be useful to determine upfront what would the strategy be when ONNX operations present cases not directly supported by TOSA. When using TOSA, do you intend it to be TOSA exclusively, or TOSA and our current path for the unsupported operations.

I could not tell by a quick scanning of the TOSA document, but I assume that "runtime" dimensions are also supported. We developed some support for this in ONNX-MLIR where shapes that are statically defined are detected early during shape inference, and we use the same code during lowering to introduce runtime computations in the instruction stream when values are only available at runtime. ONNX is full of potentially runtime arguments that need computations, for example an index in the range [-d,...,0,...,d-1] where positive values are offset from the most significant dimension (left) and negative values are offset from the least significant dimension (right) of n-dimensional tensors.

Note that ONNX-MLIR is a driver that compiles ONNX graph, lower it to a library (static or dynamic). We have a very light runtime to handle tensors (building tensors and list of tensors). We have also a C/C++/Python/Java interface for the runtime. Runtime includes support to compile a graph and execute it. Our reliance on this thin layer to handle inputs and outputs may have a significant impact on using TOSA, especially of TOSA is not exclusively used.

On your potential concerns.

Versioning: ONNX-MLIR uses its own copy of MLIR/LLVM which is periodically updated. So problems would be exposed if TOSA deprecate an operation, and we would be merging toward a newer MLIR/LLVM version. The frequency of the updates varies, we try to update our LLVM once or twice a month, typically on a need basis. The cost of divergence between the current MLIR/LLVM and the one used by ONNX-MLIR is paid when updating our version of MLIR/LLVM.
Updates on ONNX ops often requires some work on us, especially when new features on existing ops are added. Our general strategy is to limit the number of redundant implementation of similar ops. For examples, we have seen parameters that used to be compile time constant becoming runtime variable. In such case, we try to migrate our internal support of that op to support the runtime version, migrate a constant parameter into the runtime input, and possibly special case the lowering if beneficial to know that an operation has a constant input. New ONNX ops does not cost anything when updating the standard, except that eventually they should be implemented. We currently support about 80% of the ops (many do not exist in tests and are not used in the current model zoos). We update our ONNX level rarely, maybe once or twice a year. Right now we are about 3 version back.
Testability. We have invested quite a bit in testability, because when optimizing opts with many parameters, we simply don't trust ourselves to just get it right. So we have lit tests, we have the ONNX tests (which requires building ONNX and python), we have C++ tests that use rapidcheck to generate test cases with random input parameters and verify correctness against a reference implementation, and finally we are adding performance tests that measure performance and compare current recorded performance.
Since onnx-mlir is a driver, all of the tests that we use are tweaked by options (such as -O level, machine target/arch/cpu,...). So I assume that the invocation of the TOSA optimization would be triggered by an option. So essentially our infrastructure would work right away. Our current CIs require success on x86(Windows, Linux, OSX optional), POWER (optional), and LOZ (Z servers using Linux, we may add ZOS in the future). Our tests verify the DEBUG and RELEASE version, and include runtime tests where inputs have compile time shapes, runtime shapes, and constant input values (as the compiled code is different for each of these 3 cases). We are open to add additional targets if they are supported by its proponents.

Thanks again for starting a discussion.

AlexandreEichenberger commented 2 years ago

For ref tosa_spec.pdf

sjarus commented 2 years ago

Thanks for the detailed response @AlexandreEichenberger!

An approach build on TOSA may indeed provide another path to optimized code.

Yes it can indeed be an option - when we implemented enough of a critical mass of TorchToTosa, we simply leveraged TosaToLinAlg (primarily authored by @rsuderman here) and it 'just worked' e2e. TosaToSomethingElse can also be done for a backend that needs it.

One thing with ONNX is that operations are at time purposefully defined in a very generic way. To give you an example, AveragePooling and Conv are defined as n-dimensional operations.

This is also seen in PyTorch, which isn't surprising. Our expectation is that for 1D variants it should be just about expressing the kernel dims accordingly. For >2D, it's a little trickier but potentially doable with reduce_sum along channel-dim. Relative to 2D, it's not the common case.

For the AvgPool floor/ceil mode example, I actually have a todo for ceil_mode in the PyTorch equivalent: https://github.com/llvm/torch-mlir/blob/main/lib/Conversion/TorchToTosa/TorchToTosa.cpp#L2581 , but it can be done fairly easily in that construct by passing the ceil_mode into getOutputDim . There's also code in TorchToTosa to handle calculation of -ve dims in parameters.

I expect that @stephenneuendorffer will find a lot of the existing TorchToTosa code readily usable for ONNX->TOSA as well. I currently have an RFC to move the common TOSA constructor code into MLIR core beside the dialect: https://discourse.llvm.org/t/rfc-moving-tosa-construction-lib-into-mlir-core/4961 . The legalize_utils/legalize_common files on the TensorFlow side and TosaLegalizeCommon/LegalizeUtils code on the Torch side are identical - they'll be just as useful for ONNX legalization and of course the support capabilities around dynamic shape or common carried capabilities more easily.

I could not tell by a quick scanning of the TOSA document, but I assume that "runtime" dimensions are also supported.

Yes we have a lot of ongoing work around this. @stellaraccident started a conversation on the TOSA discourse about it: https://discuss.mlplatform.org/t/extending-tosa-to-support-symbolic-shapes/82 . We would like to expand the envelope of dynamic shape within TOSA using what exists or is being developed within MLIR, to serve TOSA consumers' needs. This is an active ongoing area of work.

When using TOSA, do you intend it to be TOSA exclusively, or TOSA and our current path for the unsupported operations.

It's expected that not everything can be supported because some ops are beyond TOSA's ability to express, e.g. NonMaxSuppression is a commonly quoted example. TOSA has a custom op to enable this path - this is in fact a recent conversation that I intend to take to the public discourse for feedback and agreement.

Solutions to these in TOSA would ideally be available to all ingress paths - with this proposal including ONNX as well.

Regarding versioning, indeed this is challenging. This also affects other projects like Torch-MLIR and there's scope to develop a common MLIR based approach here that solves these problems for the community.

Thank you for the information on testing. I'm really interested in how the ONNX runtime is invoked and interacts with onnx-mlir. Any documentation on this would be very helpful.

AlexandreEichenberger commented 2 years ago

Lots of information, at some time we will need to go in depth step by steps.

On your last question, the ONNX runtime, which is really just creating/deleting/querying of OMTensor and OMTensorList, is described here

http://onnx.ai/onnx-mlir/doxygen_html/OMTensor/_o_m_tensor_8h.html
http://onnx.ai/onnx-mlir/doxygen_html/OMTensorList/_o_m_tensor_list_8h.html The actual header files are located in the onnx-mlir/include subdirectories The actual implementation files are located in the onnx-mlir/src/Runtime subdirectories

Note that we only expose the interface (in C to accommodate python/java interfaces), not the implementation. However, the onnx-mlir compiler does generate calls to these interfaces as well to manage inputs/outputs to the model. These are located in the KrnlToLLVM conversion, I believe (I am not too familiar with this step, others are).

What support does TOSA use. In the MLIR IREE presentation last week, there were references to a "generic" implementation for tensors and lists.

caoimhinuibrian commented 2 years ago

This looks like an interesting project.

sjarus commented 2 years ago

What support does TOSA use. In the MLIR IREE presentation last week, there were references to a "generic" implementation for tensors and lists.

This is a question with multiple answers.

First, TOSA has the reference model described earlier. This is written to leverage Eigen, and can be fundamentally viewed as a functional non-performant runtime. We have a pass (https://git.mlplatform.org/tosa/tosa_mlir_translator.git) that converts TOSA MLIR to flatbuffers serialized form. The reference model consumes this form, and is a simple graph executor.

This is backed internally by a unit test generator that emits 20000+ unit tests for 100+ legalizations each in TensorFlow & TensorFlow Lite , and a fewer number for PyTorch - TOSA (which only began development in Nov 2021). One of the reasons to learn about the ONNX runtime interface is to potentially look at leveraging this test setup for ONNX->TOSA as well. We are also working towards getting these unit test harnesses into their corresponding frontend CI flows.

This isn't end to end, since the TOSA form is the last stop. However, it enables critical golden reference / conformance testing of the TOSA form that benefits both e2e sw stack development and the validation of TOSA hardware design - both being active use cases for Arm.

Second, there's the full e2e path, where TensorFlow or PyTorch (or ONNX through this proposal) can go through TOSA, LinAlg, and subsequently LLVM based codegen. The torch-mlir e2e suite leverages the TOSA->LinAlg path in this manner. This is where the active existing connectivity into the MLIR ecosystem lets a full stack come together quickly once pieces are in place.

The presentation last week attempts to clarify the e2e picture and address missing pieces in the existing torch-mlir path and based on the IREE team's own experience - @stellaraccident is best placed to describe these in a little more detail, but this week's ODM is also intended to continue this conversation.

Third, there are interesting novel paths consuming TOSA, e.g. through the EmitC dialect as described in this thread: https://discourse.llvm.org/t/tosa-reference-model-from-mlir-using-emitc/4799 . @marbre is working on this and it offers an interesting pathway e.g. for constrained baremetal runtimes.

etiotto commented 2 years ago

So far we lower the ONNX dialect to our own internal Krnl dialect. Adding integration with TOSA might allow us to lower ONNX operators to TOSA without impacting/breaking the current support we have. I hope that over time we will be able to either lower certain ONNX operator to TOSA, and other to Krnl, or lower some Krnl operators to (a combination of TOSA operators). I think the choice will be dictated by the level of performance we might be able to achieve (assuming TOSA operators are optimized already).

Overall I think this proposal looks interesting. It certainly will foster closer collaboration in our communities, and will help consolidate the ecosystem a bit.

AlexandreEichenberger commented 2 years ago

@sjarus

Trying to rephrase your answer to make sure I understood well.

For TOSA, you have to ways to implement it via MLIR.

One way is to emit for TOSA operations calls to Eigen (or similar library).
Another way is to lower the TOSA ops to lower MLIR ops (linalg, affine, sci, standard, vector,...). Is that correct? If so, which solution are you interested in? 1, 2, or a combination of 1 & 2? It may have implications on what needs to be done in onnx-mlir.

I come back to my very first question: what do you do when an op in ONNX is not supported?

If we have a hybrid model, where some ops go to TOSA and others go to ONNX-MLIR, then we need to ensure that there is no issue with our thin runtime. Input and output tensors to a model must be of the OMTensor and OMTensorList at this time. It's a wrapper that write the dimensions of the tensor in addition to the actual data. Actual data within the model are simply allocated arrays accessed by memref load/store at the MLIR level, eventually lowered to what LLVM does with arrays to go through Opt and LLVM passes.

sjarus commented 2 years ago

Is that correct? If so, which solution are you interested in? 1, 2, or a combination of 1 & 2? It may have implications on what needs to be done in onnx-mlir.

Both. 2 is more easily understood, but I'd like to describe how 1 can be useful: let's say an accelerator or FPGA is being developed to be TOSA compatible. The TOSA reference model becomes a golden reference in this context, enabling random test and validation of the in-development hardware. Further, it affords a validation midpoint for backend software development, to ensure the final codegen is correct. This also offers value by addressing (a large part of) the frontend stack development cost of such hardware.

what do you do when an op in ONNX is not supported?

The tosa.custom op is intended to enable this - it affords a path to express non-TOSA content. However this is WIP - I had some early consultations and am working on a proper RFC soon. I'll include your feedback and would like to invite your feedback then - we have essentially the same question for PyTorch and TensorFlow as well.

Expanding a little further including @etiotto's comment:

I hope that over time we will be able to either lower certain ONNX operator to TOSA, and other to Krnl, or lower some Krnl operators to (a combination of TOSA operators).

The high level idea is to have two paths indeed. For example:

Common case frontend ops -> TOSA -> one-level-down dialect (e.g. Krnl or Linalg)
Unsupported frontend ops -> one-level-down dialect

The key here - which I'll cover in more detail in an RFC on MLIR discourse - is that the unsupported ops move into the body of tosa.custom as a region during the first step - with the sequencing of passes, the end result is that all content is then the one-level-down dialect as the two paths come together.

The benefit of TOSA here is twofold - a) ability to largely leverage multiple input frameworks through TOSA, with a single investment in the TOSA -> one-level-down dialect . For LinAlg, this is enabled by the TosaToLinalg pass. b) ability to focus on optimizing the common case path. An explicit hardware design isn't strictly necessary - TOSA can serve as proxy for the common case sequences to optimize codegen for.

TOSA can fully express a large number of real world networks, e.g, multiple versions of Inception, MobileNets, ResNet, ResNext, DeepSpeech, EDSR, FSRCNN, YOLO, Wav2Letter, BERT and several others. Our test suite is ~50 such networks, both fp and quantized int. Legalizations are added as needed - if a new network has an op that can be expressed then it's added, and the reference model gives us proof of correctness to proceed to performance optimization.

AlexandreEichenberger commented 2 years ago

@sjarus are you working on a prototype and/or PR? Would you like to chat to iron remaining issues that you may have? Let us know.

sjarus commented 2 years ago

@AlexandreEichenberger yes I am! I recently rebased against existing onnx-mlir top of tree and was able to get the previously implemented proof of concept functionality to compile and run.

My current plan is to clean this up and make a PR. I also had a brief coordination conversation a couple of weeks ago with a couple of other community members who are interested in driving this effort, to sync with their timelines. I've implicitly aligned my own priority on this task to their expected timelines, but I'll have a PR out very soon.

AlexandreEichenberger commented 2 years ago

Great news, and maybe once your PR lands, we can schedule a conversation if the PR is large.

sjarus commented 2 years ago

Yes let me get a sense of the size of the PR after rebasing against the current head, and we can discuss.

Perhaps it makes sense to do an initial PR of just the build integration with a skeleton legalization, and then add real legalizations to that separately.

AlexandreEichenberger commented 2 years ago

Smaller PRs are welcome, if it results in a lot more work on your side, we are flexible.

sjarus commented 2 years ago

PR is in https://github.com/onnx/onnx-mlir/pull/1379 . I've attempted to keep it to a manageable size focused on build integration.

onnx / onnx-mlir

[RFC] Contributing an ONNX to TOSA legalization pass to onnx-mlir #1194