iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Does IREE support custom ML accelerators? #6720

Closed: uccoder-hk closed this issue 3 years ago

uccoder-hk commented 3 years ago

Hello,

Though IREE aims to support mobile & embedded systems, it currently seems to target only CPUs & GPUs. Is IREE planning to support the custom ML accelerators that are shipping with mobile SoCs?

benvanik commented 3 years ago

We (the IREE team) aren't working on any custom discrete ML accelerator support at this time, but the system is designed to support more progressive devices as they start to exist. It's possible for anyone to plug in a new compiler target that lowers the MLIR linalg dialect to their hardware and add a runtime HAL backend that communicates with the device APIs. The bigger issue is that most ML accelerators are quite restrictive, and targeting them with anything but the simplest of high-level op compilers is extremely difficult. We're happy to aid someone interested in taking on the linalg->accelerator path, similar to how we have linalg->spirv/llvm(->cpu/cuda/rocm)/etc, with the caveat that in most cases where it's possible to build such a path, one of those existing paths can likely be used instead.
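To make that concrete, here's a very rough C++ skeleton of what the compiler side of such a target looks like. The class, method, and pass names below are purely illustrative and are not IREE's actual TargetBackend interface (the real targets live under iree/compiler/Dialect/HAL/Target/), but the responsibilities are the same for every backend:

```cpp
// Illustrative skeleton only - these names are hypothetical and do not match
// IREE's real TargetBackend interface; see the existing targets under
// iree/compiler/Dialect/HAL/Target/ for the actual API.
#include <cstdint>
#include <vector>

#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Support/LogicalResult.h"

namespace myaccel {

class MyAccelTargetBackend {
 public:
  // Name used to select this backend from the compiler command line.
  static constexpr const char kName[] = "myaccel";

  // 1) Lower from linalg (what IREE hands every backend) down to whatever the
  //    accelerator toolchain consumes: a custom dialect, LLVM IR, or a vendor
  //    ISA.
  void buildLoweringPipeline(mlir::OpPassManager &pm) {
    // pm.addPass(createConvertLinalgToMyAccelPass());   // hypothetical pass
    // pm.addPass(createMyAccelTileAndBufferizePass());  // hypothetical pass
  }

  // 2) Serialize the compiled kernels into a binary blob that gets embedded in
  //    the IREE module, keyed by this backend's name.
  mlir::LogicalResult serializeExecutable(mlir::ModuleOp executableModule,
                                          std::vector<uint8_t> &binaryOut) {
    // Invoke the vendor assembler/linker here and fill binaryOut.
    (void)executableModule;
    (void)binaryOut;
    return mlir::success();
  }

  // 3) A matching runtime HAL driver then loads that blob and implements
  //    buffer allocation, transfer, and dispatch against the device APIs.
};

}  // namespace myaccel
```

The split is the same one the existing CPU and GPU paths use: the compiler target owns lowering and serialization, the runtime HAL driver owns buffers/queues/dispatch, and the rest of the stack doesn't need to know the device exists.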

For background, given the fixed-function, limited scope of most of these devices and their proprietary compiler/driver stacks, we believe (and are working to ensure) that most of what is serviced by discrete ML accelerators will morph into functional units/instructions attached to or embedded within existing hardware (CPUs with more flexible vector instructions, GPUs with cooperative matrix/tensorcore-like units, etc.) - or that the accelerators will gain general-purpose compute and blur the lines from that direction (like modern GPUs). Just as the thought of a dedicated floating-point coprocessor like the x87 - or a dedicated fixed-function GPU attached to a mobile SoC instead of being on-die and programmable - seems bizarre nowadays, we believe dedicated fixed-function ML coprocessors will come to seem just as bizarre. In many ways ML systems design (both hardware and software) is lagging behind GPUs by 10-15 years, and we're too impatient to wait that long for it to catch up :)

That doesn't mean there aren't pragmatic reasons to support some of the solutions shipping today, but it's not something we find ourselves benefiting from in the models that are driving IREE's development: usually complex streaming models with low latency requirements. In these cases, when even a single type of operation (sometimes as simple as elementwise arithmetic!) can't run on one of these inflexible devices, an additional host<->device round trip is required, which can eliminate the entire performance/power benefit of the device. Big batched workloads of older/specialized model architectures that fit within the constraints of these devices may have more favorable cost/benefit tradeoffs, but today we direct people towards existing solutions like XLA+TPU that are almost exclusively good at those workloads. In time something more general-purpose like IREE will subsume those workloads too, but since there are workable solutions there today, we are focused instead on where there are gaps. Recently we've been finding that people are even starting to run more flexible frameworks on server CPUs, even when the specialized hardware is available, because the performance difference does not outweigh the model architecture restrictions. Our goal in the short term is to be fast enough that it rarely makes sense for someone looking to ship ML at small to medium scale to spend the engineering years required to work within the more inflexible systems. Fun times :)

stellaraccident commented 3 years ago

I am not aware of any public/released projects for such parts. Various NDAs keep us from talking about non public/released projects. Sorry to be vague, but this is all I can say at this point.

uccoder-hk commented 3 years ago

Thank you for your response.

Maybe I framed the question poorly, sorry for that. The intention was to find out whether the IREE design allows porting it to custom ML accelerators in the future, or whether it is restricted to CPU & GPU only. Basically, I wanted to know whether there is anything in IREE similar to TVM's BYOC approach.

From @benvanik's answer, it seems to be possible: "It's possible for anyone to plug in a new compiler target that lowers the MLIR linalg dialect to their hardware and add a runtime HAL backend that communicates with the device APIs."

powderluv commented 3 years ago

We at Nod.ai (https://nod.ai) do have MLIR/IREE support for ML accelerators in production. Similar to @stellaraccident above, we can't talk about unreleased accelerators, but we do have a prototype port of IREE to the open-source HammerBlade (https://github.com/bespoke-silicon-group/bsg_bladerunner), which is a systolic array of RISC-V cores. The PoC is here: https://github.com/spaceotter/iree/blob/28af5247f37708d558a69faf4393c03c93057301/iree/compiler/Dialect/HAL/Target/HammerBlade/HBTarget.cpp#L302 . It is not complete and requires some more changes to the SoC / APIs; we can push that up to https://github.com/NodLabs/bsg_manycore if there is interest.
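To give a rough picture of the runtime side that still needs wiring up, here is a tiny self-contained sketch of the operations a HAL backend ultimately has to map onto the device driver: allocate, upload, dispatch, and download. Everything here is hypothetical and the "device" below is just host memory so the sketch actually runs; a real port would replace these stubs with the accelerator's own driver calls.

```cpp
// Hypothetical sketch of a device shim a HAL backend would wrap - not IREE's
// real C HAL API. The "device" is plain host memory so this compiles and runs
// anywhere.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Stand-in for a device-visible buffer handle.
struct AccelBuffer {
  std::vector<uint8_t> storage;
};

class AccelDeviceShim {
 public:
  // Allocate device-visible memory for a tensor.
  AccelBuffer Allocate(size_t size) {
    return AccelBuffer{std::vector<uint8_t>(size)};
  }

  // Copy host data to the device (a DMA in a real driver).
  void Upload(AccelBuffer &dst, const void *src, size_t size) {
    std::memcpy(dst.storage.data(), src, size);
  }

  // Launch one dispatch of a compiled kernel. A real backend would hand the
  // serialized executable from the compiler target to the device here; this
  // stub just adds two float buffers elementwise on the host.
  void DispatchAddF32(AccelBuffer &out, const AccelBuffer &a,
                      const AccelBuffer &b, size_t count) {
    std::vector<float> x(count), y(count), o(count);
    std::memcpy(x.data(), a.storage.data(), count * sizeof(float));
    std::memcpy(y.data(), b.storage.data(), count * sizeof(float));
    for (size_t i = 0; i < count; ++i) o[i] = x[i] + y[i];
    std::memcpy(out.storage.data(), o.data(), count * sizeof(float));
  }

  // Copy results back to the host (and, in a real driver, wait for the queue).
  void Download(void *dst, const AccelBuffer &src, size_t size) {
    std::memcpy(dst, src.storage.data(), size);
  }
};

int main() {
  AccelDeviceShim device;
  float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4] = {};
  AccelBuffer da = device.Allocate(sizeof(a));
  AccelBuffer db = device.Allocate(sizeof(b));
  AccelBuffer dout = device.Allocate(sizeof(out));
  device.Upload(da, a, sizeof(a));
  device.Upload(db, b, sizeof(b));
  device.DispatchAddF32(dout, da, db, 4);
  device.Download(out, dout, sizeof(out));
  for (float v : out) std::printf("%g ", v);  // prints: 11 22 33 44
  std::printf("\n");
  return 0;
}
```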

If you decide to run that code and need help feel free to ping with any questions.

cycheng commented 3 years ago

Thanks for sharing!

uccoder-hk commented 3 years ago

Thanks for sharing...