microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Mobile iOS] Run fp16 onnx model on CoreML EP #17448

Open NickLucche opened 11 months ago

NickLucche commented 11 months ago

Describe the issue

Hey, I am trying to run an fp16-quantized ONNX model (converted with https://github.com/microsoft/onnxconverter-common/tree/master) on iOS with the CoreML EP, figuring I could save some space since inference will run at half precision on the ANE anyway. Unfortunately, none of the nodes are assigned to the CoreML EP because of this check https://github.com/microsoft/onnxruntime/blob/0a3eb60b017f2a7d691f0a3ce155f42a59d63b6c/onnxruntime/core/providers/coreml/builders/impl/base_op_builder.cc#L115C21-L115C63, since my tensors are all fp16 now.
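For reference, the conversion was done along these lines with the onnxconverter-common float16 pass (a minimal sketch; file names and the keep_io_types choice are placeholders):

```python
import onnx
from onnxconverter_common import float16

# Convert fp32 initializers and ops to fp16. keep_io_types=True leaves the
# graph inputs/outputs as fp32, which helps if parts of the graph fall back
# to another EP; set it to False for a fully fp16 graph.
model_fp32 = onnx.load("model_fp32.onnx")  # placeholder file name
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "model_fp16.onnx")
```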

Excerpts of the verbose log:

[I:onnxruntime:, coreml_execution_provider.cc:93 GetCapability] CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 0 number of nodes in the graph: 184 number of nodes supported by CoreML: 0

...
[V:onnxruntime:, base_op_builder.cc:122 HasSupportedInputsImpl] [Conv] Input type: [10] is not supported for now
[V:onnxruntime:, base_op_builder.cc:122 HasSupportedInputsImpl] [Mul] Input type: [10] is not supported for now
...

Is there any way to get around this? Is this something that could be supported in the future, and if so, would PRs be welcome?

To reproduce

See above.
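A rough sketch of the session setup that produces the log above (shown with the Python API on a build that includes the CoreML EP, e.g. macOS, purely for brevity; the C++ API shows the same GetCapability output). Note that element type [10] in the log is ONNX TensorProto.FLOAT16.

```python
import onnxruntime as ort

# Verbose logging surfaces the GetCapability / HasSupportedInputsImpl messages.
so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE

# "model_fp16.onnx" is the converted model from above (placeholder name).
sess = ort.InferenceSession(
    "model_fp16.onnx",
    sess_options=so,
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())
```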

Urgency

No response

Platform

iOS

OS Version

[V:onnxruntime:, helper.cc:115 HasNeuralEngine] Current Apple hardware info: iPad13,8

ONNX Runtime Installation

Released Package

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

None

ONNX Runtime Version or Commit ID

1.15.1

ONNX Runtime API

C++/C

Architecture

ARM64

Execution Provider

CoreML

Execution Provider Library Version

No response

skottmckay commented 11 months ago

It's not currently supported by the CoreML EP. Do you have a production scenario where this would be required?

One big consideration is that if CoreML is not available, an fp16 model will perform badly on the fallback CPU execution provider. Quantizing to 8-bit will be more flexible if that is an option: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html
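For completeness, a minimal dynamic-quantization sketch with onnxruntime's quantization tooling (file names are placeholders; static quantization with calibration data is also covered in the linked docs):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights to 8-bit; activations are quantized dynamically at
# runtime. File names are placeholders.
quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QUInt8,
)
```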

NickLucche commented 11 months ago

Do you have a production scenario where this would be required?

Yep, I am generally trying to get the best inference time and the lowest storage/memory footprint for the model, assuming the device has an ANE (I don't care so much about fallback performance right now). Preliminary benchmarks show that running my model with CoreML support is ~10 times faster than CPU with the default EP. Using CoreML directly without onnxruntime I can (obviously) achieve the same inference speed, but also get a smaller model, as an fp16 .mlpackage. But then I would have to maintain the Objective-C interfaces.

Quantizing to 8-bit will be more flexible if that is an option.

Yes, definitely. I've tried that, and while the performance boost is there, it still doesn't match the accelerator (with CoreML).

Just wondering, but is there a particular obstacle to supporting loading of fp16 weights from the proto, aside from the poor performance on the fallback CPU?

NickLucche commented 11 months ago

To quickly add on to that: the fp16 model running on the fallback provider isn't performing much worse than the fp32 one (<10% slower) on this iPad, which wouldn't be a problem at all for my use case of optimizing for the ANE.
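For context, that kind of CPU-only comparison can be reproduced with a simple wall-clock measurement along these lines (a rough sketch; model paths, input name and shape are placeholders, and if the fp16 model was converted with keep_io_types=True both models take fp32 input):

```python
import time
import numpy as np
import onnxruntime as ort

def bench(model_path, feed, n=50):
    # CPU-only session to mimic the fallback path.
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    sess.run(None, feed)  # warm-up
    start = time.perf_counter()
    for _ in range(n):
        sess.run(None, feed)
    return (time.perf_counter() - start) / n

# Placeholder input; dtype and shape must match the actual model.
x32 = np.random.rand(1, 3, 224, 224).astype(np.float32)
x16 = x32.astype(np.float16)

print("fp32:", bench("model_fp32.onnx", {"input": x32}))
print("fp16:", bench("model_fp16.onnx", {"input": x16}))
```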

skottmckay commented 11 months ago

Just wondering, but is there a particular obstacle in the support of loading fp16 weights from the proto aside from the poor perfomance on fallback cpu?

The obstacle is cost/benefit. It costs developer time to add/test/maintain support, and when there are no known production scenarios where it would provide a benefit, there's no reason to pay the cost.

That said, usage evolves and there may be more desire these days to use fp16 as a balance between size and accuracy, so we can certainly look at adding support.

I'd like to understand the benefit of using onnxruntime in your scenario though because IIUC you only want to run the model on iOS with ANE. If someone else only wants to run their model on ANE (vs. having an onnx model that can run on multiple platforms like Android and web), why would they choose onnxruntime over direct usage of CoreML?

NickLucche commented 11 months ago

The benefit on my side would be the ability to use a single library to deploy the same C++ application to Windows/iOS/Android through a single API, without trading off much performance on each platform. So, if you like, that would be development speed.

But I am really not here to push my use case; I have huge respect for your product and the effort you put into it. I was really just interested in the technical side, as I'd be happy to see ORT able to leverage all accelerators to the best of their capabilities.

skottmckay commented 11 months ago

Ah ok - that makes sense vs. a pure ANE-only scenario. That would totally be ORT's strength and a scenario where we offer clear benefits over direct usage of CoreML.

FWIW we also have a lot of pre/post processing steps that can be added to the model to further reduce the platform specific code required.

https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/Example%20usage%20of%20the%20PrePostProcessor.md

There are some pre-defined usages in https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/add_pre_post_processing_to_model.py and examples in https://github.com/microsoft/onnxruntime-extensions/blob/main/tutorials/ (see *_e2e.py files).

It's a modular setup so you can pick and choose the pre/post processing steps required for your model to simplify getting an image, text, or audio into the input format the model requires, and also make the model output more easily consumable (e.g. select best bounding boxes for image recognition and draw them on the original image).
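As a very rough sketch of what that modular setup looks like for an image model (based on the linked PrePostProcessor docs; the exact step names, arguments and ordering below are approximations and should be checked against those docs rather than taken from this snippet):

```python
import onnx
# The docs import the steps and helpers with a star import from this module.
from onnxruntime_extensions.tools.pre_post_processing import *

# Model takes raw encoded image bytes as input and returns class probabilities.
model = onnx.load("model.onnx")  # placeholder
inputs = [create_named_value("image_bytes", onnx.TensorProto.UINT8, ["num_bytes"])]

pipeline = PrePostProcessor(inputs, onnx_opset=16)
pipeline.add_pre_processing([
    ConvertImageToBGR(),            # decode jpg/png bytes to an HWC image
    Resize(256),
    CenterCrop(224, 224),
    ChannelsLastToChannelsFirst(),  # HWC -> CHW for an NCHW model
    ImageBytesToFloat(),            # uint8 [0, 255] -> float [0, 1]
    Unsqueeze([0]),                 # add batch dimension
])
pipeline.add_post_processing([Softmax()])

new_model = pipeline.run(model)
onnx.save(new_model, "model.with_pre_post.onnx")  # placeholder
```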

NickLucche commented 11 months ago

Thanks a lot for your time. I was aware of the library but hadn't given it a go yet; I can now see how portability would benefit from factoring the pipeline into common operations.