
CoreML EP inference result is improperly scaled #21170

Open frenetj opened 1 week ago

frenetj commented 1 week ago

Describe the issue

When running inference of a specific dynamic-shape image filter model using the CoreML EP, the output pixels are slightly shifted towards the bottom left of the image. Pixels at the bottom left are not shifted at all, while pixels at the top right are shifted by almost a whole pixel to the left and downwards.

I cannot reproduce the issue with small images (sizes of ~1024 pixels or less). The issue is quite apparent using a 2048x2048 colour noise image as input.

Here is the top-right portion of the input and output images:

[Image: TopRightPixels_InVsOut]

Here is the shift over the whole image (absolute difference of the input vs. output pixels). Notice the shift is present across the whole image, but is more pronounced in the top-right area:

[Image: InVsOutAbsDiff]

I will provide the specific model to Microsoft directly as it has some proprietary content.

I cannot reproduce this issue when using the native CPU execution provider on macOS. The issue is also NOT reproducible when using the CUDA or TensorRT execution providers on Linux, nor with the CoreML EP on macOS when setting the COREML_FLAG_USE_CPU_ONLY flag.

Note that I am, however, using the COREML_FLAG_ONLY_ALLOW_STATIC_INPUT_SHAPES flag. I am thus surprised to see rendering differences compared with the CPU implementation, since the model uses dynamic shapes and should therefore NOT run using CoreML.

To reproduce

On macOS, set up the CoreML EP with the COREML_FLAG_ONLY_ALLOW_STATIC_INPUT_SHAPES flag. Run inference with the given model on a 2048x2048 image. Notice that the output pixels are shifted to the left and towards the bottom of the image.
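
For reference, a minimal C API sketch of this setup (the model path, include paths, and error handling are placeholders, not the exact repro code):

```c
// Minimal sketch of the session setup described above.
// "model.onnx" and the include paths are placeholders; adjust for your build layout.
#include <stdio.h>
#include <stdlib.h>
#include "onnxruntime_c_api.h"
#include "coreml_provider_factory.h"  // OrtSessionOptionsAppendExecutionProvider_CoreML

#define CHECK(expr)                                                       \
  do {                                                                    \
    OrtStatus* status = (expr);                                           \
    if (status != NULL) {                                                 \
      fprintf(stderr, "Error: %s\n", g_ort->GetErrorMessage(status));     \
      g_ort->ReleaseStatus(status);                                       \
      exit(1);                                                            \
    }                                                                     \
  } while (0)

int main(void) {
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  OrtEnv* env = NULL;
  CHECK(g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "coreml_repro", &env));

  OrtSessionOptions* so = NULL;
  CHECK(g_ort->CreateSessionOptions(&so));

  // Only hand nodes with static input shapes to CoreML, as described above.
  CHECK(OrtSessionOptionsAppendExecutionProvider_CoreML(
      so, COREML_FLAG_ONLY_ALLOW_STATIC_INPUT_SHAPES));

  OrtSession* session = NULL;
  CHECK(g_ort->CreateSession(env, "model.onnx", so, &session));

  // ... bind the 2048x2048 input, run, and compare the output against the CPU EP ...

  g_ort->ReleaseSession(session);
  g_ort->ReleaseSessionOptions(so);
  g_ort->ReleaseEnv(env);
  return 0;
}
```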

Urgency

The issue is not urgent as we are currently using the native CPU implementation.

Platform

Mac

OS Version

Sonoma 14.5

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

C

Architecture

ARM64

Execution Provider

CoreML

Execution Provider Library Version

No response

skottmckay commented 1 week ago

COREML_FLAG_USE_CPU_ONLY results in CoreML executing the same nodes using its reference CPU implementation. We set this as the MLModelConfiguration.computeUnits. The rest of the ORT CoreML EP code runs exactly the same. That would strongly suggest an issue with the internal CoreML handling of a large input when running on GPU/NPU.

COREML_FLAG_ONLY_ALLOW_STATIC_INPUT_SHAPES is applied on a per-node basis. Parts of the model may have fixed shapes, leading to CoreML executing those nodes. If you set the session log severity to VERBOSE, it will print out details of which nodes are/aren't assigned to CoreML. That would at least narrow down which CoreML operator could be going wrong.
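
For reference, a hedged sketch of doing that with the C API (assuming `g_ort`, the `const OrtApi*`, and `so`, the `OrtSessionOptions*`, from your existing setup):

```c
// Sketch: raise the session log severity so the CoreML EP prints which nodes it takes.
// Assumes `g_ort` (const OrtApi*) and `so` (OrtSessionOptions*) from the existing setup.
OrtStatus* status = g_ort->SetSessionLogSeverityLevel(so, ORT_LOGGING_LEVEL_VERBOSE);
if (status != NULL) {
  fprintf(stderr, "Error: %s\n", g_ort->GetErrorMessage(status));
  g_ort->ReleaseStatus(status);
}
```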

skottmckay commented 1 week ago

This appears to be a CoreML NeuralNetwork-specific problem. There are only a few Div and Sub nodes assigned to CoreML, as the rest have dynamic input shapes. Most of those produce the expected output.

There are 2 Div nodes (Div_185 and Div_143) that end up doing 2 / (2048 - 1) (one for the height and one for the width). For some reason the NeuralNetwork Div is somewhat inaccurate for this floating point operation.

Python as a reference (double precision): 2.0 / 2047.0 = 0.0009770395701025891

| EP | Value name | Value |
|---|---|---|
| CPU EP | Mul_340 | 0.00097703957 |
| CoreML NeuralNetwork | Mul_340 | 0.00097751617 |
| CoreML ML Program | Mul_340 | 0.00097703957 |

That difference presumably compounds across the downstream operations in the model, leading to the output discrepancies. I would guess the root cause is floating-point inaccuracy when dividing 2 by a large number, which would also explain why smaller heights or widths don't trigger the issue.
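
To put the difference in perspective, here's a small standalone snippet (plain C, not ORT code) comparing the values from the table above; the NeuralNetwork result is off by roughly 5e-4 in relative terms, far beyond normal float32 rounding:

```c
// Standalone illustration (not ORT code): compare the values from the table above
// against the correctly rounded single- and double-precision quotients of 2 / 2047.
#include <math.h>
#include <stdio.h>

int main(void) {
  const double exact = 2.0 / 2047.0;       // 0.0009770395701025891
  const float f32 = 2.0f / 2047.0f;        // matches the CPU EP / ML Program value
  const float coreml_nn = 0.00097751617f;  // value observed from CoreML NeuralNetwork

  printf("double        : %.17g\n", exact);
  printf("float32       : %.8g\n", (double)f32);
  printf("CoreML NN     : %.8g\n", (double)coreml_nn);
  printf("relative error: %.3g (float32 epsilon is ~1.2e-7)\n",
         fabs((double)coreml_nn - exact) / exact);
  return 0;
}
```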

skottmckay commented 5 days ago

FWIW it's possible to get a good result from NeuralNetwork but the model would need to be updated and you might need some experimentation to figure out what works best.

If I first scale down the input size value (the 2047 in this case), do the Div, and then rescale the result by the same factor, it's happy. Guessing it's due to the difference in floating-point range between '2' and '2047'.

e.g. scaling the 2047 down by 1000 (arbitrarily chosen) would be a = 2047 / 1000, b = 2 / a, c = b / 1000, where the final division undoes the scaling and c comes out as the expected 2 / 2047.
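
As a quick standalone sanity check of that rescaling in float32 (plain C, not the actual graph change; whether it helps inside NeuralNetwork still needs to be verified in the model):

```c
// Standalone check of the rescaling idea in float32 (not ORT/CoreML code).
#include <stdio.h>

int main(void) {
  const float scale = 1000.0f;            // arbitrarily chosen factor, as above
  const float size_minus_one = 2047.0f;

  const float direct = 2.0f / size_minus_one;  // what the Div node computes today
  const float a = size_minus_one / scale;      // a = 2047 / 1000
  const float b = 2.0f / a;                    // b = 2 / a, operands now in a similar range
  const float c = b / scale;                   // undo the scaling -> approximately 2 / 2047

  printf("direct  : %.8g\n", (double)direct);
  printf("rescaled: %.8g\n", (double)c);
  return 0;
}
```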