microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Mobile] Model output is different in React Native compared to NodeJS #19353

Open pax-k opened 9 months ago

pax-k commented 9 months ago

Describe the issue

I'm using Xenova/all-MiniLM-L6-v2 to extract embeddings from sentences. Given this inference code, I execute it as is in both NodeJS and React Native (in RN with a slight difference in how the model is loaded).

The NodeJS outputs are good. The problem is that I get slightly different vector embeddings in React Native, using the same code.
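
For context, the embedding extraction roughly follows this shape (a minimal sketch, not the actual code from the linked repos; the input/output names, the 384-dim hidden size, and the mean pooling are assumptions based on the typical ONNX export of this model, and in React Native the import would be onnxruntime-react-native instead):

```ts
import * as ort from 'onnxruntime-node';

// Hypothetical sketch of the pipeline: in the real code the token ids come
// from the model's tokenizer rather than being passed in directly.
async function embed(modelPath: string, tokenIds: bigint[]): Promise<Float32Array> {
  const session = await ort.InferenceSession.create(modelPath);
  const seqLen = tokenIds.length;

  const feeds = {
    input_ids: new ort.Tensor('int64', BigInt64Array.from(tokenIds), [1, seqLen]),
    attention_mask: new ort.Tensor('int64', BigInt64Array.from(tokenIds.map(() => 1n)), [1, seqLen]),
    token_type_ids: new ort.Tensor('int64', BigInt64Array.from(tokenIds.map(() => 0n)), [1, seqLen]),
  };

  const outputs = await session.run(feeds);
  const hidden = outputs['last_hidden_state']; // assumed output name; dims [1, seqLen, 384]
  const data = hidden.data as Float32Array;
  const dim = 384;

  // Mean-pool over the token axis to get a single sentence embedding.
  const embedding = new Float32Array(dim);
  for (let t = 0; t < seqLen; t++) {
    for (let d = 0; d < dim; d++) embedding[d] += data[t * dim + d] / seqLen;
  }
  return embedding;
}
```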

Things to note from the inference code:

To reproduce

Setup iOS:

git clone https://github.com/pax-k/expo-onnx
cd expo-onnx
yarn
npx expo prebuild
npx pod-install
yarn ios
// copy the sortedSentences JSON object from the console

Setup NodeJS:

git clone https://github.com/pax-k/nodejs-onnx
cd nodejs-onnx
yarn
npx ts-node --esm onnx-minilm.ts
// copy the sortedSentences JSON object from the console

I combined and analyzed the 2 JSONs in this project:

Screenshot 2024-01-31 at 14 17 42

a is from NodeJS, b is from React Native.

Observations:

Urgency

It's pretty urgent

Platform

React Native

OS Version

iOS 17.0.1, iPhone 15 Pro Simulator

ONNX Runtime Installation

Released Package

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

onnxruntime-react-native

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

JavaScript

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

simonwh commented 9 months ago

Just chiming in to say I can replicate this on my machine. I hope @skottmckay or someone else from the team can give a few hints as to why this might be happening!

skottmckay commented 9 months ago

My initial thought based on looking at the json output is that this is pretty typical and expected when there are lots of floating-point operations being executed on different platforms.

If you look at the model in Netron there are many MatMul operations. The order of the individual operations will affect the exact value produced by that node. There are many additions and multiplications in a single MatMul. The order of each set of multiplications matters. The order those products are added together matters. But there's no rule about the order: mathematically a x b x c == c x b x a, yet in floating point the two orderings can produce slightly different results.

The low-level instructions used to execute the operations differ by platform/architecture (e.g. various AVX instruction sets on intel/amd, NEON on arm, etc.). These differences accumulate with each node and are magnified by nodes that do a lot of calculation (e.g. MatMul/Conv/Gemm).
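
As a toy illustration of that order dependence (not taken from the model, just plain JavaScript):

```ts
// Floating-point addition is not associative, so summing the same values in a
// different order can change the last bits of the result.
const a = 0.1, b = 0.2, c = 0.3;

console.log((a + b) + c);              // 0.6000000000000001
console.log((c + b) + a);              // 0.6
console.log((a + b) + c === (c + b) + a); // false
```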

When you say the NodeJS output is 'good' but the React Native ones aren't, how is that assessed? By the output from using the embeddings in a downstream model? Or is it that the floating-point values differ beyond some expected tolerance?

simonwh commented 9 months ago

@skottmckay

We've been comparing the vectors produced by the model loaded with sentence_transformers in Python against those from ONNX in Node, and there we see exactly the same results. That's why we say they look "correct".

We then compared with ONNX in node vs. react-native and saw widely different results for some inputs.

We understand that floating point calculations can vary slightly on different architectures, but we didn’t expect to see discrepancies this big and seemingly random.

In the comparison chart, we show the Manhattan distance between vectors produced by ONNX node vs ONNX react-native. For most, you can see that the difference is really small (< 0.001), and expected due to differences in floating point math.

However, for a few vectors you will see a huge difference, e.g.:

What could be the contributing factors to these differences? The inconsistency is a blocker for us to put this into production.
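
For reference, the comparison described above boils down to something like the following sketch (variable names and values are made up, not from the actual comparison project):

```ts
// Manhattan (L1) distance between two embeddings of equal length.
function manhattanDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  return a.reduce((sum, value, i) => sum + Math.abs(value - b[i]), 0);
}

// Placeholder vectors; in the real comparison these are the NodeJS and
// React Native embeddings for the same sentence.
const nodeEmbedding = [0.12, -0.03, 0.44];
const rnEmbedding = [0.12, -0.031, 0.441];
console.log(manhattanDistance(nodeEmbedding, rnEmbedding)); // ≈ 0.002
```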

pax-k commented 9 months ago

@skottmckay

When you say the NodeJS output is 'good' but the React Native ones aren't, how is that assessed? By the output from using the embeddings in a downstream model? Or is it that the floating-point values differ beyond some expected tolerance?

You are correct, we mean that the floating-point values differ beyond some expected tolerance.

In this table we compare ONNX embeddings in NodeJS (on the left) with ONNX embeddings in RN (on the right). Given a query and a list of sentences, we calculate the cosine similarity and sort the sentences to show the most similar first. The ranked NodeJS sentences on the left feel right, while the RN ones are different. We expected them to be identical, and the table shows how the results are affected in RN.

Screenshot 2024-02-05 at 16 21 52
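
For clarity, the ranking in the table is produced by something along these lines (a sketch with hypothetical names, not the code from the repos):

```ts
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Rank sentences by similarity to the query embedding, most similar first.
function rankSentences(
  queryEmbedding: number[],
  sentences: { text: string; embedding: number[] }[]
): { text: string; score: number }[] {
  return sentences
    .map(({ text, embedding }) => ({ text, score: cosineSimilarity(queryEmbedding, embedding) }))
    .sort((a, b) => b.score - a.score);
}
```
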
skottmckay commented 9 months ago

I would suspect that Python and NodeJS are hitting the same low-level code if the results are the same.

As an additional data point, can you run your evaluation on another platform like x64 desktop, or on an actual iOS device instead of the simulator? Alternatively, you could enable the XNNPACK EP as an alternative implementation of MatMul on CPU (vs. ORT's MLAS library, which is used by default).
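
For anyone following along, switching execution providers in the JS API goes through session options, roughly like this (a sketch; whether 'xnnpack', 'coreml', or 'nnapi' is actually available depends on the platform and on how the onnxruntime-react-native package was built, so treat the provider names as assumptions to verify against the docs):

```ts
import { InferenceSession } from 'onnxruntime-react-native';

// Sketch: pick an alternative execution provider instead of the default CPU (MLAS).
// Provider names ('xnnpack', 'coreml', 'nnapi') are platform/build dependent.
async function createSession(modelPath: string): Promise<InferenceSession> {
  return InferenceSession.create(modelPath, {
    executionProviders: ['xnnpack'], // e.g. ['coreml'] on iOS, ['nnapi'] on Android
  });
}
```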

pax-k commented 9 months ago

@skottmckay Thanks for the tip! 🙌

Good news: if I use coreml (and not cpu) as the execution provider for ONNX in React Native on iOS, then the ranked sentences match the NodeJS ones! I also tested nnapi for Android, but its results are no better than cpu's.

Hopefully we can roll a text embedding model into production soon without issues.

Thank you for your patience! 🙏🏻

skottmckay commented 9 months ago

Excellent.

If you ever want to dig really really deeply into it you can do a custom build with a flag to output the result of individual nodes to compare platforms and see how things change throughout the model.

https://onnxruntime.ai/docs/build/inferencing.html#debugnodeinputsoutputs

You could set ORT_DEBUG_NODE_IO_OP_TYPE_FILTER to limit to just the MatMul nodes.

pax-k commented 9 months ago

@skottmckay Could a custom build improve performance on iOS?

We did some benchmarks and it seems that CoreML is twice as slow as CPU on iOS (but CPU is not precise, so we can't use it).

This is a benchmark for calculating the average inference time on-device (iPhone 12 Pro) with CoreML, using Jina embeddings model:

 LOG  Model loaded successfully: jina-embeddings-v2-small-en
 LOG  Runs: 100
 LOG  Text size (chars): 1099
 LOG  Download time (s): 4.02
 LOG  Load time (s): 0.756
 LOG  Output dims: 512
 LOG  Average times (s): {
  "tokenize": 0.0025299999999999997,
  "inference": 0.21671
}

We get ~216ms inference time on iOS + CoreML, compared to:

We also tested other models, like bge-small-en-v1.5, gte-small, e5-small-v2, but jina-embeddings-v2-small-en was the fastest.

Are you aware of any tweaks we could try to improve inference times when using CoreML?

Thanks!

skottmckay commented 9 months ago

Depends on the operations in the model. It will be slower if there are unsupported operators breaking up partitions between CoreML and CPU EP. Run with the log severity level set to 'VERBOSE' (0) in the session options and look for 'Node placements' in the output.
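
In the JS API that looks roughly like the following (a sketch; logSeverityLevel is the session option for log severity in the onnxruntime JS packages, with 0 meaning verbose, but worth double-checking against the docs for your version):

```ts
import { InferenceSession } from 'onnxruntime-react-native';

// Sketch: turn on verbose logging so node placement (CoreML EP vs CPU EP)
// shows up in the output. logSeverityLevel 0 = VERBOSE.
async function createVerboseSession(modelPath: string): Promise<InferenceSession> {
  return InferenceSession.create(modelPath, {
    executionProviders: ['coreml'],
    logSeverityLevel: 0,
  });
}
```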

FWIW I'm not convinced any particular EP is more 'precise'. I think they're all producing valid output, and saying one is more precise feels a little arbitrary, based on whether you liked the results for a specific query more or less for that EP. i.e. appointing the NodeJS output as the precision baseline may be flawed. If you run a wide range of queries, the 'best' set of results for each may cycle between all the EPs you test with.