microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Web] 1.20.0 breaks SkipSimplifiedLayerNormalization backwards compatibility. Missing Input: model.layers.0.input_layernorm.weight #22704

Open xenova opened 3 weeks ago

xenova commented 3 weeks ago

Describe the issue

After upgrading to onnxruntime-node 1.20.0, I get the following error when trying to run models that were previously exported (and working) with earlier versions of onnx/onnxruntime:

Non-zero status code returned while running SkipSimplifiedLayerNormalization node. Name:'/model/layers.0/post_attention_layernorm/SkipLayerNorm' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/op_kernel_context.h:42 const T* onnxruntime::OpKernelContext::Input(int) const [with T = onnxruntime::Tensor] Missing Input: model.layers.0.input_layernorm.weight
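For reference, a minimal sketch of the call that triggers this. The model path, input name, and feed values below are placeholders; the real transformers.js exports take more inputs (attention_mask, position_ids, past_key_values, ...).

```js
// Minimal sketch, not the exact transformers.js code.
const ort = require('onnxruntime-node');

async function main() {
  // Placeholder fp16 model path
  const session = await ort.InferenceSession.create('model_fp16.onnx', {
    executionProviders: ['cpu'],
  });

  // int64 inputs must be backed by a BigInt64Array
  const inputIds = new ort.Tensor('int64', BigInt64Array.from([1n, 2n, 3n]), [1, 3]);

  // On 1.20.0 this throws the SkipSimplifiedLayerNormalization error above;
  // the same model runs fine on earlier releases.
  const results = await session.run({ input_ids: inputIds });
  console.log(Object.keys(results));
}

main().catch(console.error);
```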

To reproduce

Attempt to run one of the following models:

Urgency

Blocks upgrading transformers.js to use onnxruntime-node v1.20.0

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.20.0

Execution Provider

'wasm'/'cpu' (WebAssembly CPU)

xenova commented 3 weeks ago

It also breaks for WASM EP (WebGPU still works): https://jsfiddle.net/9v4fa3gw/
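A rough sketch of the two configurations the fiddle compares, assuming a placeholder model URL (depending on the onnxruntime-web bundle, the WebGPU EP may need to be imported from 'onnxruntime-web/webgpu'):

```js
import * as ort from 'onnxruntime-web';

// Placeholder URL; the fiddle points at one of the affected fp16 models.
const MODEL_URL = 'https://example.com/model_fp16.onnx';

async function compareExecutionProviders() {
  // WASM (CPU) EP: hits the SkipSimplifiedLayerNormalization error on 1.20.0
  await ort.InferenceSession.create(MODEL_URL, { executionProviders: ['wasm'] });

  // WebGPU EP: reported to still work
  await ort.InferenceSession.create(MODEL_URL, { executionProviders: ['webgpu'] });
}

compareExecutionProviders().catch(console.error);
```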

fs-eire commented 3 weeks ago

@xenova Thank you for the issue report! I did some investigation and identified that the issue is in the CPU implementation of f16 [Skip][Simplified]LayerNormalization.

This issue is not web specific. All language bindings may run into this issue if using CPU/f16 on any of the 4 operators.

We will work on the fix ASAP and publish a dev build of onnxruntime-web once it's done. We will also work on a patch release that includes the fix.
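For context, SkipSimplifiedLayerNormalization is roughly an RMS norm over input + skip, scaled by a 1-D gamma weight (the tensor the error above reports as missing). A simplified per-row sketch, with optional inputs and outputs omitted:

```js
// Simplified reference semantics for one hidden-state row (a sketch, not the
// actual kernel): x = input + skip, then RMS-normalize and scale by gamma.
function skipSimplifiedLayerNorm(input, skip, gamma, epsilon = 1e-5) {
  const x = input.map((v, i) => v + skip[i]);                  // input + skip
  const meanSq = x.reduce((s, v) => s + v * v, 0) / x.length;  // mean of squares
  const invRms = 1 / Math.sqrt(meanSq + epsilon);
  return x.map((v, i) => v * invRms * gamma[i]);               // scale by gamma
}
```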

xenova commented 3 weeks ago

@fs-eire Amazing - thanks so much! 🥳 I'll upgrade the build when you're ready 👍 Will this include a dev version of onnxruntime-node too?

oyazdanb commented 3 weeks ago

> @xenova Thank you for the issue report! I did some investigation and identified that the issue is in the CPU implementation of f16 [Skip][Simplified]LayerNormalization.
>
> This issue is not web specific. All language bindings may run into this issue if using CPU/f16 on any of the 4 operators.
>
> We will work on the fix ASAP and publish a dev build of onnxruntime-web once it's done. We will also work on a patch release that includes the fix.

I see this issue in Onnxruntime-DirectML as well; does this fix help with ort-dml?

fs-eire commented 3 weeks ago

Will investigate this issue. It looks like the problem is in CPU EP and

> @fs-eire Amazing - thanks so much! 🥳 I'll upgrade the build when you're ready 👍 Will this include a dev version of onnxruntime-node too?

Currently the pipeline does not support this but I can do a manual publish if necessary.

fs-eire commented 3 weeks ago

> @xenova Thank you for the issue report! I did some investigation and identified that the issue is in the CPU implementation of f16 [Skip][Simplified]LayerNormalization. This issue is not web specific. All language bindings may run into this issue if using CPU/f16 on any of the 4 operators. We will work on the fix ASAP and publish a dev build of onnxruntime-web once it's done. We will also work on a patch release that includes the fix.

> I see this issue in Onnxruntime-DirectML as well; does this fix help with ort-dml?

I am not sure if the problem that you saw is exactly caused by this. If it is, the fix should help.

xenova commented 3 weeks ago

> Currently the pipeline does not support this but I can do a manual publish if necessary.

Yes please! 😇 Transformers.js v3.1.0 will include this fix

fs-eire commented 3 weeks ago

The fix is being worked on, and we want to make sure the change fixes the problem before it's merged.

@xenova could you please help verify whether the fix works? (replace the dist folder with the contents of dist.zip)
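For the browser case, one way to test patched WASM artifacts is to serve the contents of the patched dist folder and point the runtime at them (the path below is a placeholder; this only redirects where the .wasm/.mjs files are fetched from, so the JS bundle itself still needs to come from the replaced dist):

```js
import * as ort from 'onnxruntime-web';

// Hypothetical location where the patched dist/ contents are served.
ort.env.wasm.wasmPaths = '/patched-dist/';
```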

xenova commented 3 weeks ago

Here is the new error message I get:

failed to inference ONNX model: Error: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running SkipSimplifiedLayerNormalization node. Name:'/model/layers.0/post_attention_layernorm/SkipLayerNorm' Status Message: gamma is expected to have 1 dimension, got 0.

fs-eire commented 2 weeks ago

I made some updates; this is the latest fix -> dist.zip

xenova commented 2 weeks ago

Great! That fixed it @fs-eire 🥳 Please let me know when you put a dev build out 👍

xenova commented 1 week ago

@fs-eire I see it was merged in https://github.com/microsoft/onnxruntime/commit/f0ac5e0d3dc07d519f4476682c03f33fc79e0eb5. Could you put out a dev build for onnxruntime-node? 😇

jywu-msft commented 1 week ago

> @fs-eire I see it was merged in f0ac5e0. Could you put out a dev build for onnxruntime-node? 😇

+@guschmue

ulgens commented 1 day ago

👀