tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

ONNX import regression: `Invalid shape for broadcasting - the dimension from the 2. to last position has conflicting values 3 and 32 from different inputs` #2243

Open SimonBrandner opened 2 weeks ago

SimonBrandner commented 2 weeks ago
ERROR burn_import::logger: PANIC => panicked at /home/simon/.cargo/git/checkouts/burn-178c6829f420dae1/6b51b73/crates/onnx-ir/src/dim_inference.rs:850:33:
Invalid shape for broadcasting - the dimension from the 2. to last position has conflicting values 3 and 32 from different inputs    

I get this error after updating burn in my test repo from c94e7438293b1ce441f75fbed8dd651ff1b54b92 to the newest commit (6b51b73a5f8411332f90d1c60e4b8f88de0fe3db).

laggui commented 2 weeks ago

Broadcasting checks were added in #2213.

It seems like your test repo contains multiple models. Can you point to the one with the flagged regression? We need to identify which node with broadcast input support is causing this error.

SimonBrandner commented 2 weeks ago

I believe it is this one: https://github.com/SimonBrandner/burn-import-testbed/blob/main/models/recognizer.onnx

hexd0t commented 2 weeks ago

I'm trying to look at your examples, but downloading the ONNX models fails with:

> git lfs fetch
fetch: Fetching reference refs/heads/main
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/SimonBrandner/burn-import-testbed.git/info/lfs'

Are the models hosted somewhere else?

Can you tell which node is being evaluated by looking at the log before the panic (it should be one of the ops capable of broadcasting)? Then take a look at the input tensor shapes using Netron, similar to the following screenshot:

[Screenshot: input tensor shapes of a node as shown in Netron]

According to the panic message, there are tensors with different lengths > 1 for the same dimension, e.g., a tensor of shape [3, x, y] is broadcast together with a tensor of shape [32, x, y], which is invalid according to the ONNX spec (although the validation on import might also have a bug):

In ONNX, a set of tensors are multidirectional broadcastable to the same shape if one of the following is true:

1. The tensors all have exactly the same shape.
2. The tensors all have the same number of dimensions and the length of each dimensions is either a common length or 1.
3. The tensors that have too few dimensions can have their shapes prepended with a dimension of length 1 to satisfy property 2.
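
To make the rule above concrete, here is a minimal Rust sketch of a multidirectional broadcast check. It is not the code in onnx-ir's dim_inference.rs; the function name `broadcast_shape`, its signature, and the error text are hypothetical and only illustrate the spec wording quoted above. The shapes in `main` are made up as well.

```rust
/// Hypothetical helper (not burn's actual implementation): computes the
/// multidirectional broadcast shape of all inputs, or returns a message
/// describing the first pair of conflicting dimension lengths.
fn broadcast_shape(shapes: &[Vec<usize>]) -> Result<Vec<usize>, String> {
    // Shapes with too few dimensions behave as if 1s were prepended, so the
    // result has the rank of the longest input.
    let rank = shapes.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut out = vec![1usize; rank];

    for shape in shapes {
        // Walk each shape from its last dimension so trailing dims line up.
        for (pos, &dim) in shape.iter().rev().enumerate() {
            let i = rank - 1 - pos;
            if out[i] == 1 {
                // Nothing but 1s seen here so far: adopt this length.
                out[i] = dim;
            } else if dim != 1 && dim != out[i] {
                // Two different lengths > 1 in the same position: invalid.
                return Err(format!(
                    "conflicting values {} and {} at position {} from the end",
                    out[i],
                    dim,
                    pos + 1
                ));
            }
        }
    }
    Ok(out)
}

fn main() {
    // Valid: [1, 5] is treated as [1, 1, 5] and stretched to [3, 4, 5].
    println!("{:?}", broadcast_shape(&[vec![3, 4, 5], vec![1, 5]]));
    // Invalid: 3 and 32 meet in the 2nd-to-last position (made-up shapes).
    println!("{:?}", broadcast_shape(&[vec![3, 8], vec![32, 8]]));
}
```

A check of this kind can only be as good as the shapes it is given: if an earlier inference step records a wrong shape for one input, a valid model will still be rejected.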

SimonBrandner commented 2 weeks ago

Oh, sorry about that... I'll probably zip them at some point, Git LFS is a pain... The model can be downloaded from here but you might need to update the opset...

Let me know if this is enough for you to proceed! (I can try to help further later, if necessary.)

hexd0t commented 2 weeks ago

Thanks for making it available, I can reproduce the error (the screenshot already has some debug info added):

[Screenshot: reproduced error with added debug output]

Looking at the model in Netron, the operation gets a valid input for broadcasting (I checked this to confirm that the panic is actually a false positive):

[Screenshot: the operation's input shapes as shown in Netron]

Comparing the shapes processed by the broadcast validation with those seen in Netron, it seems the input shapes passed to the dim_inference step for the PRelu are already wrong. The recently added validation just exposes this problem; the added check itself does not appear to be at fault.

I'll see whether I can locate the source of the mismatched input shape; just wanted to post a status update / help others get started if they also want to take a look.