tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Trouble importing FaceONNX detector model: `Only tensor indices is valid` #1915

Open SimonBrandner opened 1 month ago

SimonBrandner commented 1 month ago

Describe the bug

Error during importing the detector model:

ERROR burn_import::logger: PANIC => panicked at /home/simon/Data1/GIT/Rust/burn/crates/burn-import/src/onnx/dim_inference.rs:820:14:
Only tensor indices is valid    

To Reproduce

  1. Download the detector model from https://drive.google.com/file/d/1tB7Y5l5Jf2270IisgSZ3rbCQs0-pkNIQ/view?usp=drive_link (also attached below)
  2. Move the file to src/model/detector.onnx
  3. Add this to build.rs (a sketch of how the generated code is then included follows these steps):
    ModelGen::new()
        .input("src/model/detector.onnx")
        .out_dir("model/")
        .run_from_script();
  4. Build and see the error
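
For completeness, the code generated by ModelGen is typically pulled into the crate like this (a sketch following the burn-import ONNX workflow; the module name is an assumption, and in this issue the build fails before this point is ever reached):

    // src/model/mod.rs (sketch)
    pub mod detector {
        // The path matches .out_dir("model/") and the detector.onnx input above.
        include!(concat!(env!("OUT_DIR"), "/model/detector.rs"));
    }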

What I've figured out so far

Attachment: face_detector_640.onnx.zip

laggui commented 1 month ago

As you correctly pointed out, the issue is with the Gather node.

We check the inputs to make sure they are in accordance with the spec, but for the indices we expect a tensor, while your model uses a scalar (rank-0 tensor). That is still valid according to the spec, but it currently runs into issues with the rest of the codegen for Burn (see issue #1689).

laggui commented 1 month ago

Also, I just realized your model was generated with an older opset (v9), and we might not support all of the ops there. Right now, the ops we have should support the latest op definitions as of opset 13. Some also have support for older opsets, but not all (e.g., Slice, which I see in your model at opset v1, where we don't parse the attributes).

The ONNX spec is large and our bandwidth for that is a bit limited.

antimora commented 3 weeks ago

We should probably fix this to support older opsets.

As a workaround, can you convert to a newer opset? There are some tools to upgrade the ONNX file.

SimonBrandner commented 3 weeks ago

I tried this and it did not help... That said, from the comments my understanding is that the `Only tensor indices is valid` issue isn't due to an old opset, is it?

SimonBrandner commented 3 weeks ago

This dirty fix seems to be working:

diff --git a/crates/burn-import/src/onnx/dim_inference.rs b/crates/burn-import/src/onnx/dim_inference.rs
index 94d66d22..c404829c 100644
--- a/crates/burn-import/src/onnx/dim_inference.rs
+++ b/crates/burn-import/src/onnx/dim_inference.rs
@@ -808,11 +808,16 @@ fn gather_update_outputs(node: &mut Node) {
     };

     let indices_tensor = match &node.inputs[1].ty {
-        ArgType::Tensor(tensor) => tensor,
-        _ => panic!("Only tensor indices is valid"),
+        ArgType::Tensor(tensor) => tensor.clone(),
+        ArgType::Scalar(scalar) => TensorType {
+            elem_type: scalar.clone(),
+            dim: 0,
+            shape: None,
+        },
+        _ => panic!("Only tensor or scalar indices are valid"),
     };

-    if indices_tensor.dim != 1 {
+    if indices_tensor.dim > 1 {
         panic!("Gather: indices tensor rank above 1 not supported")
     } 

but then I run into this:

ERROR burn_import::logger: PANIC => panicked at /home/simon/GIT/Rust/burn/crates/burn-import/src/onnx/op_configuration.rs:302:9:
Flatten: input tensor must have at least 2 dimensions (got 0)    

(btw, is there a Discord server/Matrix space/whatever which would be more suitable for debugging this?)

SimonBrandner commented 3 weeks ago

As far as I can tell, according to the docs, the input tensor really must have at least 2 dimensions, which makes me wonder how I am getting a 0-D one...

https://onnx.ai/onnx/operators/onnx__Flatten.html

laggui commented 3 weeks ago

(btw, is there a Discord server/Matrix space/whatever which would be more suitable for debugging this?)

Absolutely! Come join us here.

laggui commented 3 weeks ago

I tried this and it did not help... That said, from the comments my understanding is that the `Only tensor indices is valid` issue isn't due to an old opset, is it?

Yeah, it's not just because of the opset. Right now the first errors you're seeing are due to the rank-0 tensor. When I checked your model last time, the Gather node used a single value to gather, and the result was then unsqueezed to a 1-D tensor.

We currently don't support 0-dim tensors, so even if we try to work around it by returning a 1-D tensor, I don't think the rest of the codegen will match. We would have to add support for that.

Finally, you do have some ops that use an older opset which may not be supported (as pointed out in my last comment, some attributes which don't exist in newer opsets are not parsed).

antimora commented 2 weeks ago

This part needs to be fixed (screenshot attached):

We need to support Gather, Unsqueeze, and Concat operations on Shape inputs.
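
For illustration, here is a rough, hypothetical sketch of what accepting a Shape data input in gather_update_outputs could look like, reusing the ArgType/TensorType types from the diff above (this is an assumption about one possible shape of the fix, not the actual implementation):

    // Hypothetical sketch: treat a Shape input to Gather as a 1-D Int64 tensor
    // so that rank inference can proceed; the real fix may look quite different.
    let input_tensor = match &node.inputs[0].ty {
        ArgType::Tensor(tensor) => tensor.clone(),
        ArgType::Shape(rank) => TensorType {
            elem_type: ElementType::Int64,
            dim: 1,
            shape: Some(vec![*rank]),
        },
        _ => panic!("Gather: invalid input type"),
    };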

antimora commented 2 weeks ago

I used onnx-simplifier (https://github.com/daquexian/onnx-simplifier), which removes lots of these crazy Gather/Reshape operations that were due to the limited capabilities of older ONNX opset versions.

Here is the updated version:

face_detector_opt.onnx.zip

Output of the tool:

[burn]% onnxsim ~/Downloads/face_detector_640.onnx ./face_detector_opt.onnx
Simplifying...
Finish! Here is the difference:
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃                    ┃ Original Model ┃ Simplified Model ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Add                │ 3              │ 3                │
│ BatchNormalization │ 35             │ 0                │
│ Concat             │ 13             │ 5                │
│ Constant           │ 240            │ 111              │
│ Conv               │ 52             │ 52               │
│ Div                │ 2              │ 1                │
│ Exp                │ 1              │ 1                │
│ Gather             │ 8              │ 0                │
│ Mul                │ 4              │ 4                │
│ Relu               │ 37             │ 37               │
│ Reshape            │ 8              │ 8                │
│ Shape              │ 8              │ 0                │
│ Slice              │ 6              │ 4                │
│ Softmax            │ 1              │ 1                │
│ Sub                │ 1              │ 1                │
│ Transpose          │ 8              │ 8                │
│ Unsqueeze          │ 24             │ 0                │
│ Model Size         │ 1.5MiB         │ 1.3MiB           │
└────────────────────┴────────────────┴──────────────────┘

Now, there is still a slight issue with the Slice operation, whose attributes we don't parse yet. I am working on a fix.

laggui commented 2 weeks ago

I used onnx-simplifier (https://github.com/daquexian/onnx-simplifier), which removes lots of these crazy Gather/Reshape operations that were due to the limited capabilities of older ONNX opset versions. [...]

Ha, that's what I suggested on Discord. Glad it cleared up the redundant stuff 🙂

antimora commented 2 weeks ago

Submitted a PR for the Slice issue: https://github.com/tracel-ai/burn/pull/1989

But I stumbled upon another bug, https://github.com/tracel-ai/burn/issues/1983, which I am working to fix.

Overall, I think I am very close to fixing it. The good news is that all ops seem to be working.

antimora commented 3 days ago

I got face-onnx building; however, there is a runtime error that was probably introduced recently when we were refactoring tensors. @laggui did some work on the tensor random initialization. Maybe he knows what's going on.

I am attaching all my artifacts in my repository, which will be hosted temporarily:

https://github.com/antimora/face-onnx

Please continue debugging. I think the bug is in Burn's core.

antimora commented 3 days ago

I got face-onnx building; however, there is a runtime error that was probably introduced recently when we were refactoring tensors. [...]

I have filed the model initialization bugs here: https://github.com/tracel-ai/burn/issues/2043

antimora commented 21 hours ago

Hi @SimonBrandner,

We have a working ONNX! Thanks to @laggui for fixing the latest bugs.

Please check out: https://github.com/antimora/face-onnx

Your original ONNX file went through a series of cleanups using Python scripts. We will have a more permanent fix in the upcoming PR #2008. For now, this should unblock you.

Please see the ONNX inference examples (https://github.com/tracel-ai/burn/tree/main/examples/onnx-inference and https://github.com/tracel-ai/models/tree/main/mobilenetv2-burn) for how to embed your weights or specify the location of your weights file. In my example, they are generated into the target directory.
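
For anyone following along, here is a minimal sketch of wiring up and running the generated model, loosely following the onnx-inference example. The backend choice, the input shape, and the way the record is loaded are assumptions; check the generated detector.rs for the exact constructor and forward signature:

    // main.rs (sketch); requires burn with the "ndarray" feature.
    use burn::backend::NdArray;
    use burn::tensor::Tensor;

    // Code generated by burn-import at build time (see the build.rs snippet at the top).
    mod detector {
        include!(concat!(env!("OUT_DIR"), "/model/detector.rs"));
    }

    type Backend = NdArray<f32>;

    fn main() {
        let device = Default::default();

        // Loads the record produced at build time; the onnx-inference example uses
        // Model::default(), but the generated constructor may differ between versions.
        let model = detector::Model::<Backend>::default();

        // Dummy NCHW input; [1, 3, 480, 640] is an assumed shape for this detector --
        // adjust it to the model's declared input.
        let input = Tensor::<Backend, 4>::zeros([1, 3, 480, 640], &device);
        let _output = model.forward(input);
    }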