tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

burn-import panic #2023

Closed doronnac closed 1 week ago

doronnac commented 1 month ago

Describe the bug

Hi there, trying to import some ONNX files causes the following panic:

  burn-import-0.13.2/src/onnx/from_onnx.rs:363:64: index out of bounds: the len is 1 but the index is 1

My guess is this has something to do with burn-import being a work in progress. If that's the case, I'd be happy to give fixing it a go, with some tips if possible. I see this project as very valuable, so I'd like to support it.

To Reproduce

Repro

antimora commented 1 month ago

We will be happy to accept your contribution. You can consult the contributor book to read up more on Burn's ONNX support.

You should try the main branch. We have over 20 new ops, all contributed recently.

alteredoxide commented 3 weeks ago

@doronnac Have you made any progress on this, or at least come up with a temporary fix/workaround? I am also running into this issue.

doronnac commented 3 weeks ago

@alteredoxide No, unfortunately. Can you share the model you’re having trouble with?

antimora commented 3 weeks ago

> @doronnac Have you made any progress on this, or at least come up with a temporary fix/workaround? I am also running into this issue.

Please try the main branch if you haven't yet. We've made many updates since the last release.

alteredoxide commented 3 weeks ago

@doronnac Yeah, I have tried two different models:

I have also tried the quantized versions of both of these, but those result in entirely different errors.

doronnac commented 3 weeks ago

@antimora I’ve since taken a somewhat different approach, but I’ll try to help @alteredoxide as a learning experience. Obviously your help will be much more effective if you have the time for it. Have a lovely weekend.

alteredoxide commented 3 weeks ago

I just noticed that burn via crates is only up to version 0.13.2, but the main branch on github is at 0.14. I just tried building the first model (mxbai) after specifying the repo for both burn and burn-import in order to get the latest version... The result is that I'm no longer getting the index error, but instead I get the following:

  ERROR burn_import::logger: PANIC => panicked at /home/ao/.cargo/git/checkouts/burn-178c6829f420dae1/5a0c1dc/crates/onnx-ir/src/dim_inference.rs:819:9:
  Gather: indices tensor rank above 1 not supported

So that's cool 😒 Are there any onnx models that you have successfully loaded? If so, which ones?

alteredoxide commented 3 weeks ago

@antimora Is there a particular reason for not supporting rank > 1 for the indices in dim_inference for Gather nodes? It seems there is already code in place to compute the output rank for any input shape, but it's coded to panic if indices_tensor.dim > 1? If I comment that block out, then parsing finishes for the model I'm trying to load. However, there is a different downstream issue that I don't think is related, but perhaps it is?...

DEBUG burn_import::burn::graph: Registering node => 'shape'    
  DEBUG burn_import::burn::graph: Registering node => 'constant'    
  ERROR burn_import::logger: PANIC => panicked at /path/to/burn/crates/burn-import/src/onnx/to_burn.rs:1245:18:
  Can't transform scalar to tensor.
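(Side note on the rank > 1 restriction discussed above: the ONNX Gather spec implies the rank rule sketched below. This is a minimal illustration, not the actual onnx-ir dim_inference code, and gather_output_rank is a hypothetical helper.)

```rust
// Output rank per the ONNX Gather spec: rank(output) = rank(data) + rank(indices) - 1.
// `gather_output_rank` is a hypothetical helper, not part of burn or onnx-ir.
fn gather_output_rank(data_rank: usize, indices_rank: usize) -> usize {
    data_rank + indices_rank - 1
}

fn main() {
    // A rank-2 embedding table gathered with rank-2 indices (e.g. [batch, seq])
    // yields a rank-3 output, which is exactly the indices-rank > 1 case that
    // dim_inference currently rejects.
    assert_eq!(gather_output_rank(2, 2), 3);
}
```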

antimora commented 3 weeks ago

> @antimora Is there a particular reason for not supporting rank > 1 for the indices in dim_inference for Gather nodes? It seems there is already code in place to compute the output rank for any input shape, but it's coded to panic if indices_tensor.dim > 1? If I comment that block out, then parsing finishes for the model I'm trying to load. However, there is a different downstream issue that I don't think is related, but perhaps it is?...
>
>   DEBUG burn_import::burn::graph: Registering node => 'shape'
>   DEBUG burn_import::burn::graph: Registering node => 'constant'
>   ERROR burn_import::logger: PANIC => panicked at /path/to/burn/crates/burn-import/src/onnx/to_burn.rs:1245:18:
>   Can't transform scalar to tensor.

There are a few active PRs trying to fix Gather-related issues, #2148 and #2149, by @laggui and @hexd0t. I am not sure these will fix your issue, but keep monitoring after the merge. You can share your ONNX file (or a link), if shareable, for testing/troubleshooting purposes.

laggui commented 3 weeks ago

Gather op actually translates to tensor.select(dim, indices), which is limited to 1d indices. So the dim inference computes the correct output shape, but the codegen is not implemented for rank > 1.

For that, we would need another PR. Quickly reading the op spec, I think it can be implemented by doing tensor.select(dim, indices[i]) for each set of indices and then concatenating the results.
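A rough sketch of that idea for a rank-2 input and rank-2 indices on axis 0 (illustration only, not Burn's actual codegen; it assumes Burn's Tensor::select, Tensor::cat, slice, squeeze and reshape APIs as of 0.13/0.14):

```rust
use burn::tensor::{backend::Backend, Int, Tensor};

/// Hypothetical helper: emulate ONNX Gather on axis 0 with rank-2 indices by
/// doing one `select` per row of indices and concatenating the results.
fn gather_axis0_rank2<B: Backend>(
    input: Tensor<B, 2>,        // shape [v, e]
    indices: Tensor<B, 2, Int>, // shape [m, n]
) -> Tensor<B, 3> {
    let [m, n] = indices.dims();
    let [_, e] = input.dims();
    let rows: Vec<Tensor<B, 2>> = (0..m)
        .map(|i| {
            // Row i of indices as a 1-D index tensor, then a regular select along axis 0.
            let row: Tensor<B, 1, Int> = indices.clone().slice([i..i + 1, 0..n]).squeeze(0);
            input.clone().select(0, row) // shape [n, e]
        })
        .collect();
    // [m * n, e] -> [m, n, e], matching the spec's output rank (r + q - 1 = 3).
    Tensor::cat(rows, 0).reshape([m, n, e])
}
```

Generalizing this to arbitrary input rank, axis, and indices rank is where the extra codegen work would come in.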

alteredoxide commented 3 weeks ago

@antimora @laggui Thanks for the quick replies. After reading #2148 and #2149, I ended up landing on another related issue: #2115, which is the primary problem I'm encountering (the scalar error aside, which seems to be solved by #2148). @laggui, based on the linked spec, I mostly agree with what you said about concatenating the results of tensor.select for each sub-tensor in indices, though I think there is more logic required for indices rank > 2.

I'll be happy to work on this particular problem if nobody objects. I suppose further discussion should take place in #2115?

laggui commented 2 weeks ago

> I just noticed that burn via crates is only up to version 0.13.2, but the main branch on github is at 0.14. I just tried building the first model (mxbai) after specifying the repo for both burn and burn-import in order to get the latest version... The result is that I'm no longer getting the index error, but instead I get the following:
>
>   ERROR burn_import::logger: PANIC => panicked at /home/ao/.cargo/git/checkouts/burn-178c6829f420dae1/5a0c1dc/crates/onnx-ir/src/dim_inference.rs:819:9:
>   Gather: indices tensor rank above 1 not supported
>
> So that's cool 😒 Are there any onnx models that you have successfully loaded? If so, which ones?

A couple of onnx models have been successfully loaded by users, though I can't recall exactly which. We do have squeezenet on the models repo.

> @antimora @laggui Thanks for the quick replies. After reading #2148 and #2149, I ended up landing on another related issue: #2115, which is the primary problem I'm encountering (the scalar error aside, which seems to be solved by #2148). @laggui, based on the linked spec, I mostly agree with what you said about concatenating the results of tensor.select for each sub-tensor in indices, though I think there is more logic required for indices rank > 2.
>
> I'll be happy to work on this particular problem if nobody objects. I suppose further discussion should take place in #2115?

Contributions are more than welcome! 🙏 The linked issue seems to be evolving based on the OP's progress, so I think we can keep the discussion here if needed.

alteredoxide commented 2 weeks ago

@antimora @laggui I have been going through the parts of the codebase that pertain to gather. I've come to some conclusions and made a change that I've written a test for, but I'll hold off on creating a PR until I know that I'm not stepping on anybody's toes:

Findings

  1. #2149 is operating under the ONNX spec for Gather, which we already know, but the gather operations throughout the codebase actually mostly follow the GatherElements spec, which is what the gather method in PyTorch does -- I don't think that's a fault of burn, but rather a confusing consequence of there being no single standard for gather across current frameworks.
  2. Due to finding (1.), the output shape of gather_update_outputs() should actually just be the same as the index shape, according to the GatherElements ONNX spec (see the sketch after this list).
  3. The candle and tch implementations already depend on the functionality of those backends, which leads to a discrepancy:
    • The tch implementation follows the spec, including the requirement that index and the input tensor have the same rank, but not necessarily the same shape.
    • The candle implementation (in candle-core-0.6.0) imposes a stricter rule: both tensors must have the same shape, except along the specified axis (aka dim).
  4. The implementation for the ndarray backend currently follows the same (stricter) restriction as the candle backend.
  5. I haven't been able to find any evidence that burn needs the restriction of a 1D index array, despite other restrictions that exist in some implementations.
  6. One possible exception: I couldn't really tell what was going on with burn-jit for gather via GatherEagerKernel.
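As a concrete illustration of points 2-4 (hypothetical snippet; the ndarray backend is used only for convenience and needs burn's "ndarray" feature): under GatherElements semantics the indices have the same rank as the input, and the output takes the indices' shape.

```rust
use burn::backend::NdArray;
use burn::tensor::{Int, Tensor};

fn main() {
    let device = Default::default();
    // Input: shape [2, 3]
    let x = Tensor::<NdArray, 2>::from_floats([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], &device);
    // Indices: same rank as the input; the shape may differ along `dim` (here dim = 1).
    let idx = Tensor::<NdArray, 2, Int>::from_ints([[0, 2], [1, 0]], &device);
    let out = x.gather(1, idx);
    // GatherElements rule: output shape == indices shape.
    assert_eq!(out.dims(), [2, 2]);
}
```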

Changes that I've implemented and tested

  1. I have modified the ndarray implementation to follow the onnx spec and the way tch does it, and this includes a validation function.
  2. I have added two tests: one for the ndarray impl, and one for tch, both on the same two test cases: one for axis 0 and the other for axis 1. Both test cases are on 2D index arrays.
    • I only added the tch test to illustrate that it's already working for 2D index arrays, and on shapes that are not the same as the input tensor.

You can see the changes I made here

laggui commented 2 weeks ago

@alteredoxide as you noted in your first point, tensor.gather(dim, indices) is not the equivalent of the Gather ONNX op. The closest matches that we have for these ops are:

  • Gather → tensor.select(dim, indices) (with 1-D indices)
  • GatherElements → tensor.gather(dim, indices)

That probably explains the "discrepancies" you observed in the codebase.
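For anyone hitting the same naming confusion, a small hypothetical snippet contrasting the two ops (backend choice is arbitrary; this is not code from the repo):

```rust
use burn::backend::NdArray;
use burn::tensor::{Int, Tensor};

fn main() {
    let device = Default::default();
    let x = Tensor::<NdArray, 2>::from_floats([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], &device);

    // select ~ ONNX Gather (1-D indices): picks whole slices along `dim`.
    let rows = Tensor::<NdArray, 1, Int>::from_ints([2, 0], &device);
    let selected = x.clone().select(0, rows); // shape [2, 2]

    // gather ~ ONNX GatherElements / torch.gather: picks individual elements;
    // the indices tensor has the same rank as the input.
    let idx = Tensor::<NdArray, 2, Int>::from_ints([[1], [0], [1]], &device);
    let gathered = x.gather(1, idx); // shape [3, 1]

    println!("{selected}");
    println!("{gathered}");
}
```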

alteredoxide commented 2 weeks ago

@laggui Thanks, I clearly pulled on the wrong thread, and falsely assumed your previous statement about mapping to select was more figurative than literal (as though the logic was essentially performing select). Nonetheless, I think the changes to the ndarray gather are useful, since they bring it in line with the GatherElements spec.

I believe I found the correct thread to pull on this time; can you confirm this is the right starting point? Thanks!

laggui commented 2 weeks ago

> @laggui Thanks, I clearly pulled on the wrong thread, and falsely assumed your previous statement about mapping to select was more figurative than literal (as though the logic was essentially performing select).

That's ok 😉 To be fair, it is a bit confusing with the similar (but different) names.

> Nonetheless, I think the changes to the ndarray gather are useful, since they bring it in line with the GatherElements spec.

While it is nice for the ndarray implementation, the tensor ops must be consistent across backends for Burn. So the ONNX spec is not what dictates the API's functionality.

> I believe I found the correct thread to pull on this time; can you confirm this is the right starting point? Thanks!

Yep! That's where it should be handled, at the codegen level for the gather node 🙂

alteredoxide commented 2 weeks ago

@laggui I have implemented gather at the codegen level, and added a couple of test cases... While making sure my branch is in line with the latest changes to main, I noticed that there is an ONNX test for gather, which pulls from a generated model that fits the previous approach. Doing a little digging, I believe I found where that is generated. Can you please verify this is the file I should modify?

Additionally, is there anything you can think of, off the top of your head, that I need to consider/modify?

Thanks!

Edit: Assuming that is the file I should modify for the existing ONNX test to pass, it seems like I need to address something that I wanted to save until after an initial review of my current code: distributing this kind of "gather" to the different backends and the burn-import ops signatures (e.g. burn-tch/src/ops/base.rs and ops/tensor.rs), which then raises an issue of naming, since all the current gather implementations are actually for GatherElements. Thoughts?

laggui commented 1 week ago

Yes, we generated the ONNX file for the tests via this Python script. There are also 2 other scripts used to generate different configurations of the gather node, so you can either 1) modify this file to include both the existing operator test and the new gather test for indices rank > 1, or 2) add a new file to generate a separate ONNX file for the new test.

The current tests are still valid, so if they're not passing, something might be wrong in your modifications.

> distributing this kind of "gather" to the different backends and the burn-import ops signatures (e.g. burn-tch/src/ops/base.rs and ops/tensor.rs), which then raises an issue of naming, since all the current gather implementations are actually for GatherElements.

I'm not sure exactly what you mean. Do you mean that you have implemented it as a new operator entirely? I guess we will see once the PR is opened 😄

alteredoxide commented 1 week ago

@laggui I think I have it all figured out...

> I'm not sure exactly what you mean. Do you mean that you have implemented it as a new operator entirely? I guess we will see once the PR is opened

See my second bullet point above.

I'm going to be AFK for the next hour or so, but I'll open a PR once I'm back.

laggui commented 1 week ago

> • I added a gather_onnx method to the tensor/api/numeric.rs module -- this is the situation I had in mind regarding the issue of naming: that module already has gather which implements the spec for GatherElements, rather than Gather

I don't think this is desired for the public API. Depending on the implementation, it's probably best to just do this in the GatherNode codegen. Will wait for your PR to see how this all comes into play 🙂

alteredoxide commented 1 week ago

@laggui No worries. I'll remove the public api code and integrate it into the codegen directly before submitting a PR.

laggui commented 1 week ago

Since the original issue has been resolved, closing this issue in favor of #2192.