Slice op incorrectly parses 'ends' attribute

kernhanda commented 2 years ago

I was trying to run onnx2c on a small model of mine (coco_3M.zip), but I ran into the following error:

$ ./onnx2c ~/code/inferred_person_coco_3M.onnx
2022-03-05 15-18-39.725 [Fatal] (resolveOutput) Concat's input tensors must have the same shape, except for the dimension size of the axis to concatenate on.
onnx2c: ~/code/onnx2c/src/nodes/concat.h:103: virtual void toC::Concat::resolveOutput(const std::vector<const toC::Tensor*>&, std::vector<toC::Tensor*>&): Assertion `false' failed.
[1]    1633 abort      ./onnx2c ~/code/inferred_person_coco_3M.onnx

The same model passes ONNX's model checker, renders fine in netron.app, and executes correctly through ONNXRuntime. I traced the differences to the Slice nodes in this model, which has an ends attribute of 9223372036854775807, aka 2**63 - 1, aka std::numeric_limits<int64_t>::max(). Protobuf for some reason parses this number as -1, even though the value is being stored in an int64_t. Not sure why that is, but I thought you could take a look.

This issue can be seen starting at node Slice_33. onnx2c calculates the output as 1x16x14x16 because the ends is read as [-1] instead of [2**63 - 1].

The following python script can be used to update models for a workaround:

import onnx

MODEL_IN_FILENAME = 'foo.onnx'
MODEL_OUT_FILENAME = 'foo_adjusted.onnx'

m = onnx.load_model(MODEL_IN_FILENAME)

def handle_slice_node(slice_node):                     
    for a in slice_node.attribute:                     
        if a.name == 'ends':                           
            for i, v in enumerate(a.ints):             
                if v == (2**63 - 1):                   
                    print("Found a value to adjust")   
                    a.ints[i] = 2**31 - 1              

for n in m.graph.node:        
    if n.op_type == 'Slice':  
        handle_slice_node(n)  

onnx.save_model(m, MODEL_OUT_FILENAME)

kraiskil commented 2 years ago

Thanks for reporting. With such a through report, it was easy to spot this - this is a clear cut bug in onnx2c. The branch mentioned above fixes this. I want to see if I can make a single-node regression test for this before merging into master.

Slightly related: in order to be able to replicate this issue, I had to fix another bug in the Concat node (patch in the branch above) that caused onnx2c to error out before hitting this. Do you have local changes to onnx2c?

kernhanda commented 2 years ago

Wow, keen eye! I totally missed the obvious bug fix :)

Good point about the other changes in concat. Here they are

diff --git a/src/nodes/concat.h b/src/nodes/concat.h
index d158bf1..eea1ad6 100644
--- a/src/nodes/concat.h
+++ b/src/nodes/concat.h
@@ -85,7 +85,7 @@ namespace toC {

            void resolveOutput(const std::vector<const Tensor *> &inputs, std::vector<Tensor *> &outputs) override {
                    node_inputs = inputs;
-                 if (inputs.size() < 2)
+                 if (inputs.size() < 1)
                            ERROR("Wrong number of inputs to Concat");

                    if (axis < 0)
@@ -98,7 +98,7 @@ namespace toC {
                    size_t i, j;
                    std::vector<int> dims = inputs[0]->data_dim;
                    for (i = 0; i < input_count; i++) {
-                         for (j = 1; j < dims.size(); j++) {
+                         for (j = 0; j < dims.size(); j++) {
                                    if (dims[j] != inputs[i]->data_dim[j] && (int) j != axis)
                                            ERROR("Concat's input tensors must have the same shape, except for the "
                                                      "dimension size of the axis to concatenate on.");

Rationate:

Min number of inputs to concat is actually 1
If axis == 0, it didn't look like the dims were being checked, but I didn't check if there was another code path for the verification

kraiskil commented 2 years ago

Thanks for the diff. I didn't find an easy way to add a minimized test for this, so I merged this branch issue_15 to master without a test. Looking at the onnx2c test suite there already was a slice_INT64_MAX test, but that was for ONNX opset 10 or later, where the start/end/axis had moved from attributes to input tensors, so this problem did not repeat... I could not find with quick searching an api call to set the ONNX opset version with the script using scailable onnx which I use to make the unit tests, so I let this one go without a regression test.

About the changes to concat:

the first point is the bug I had to fix too
the second: that looks like a bug too - I don't see why dimension 0 should not be checked at this point. It looks like a case of "the programmer stops working when the code starts working"... (this was a hobby project, and that shows at times :) ). I'll commit that fix, thanks.

kraiskil / onnx2c

Slice op incorrectly parses 'ends' attribute #15