ONNX -> NNEF changes outlet names

tgolsson commented 2 years ago

Hello!

Thanks to yesterday's fix I'm now able to load and infer on NNEFs. However, we rely heavily on input and output naming to ensure we input the right arguments to the right position; and the conversion from ONNX to NNEF breaks this. It seems like the outlet in the new is a variant of the last layer in the ONNX model:

// ONNX
$ tractor-cli api brains/test-large.onnx
Inputs:
        features                                : [280]
        images                                  : [30, 30, 2]
        epsilon                                 : [24]

Outputs:
        tanh_stretch                            : [24]

// NNEF
$ tractor-cli api brains/test-large.nnef
Inputs:
        features                                : [280]
        images                                  : [30, 30, 2]
        epsilon                                 : [24]

Outputs:
        model_4_tanh_stretch_PartitionedCall_Tanh: [24]

This is how I extract the output API. There's a minor difference in shape extraction due to differences between InferenceModel and TypedModel, but the the rest is the same. By liberally calling this across the conversion path I've found that this remains consistent until I call nnef.write(model, writer). The output file doesn't seem to contain tanh_stretch at all.

        // model: TypedModel
        let mut outputs: Vec<(String, Vec<usize>)> = Default::default();

        for output_outlet in &model.outputs {
            let name = model.outlet_labels[output_outlet]
                .split(':') // ignore tensorflow slot numbers
                .next()
                .unwrap()
                .to_owned();

            let output_shape = &model.output_fact(output_outlet.slot)?.shape;
            let clean_shape = output_shape
                .iter()
                .filter_map(|dim| dim.to_i64().map(|v| v as usize).ok())
                .collect();

            outputs.push((name, clean_shape));
        }

Looking at the generated .nnef file I've got the following definition:

graph network( features, images, epsilon ) -> ( model_4_tanh_stretch_PartitionedCall_Tanh ) {

Where I'd expect the latter part to be tanh_stretch to match the outlet name in the onnx. Looking at the tract source I'm guessing this is caused by the mismatch in code between what I do to find the names and how the serializer deals with it; as it seems like even for an outlet (https://github.com/sonos/tract/blob/main/nnef/src/ser.rs#L116) it'll end up finding the node for the outlet, not the outlet label itself which never seems to be prepared for serialization at all.

kali commented 2 years ago

Mmm. It was my intention to preserve the original name as much as possible, for graph interface (and as much as practical for internal nodes), so I think I want to call this a bug, hoping I'm not opening a terrible pandora box. Would you be able to produce a test case model for this issue ?

tgolsson commented 2 years ago

I can't upload the file to the issue but if you shoot me a mail (tom.solberg@embark-studios.com) I can mail you one.

kali commented 2 years ago

pushed a tentative fix, tell me if it's enough.

tgolsson commented 2 years ago

@kali Just about to go OOO for rest of day; will test tomorrow morning!

tgolsson commented 2 years ago

@kali Tested and looks to be correct! Thanks again for the quick turn around.

kali commented 2 years ago

Thanks. I can not merge as is, I need to investigate a side regression, but should be happen soon.

sonos / tract

ONNX -> NNEF changes outlet names #713