sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

Transformer network fails to type #1088

Closed · igor-yusupov closed 1 year ago

igor-yusupov commented 1 year ago

I exported the transformer model https://pytorch.org/docs/1.13/generated/torch.nn.Transformer.html to ONNX and found that version 0.20 is very slow. Version 0.19 works acceptably, but the call to into_optimized fails on the Attention block.

igor-yusupov commented 1 year ago

Error when calling into_optimized:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Translating node #86 "_encoder_layers.0_self_attn_Add_2" Add ToTypedTranslator
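
For reference, a minimal sketch of the loading path that triggers this panic (the file name ./encoder.onnx is an assumption; the calls are tract's public API as used in the examples below):

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the ONNX graph, then optimize it. The panic above is raised
    // while into_optimized() translates the Add node into tract's typed
    // intermediate representation (an unwrap() turns the Err into a panic).
    let _encoder = tract_onnx::onnx()
        .model_for_path("./encoder.onnx")?
        .into_optimized()?
        .into_runnable()?;
    Ok(())
}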

kali commented 1 year ago

Please provide the .onnx file and some inputs as .npz files. I don't know how to export models myself; it would take me hours. See https://github.com/sonos/tract/blob/main/doc/cli-recipe.md#running-a-test-case

igor-yusupov commented 1 year ago

@kali I uploaded weights for encoder to google drive: https://drive.google.com/file/d/1H95-uZ4r9k11FRCOxIysJhq4pgp7B9Ud/view?usp=sharing

inference example code:

use tract_ndarray::Array;
use tract_onnx::prelude::*;

fn main() {
    // Load the ONNX encoder without optimizing it.
    let encoder = tract_onnx::onnx()
        .model_for_path("./encoder.onnx")
        .unwrap()
        .into_runnable()
        .unwrap();

    // Token ids, shape [9, 1].
    let src: Tensor = Array::from_shape_vec(
        (9, 1),
        vec![1i64, 34543, 49935, 1242, 775, 81, 1931, 49936, 2],
    )
    .unwrap()
    .into();

    // All-false attention mask, shape [9, 9].
    let src_len = src.shape()[0];
    let src_mask: Tensor = Array::<bool, _>::from_elem((src_len, src_len), false).into();

    let inputs = tvec!(src.into(), src_mask.into());
    let outputs = encoder.run(inputs).unwrap();
    println!("{:?}", outputs);
}

kali commented 1 year ago

I'm not sure the inputs in the code sample are valid. tract expects batch,sequence,I64 and batch,sequence,Bool, but the code sample makes inputs of shapes [9,1] and [9,9]. In non-optimized mode it looks like tract does not validate input shapes, so I think this works by some kind of accident. But anyway, that check happens after the optimisation, so this is not the source of the problem.
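
For illustration, inputs actually matching the declared batch,sequence layout would look like this (a sketch; batch = 1 and sequence = 9 are assumed values, and build_inputs is a hypothetical helper):

use tract_ndarray::Array2;
use tract_onnx::prelude::*;

// batch = 1, sequence = 9 are illustrative values, not taken from the model.
fn build_inputs() -> (Tensor, Tensor) {
    let src: Tensor = Array2::<i64>::zeros((1, 9)).into();
    let src_mask: Tensor = Array2::<bool>::from_elem((1, 9), false).into();
    (src, src_mask)
}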

@igor-yusupov do you know if onnxruntime is happy with this model and these inputs?

igor-yusupov commented 1 year ago

@kali yes, it's OK, because I set dynamic_axes to make it possible for the network to accept sequences of any length. I also checked the outputs against what the Python code gives, and everything matches. Do I understand correctly that the into_optimized method doesn't work because the input size is dynamic? I'm more concerned that the Rust code is slower, especially with version 0.20.

igor-yusupov commented 1 year ago

yes, I tried removing dynamic_axes and into_optimized now works)

kali commented 1 year ago

into_optimized should work, even with dynamic dimensions. Does the model work and give correct results? Even with a non-1 batch (and sequence) dimension?

igor-yusupov commented 1 year ago

@kali yes, if I change the sequence length it works correctly, but I didn't try changing the batch_size.

kali commented 1 year ago

Can you try changing the batch_size?

igor-yusupov commented 1 year ago

@kali yes, I set batch_size equal to 2 for src and everything works. I send src as a [9, 2] array and src_mask as a [9, 9] array.

kali commented 1 year ago

I think this network is somewhat broken. I double-checked with netron, but I can't make sense of all of it.

According to the metadata, both inputs should be of shape (batch,sequence): src is integer token ids, src_mask is bool. Double-checked here with netron:

[netron screenshots of the input metadata]

This leads to a collision in tract inference (after some improvement to error detection):

[screenshot: tract typing error]

netron and other tools do not see the problem because they do not compute on symbolic dimensions, so the dimensions are "erased". But after setting concrete dimensions (7 for batch, 9 for seq) with onnx-tool and running shape inference, the problem shows up in netron too:

[netron screenshot: shape inference with concrete dims flags the mismatch]

Broadcasting 72,7,7 and 1,7,9 is invalid (the first dim is fine because one of them is a 1; the last one is not). onnx-tool does it anyway, but it's wrong.
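
As a reminder of the rule in play, here is a minimal sketch of the ONNX/NumPy-style broadcasting check (an illustration, not tract's actual implementation): shapes are aligned on the right, and each pair of dims must be equal or contain a 1.

/// True if two shapes are broadcast-compatible under ONNX/NumPy rules.
/// Missing leading dims are implicitly 1, so only the overlap is checked.
fn broadcastable(a: &[usize], b: &[usize]) -> bool {
    a.iter()
        .rev()
        .zip(b.iter().rev())
        .all(|(&x, &y)| x == y || x == 1 || y == 1)
}

fn main() {
    // First dims 72 vs 1: fine, one of them is 1. Last dims 7 vs 9: invalid.
    assert!(!broadcastable(&[72, 7, 7], &[1, 7, 9]));
    assert!(broadcastable(&[72, 7, 7], &[1, 7, 1]));
}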

So is it "just" a metadata issue? Calling the torch export function with a wrong dynamic input specification? If I override the inputs in tract to batch,sequence,i64 and batch,batch,bool, I get the typing to "work", even with the dynamic dimensions.

But this does not make sense. Batch is usually a pure iterative extra dimension; it should not "mix" itself with the semantic dimensions like that. And having batch,batch as an input shape does not make much more sense either... So I suspect the problem runs deeper than a metadata issue.

igor-yusupov commented 1 year ago

@kali yeah, I made a mistake in the naming: I need to swap "batch" and "sentence". I have now built the model without specifying a dynamic axis for batch; it works faster, but the into_optimized method still does not work:

    0: Infering facts
    1: Applying rule outputs[0].shape == sequence,1,512
    2: Unifying shapes Add533_dim_0,Add533_dim_1,Add533_dim_2 and sequence,1,512
    3: Impossible to unify Sym(Add533_dim_0) with Sym(sequence).'

igor-yusupov commented 1 year ago

But my main question is: why does it run 10x slower than the Python code? I also checked running with the onnxruntime lib, and the Python code is faster. And the 0.20 version runs 10000x slower)

kali commented 1 year ago

As long as the network cannot be optimized, tract performance will be terrible; comparing it with any implementation is meaningless.

I think you just need to cancel the output_fact now (.with_output_fact(0, InferenceFact::default())).
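
In context, the call slots into the model-loading chain like this (a sketch; it replaces the exported output fact with an unconstrained one, so tract re-infers the output shape during typing):

let encoder = tract_onnx::onnx()
    .model_for_path("./encoder.onnx")
    .unwrap()
    // Cancel the exported output shape, which carries the conflicting
    // symbolic dimensions, and let tract infer it from scratch.
    .with_output_fact(0, InferenceFact::default())
    .unwrap()
    .into_optimized()
    .unwrap()
    .into_runnable()
    .unwrap();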

igor-yusupov commented 1 year ago

@kali thanks, that helped! But now I have an issue: the decoder produces a different (wrong) output😅. Everything is OK with the encoder. I can send you the decoder code if you want. Without into_optimized the decoder works correctly, but with it the output is different.

igor-yusupov commented 1 year ago

But version 0.19 works correctly: faster than before, but slower than 0.20 now)

kali commented 1 year ago

Did you manage to solve the decoder problem? Or do you need me to have a look? In that case I'll need the onnx file and some example inputs again.

igor-yusupov commented 1 year ago

@kali encoder and decoder weights: https://drive.google.com/file/d/19D_XOJVblUCmCGuRJ8J1czrhm827e5hq/view?usp=share_link

example code:

use tract_ndarray::Array;
use tract_onnx::prelude::*;

fn main() {
    // Load both models, cancelling the exported output fact so that
    // tract re-infers the output shape, then optimize.
    let encoder = tract_onnx::onnx()
        .model_for_path("./encoder.onnx")
        .unwrap()
        .with_output_fact(0, InferenceFact::default())
        .unwrap()
        .into_optimized()
        .unwrap()
        .into_runnable()
        .unwrap();

    let decoder = tract_onnx::onnx()
        .model_for_path("./decoder.onnx")
        .unwrap()
        .with_output_fact(0, InferenceFact::default())
        .unwrap()
        .into_optimized()
        .unwrap()
        .into_runnable()
        .unwrap();

    // Token ids, shape [8, 1].
    let src: Tensor = Array::from_shape_vec(
        (8, 1),
        vec![1i64, 3385, 1826, 4, 7468, 3287, 2638, 2],
    )
    .unwrap()
    .into();

    // All-false source mask, shape [8, 8].
    let src_len = src.shape()[0];
    let src_mask: Tensor = Array::<bool, _>::from_elem((src_len, src_len), false).into();
    let inputs = tvec!(src.into(), src_mask.into());
    let memory = encoder.run(inputs).unwrap()[0]
        .to_array_view::<f32>()
        .unwrap()
        .to_owned();

    // Start-of-sequence token, shape [1, 1].
    let ys = Array::from_elem((1, 1), 1i64);

    // Causal target mask: true above the diagonal.
    let tgt_mask_size = ys.shape()[0];
    let tgt_mask =
        Array::<bool, _>::from_shape_fn((tgt_mask_size, tgt_mask_size), |(i, j)| i < j);

    let inputs_decoder = tvec!(
        ys.into_tensor().into(),
        memory.into_tensor().into(),
        tgt_mask.into_tensor().into()
    );
    let out = decoder.run(inputs_decoder).unwrap()[0]
        .to_array_view::<f32>()
        .unwrap()
        .to_owned();

    println!("{:?}", out);
}

You can see that versions 0.20.4 and 0.19.15 produce different outputs. With version 0.19.15 the output is correct.

kali commented 1 year ago

Thanks for taking the time. I will have a look.

kali commented 1 year ago

This is what the network expects:

TSAR 24/05 10:21 ~/dev/sonos/tract/issue-1088% cargo run -p tract -- decoder.onnx --onnx-ignore-output-shapes dump --io-long | grep Source -B 0 -C 1
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `/home/kali/dev/sonos/tract/target/debug/tract decoder.onnx --onnx-ignore-output-shapes dump --io-long`
┏ 0 Source ys
┃   * output fact #0: sequence,1,I64 >1/0 MODEL INPUT #0
--
┃┃┏ 22 Source tgt_mask
┃┃┃   * output fact #0: sequence,sequence,Bool >23/0 MODEL INPUT #2
--
┃┃┃┏ 65 Source memory
┃┃┃┃   * output fact #0: sequence,1,512,F32 >67/0 >74/0 >183/0 >190/0 >299/0 >306/0 MODEL INPUT #1

So input 0 is seq,1; input 1 is seq,1,512; input 2 is seq,seq.

And you give it:

[src/main.rs:54] t.shape() = [
    1,
    1,
]
[src/main.rs:54] t.shape() = [
    8,
    1,
    512,
]
[src/main.rs:54] t.shape() = [
    1,
    1,
]

The second input's first dim should be 1, not 8: all three inputs share the same sequence symbol, and ys (shape [1, 1]) already pins sequence to 1.

igor-yusupov commented 1 year ago

@kali I found the error; it was due to the names of the dynamic axes. Now it works blazingly fast.) Thanks a lot for your help and for your work!