sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

Transformer network fails to type #1088

Closed · igor-yusupov closed 1 year ago

igor-yusupov commented 1 year ago

I exported the transformer model https://pytorch.org/docs/1.13/generated/torch.nn.Transformer.html to ONNX and found that version 0.20 is very slow. Version 0.19 works acceptably, but the call to into_optimized fails on the Attention block.

igor-yusupov commented 1 year ago

Error when calling into_optimized:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Translating node #86 "_encoder_layers.0_self_attn_Add_2" Add ToTypedTranslator
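
For reference, a minimal sketch of the loading path that triggers this panic (the file name ./encoder.onnx is an assumption; the calls are tract's public API as used in the examples below):

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the ONNX graph, then optimize it. The panic above is raised
    // while into_optimized() translates the Add node into tract's typed
    // intermediate representation (an unwrap() turns the Err into a panic).
    let _encoder = tract_onnx::onnx()
        .model_for_path("./encoder.onnx")?
        .into_optimized()?
        .into_runnable()?;
    Ok(())
}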

kali commented 1 year ago

Please provide the .onnx file and some inputs as .npz files. I don't know how to export models myself; it would take me hours. See https://github.com/sonos/tract/blob/main/doc/cli-recipe.md#running-a-test-case

igor-yusupov commented 1 year ago

@kali I uploaded weights for encoder to google drive: https://drive.google.com/file/d/1H95-uZ4r9k11FRCOxIysJhq4pgp7B9Ud/view?usp=sharing

inference example code:

use tract_ndarray::Array;
use tract_onnx::prelude::*;

fn main() {
    // Load the ONNX encoder without optimizing it.
    let encoder = tract_onnx::onnx()
        .model_for_path("./encoder.onnx")
        .unwrap()
        .into_runnable()
        .unwrap();

    // Token ids, shape [9, 1].
    let src: Tensor = Array::from_shape_vec(
        (9, 1),
        vec![1i64, 34543, 49935, 1242, 775, 81, 1931, 49936, 2],
    )
    .unwrap()
    .into();

    // All-false attention mask, shape [9, 9].
    let src_len = src.shape()[0];
    let src_mask: Tensor = Array::<bool, _>::from_elem((src_len, src_len), false).into();

    let inputs = tvec!(src.into(), src_mask.into());
    let outputs = encoder.run(inputs).unwrap();
    println!("{:?}", outputs);
}

kali commented 1 year ago

I'm not sure the inputs in the code sample are valid. tract expects batch,sequence,I64 and batch,sequence,Bool, but the code sample makes inputs of shapes [9,1] and [9,9]. In non-optimized mode it looks like tract does not validate input shapes, so I think this works by some kind of accident. But anyway, that check happens after the optimisation, so this is not the source of the problem.
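
For illustration, inputs actually matching the declared batch,sequence layout would look like this (a sketch; batch = 1 and sequence = 9 are assumed values, and build_inputs is a hypothetical helper):

use tract_ndarray::Array2;
use tract_onnx::prelude::*;

// batch = 1, sequence = 9 are illustrative values, not taken from the model.
fn build_inputs() -> (Tensor, Tensor) {
    let src: Tensor = Array2::<i64>::zeros((1, 9)).into();
    let src_mask: Tensor = Array2::<bool>::from_elem((1, 9), false).into();
    (src, src_mask)
}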

@igor-yusupov do you know if onnxruntime is happy with this model and these inputs?

igor-yusupov commented 1 year ago

@kali yes, it's OK, because I set dynamic_axes to make it possible for the network to accept sequences of any length. I also checked the outputs against what the Python code gives, and everything matches. Do I understand correctly that the into_optimized method doesn't work because the input size is dynamic? I'm more concerned that the Rust code is slower, especially with version 0.20.

igor-yusupov commented 1 year ago

yes, I tried removing dynamic_axes and into_optimized now works)

kali commented 1 year ago

into_optimized should work, even with dynamic dimensions. Does the model work and give correct results? Even with a non-1 batch (and sequence) dimension?

igor-yusupov commented 1 year ago

@kali yes, if I change the sequence length it works correctly, but I didn't try changing the batch_size.

kali commented 1 year ago

Can you try changing the batch_size?

igor-yusupov commented 1 year ago

@kali yes, I set batch_size equal to 2 for src and everything works. I send src as a [9, 2] array and src_mask as a [9, 9] array.

kali commented 1 year ago

I think this network is somewhat broken. I double-checked with netron, but I can't make sense of all of it.

According to the metadata, both inputs should be of shape (batch,sequence): src is integer token ids, src_mask is bool. Double-checked here with netron:

[netron screenshots of the input metadata]

This leads to a collision in tract inference (after some improvement to error detection):

[screenshot: tract typing error]

netron and other tools do not see the problem because they do not compute on symbolic dimensions, so the dimensions are "erased". But after setting concrete dimensions (7 for batch, 9 for seq) with onnx-tool and running shape inference, the problem shows up in netron too:

[netron screenshot: shape inference with concrete dims flags the mismatch]

Broadcasting 72,7,7 and 1,7,9 is invalid (the first dim is fine because one of them is a 1; the last one is not). onnx-tool does it anyway, but it's wrong.
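
As a reminder of the rule in play, here is a minimal sketch of the ONNX/NumPy-style broadcasting check (an illustration, not tract's actual implementation): shapes are aligned on the right, and each pair of dims must be equal or contain a 1.

/// True if two shapes are broadcast-compatible under ONNX/NumPy rules.
/// Missing leading dims are implicitly 1, so only the overlap is checked.
fn broadcastable(a: &[usize], b: &[usize]) -> bool {
    a.iter()
        .rev()
        .zip(b.iter().rev())
        .all(|(&x, &y)| x == y || x == 1 || y == 1)
}

fn main() {
    // First dims 72 vs 1: fine, one of them is 1. Last dims 7 vs 9: invalid.
    assert!(!broadcastable(&[72, 7, 7], &[1, 7, 9]));
    assert!(broadcastable(&[72, 7, 7], &[1, 7, 1]));
}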

So is it "just" a metadata issue? Calling the torch export function with a wrong dynamic input specification? If I override the inputs in tract to batch,sequence,i64 and batch,batch,bool, I get the typing to "work", even with the dynamic dimensions.

But this does not make sense. Batch is usually a pure iterative extra dimension; it should not "mix" itself with the semantic dimensions like that. And having batch,batch as an input shape does not make much more sense either... So I suspect the problem runs deeper than a metadata issue.

igor-yusupov commented 1 year ago

@kali yeah, I made a mistake in the naming: I need to swap "batch" and "sentence". I have now built the model without specifying a dynamic axis for batch; it works faster, but the into_optimized method still does not work:

    0: Infering facts
    1: Applying rule outputs[0].shape == sequence,1,512
    2: Unifying shapes Add533_dim_0,Add533_dim_1,Add533_dim_2 and sequence,1,512
    3: Impossible to unify Sym(Add533_dim_0) with Sym(sequence).'

igor-yusupov commented 1 year ago

But my main question is: why does it run 10x slower than the Python code? I also checked running with the onnxruntime lib, and the Python code is faster. And the 0.20 version runs 10000x slower)

kali commented 1 year ago

As long as the network cannot be optimized, tract performance will be terrible; comparing it with any implementation is meaningless.

I think you just need to cancel the output_fact now (.with_output_fact(0, InferenceFact::default())).
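
In context, the call slots into the model-loading chain like this (a sketch; it replaces the exported output fact with an unconstrained one, so tract re-infers the output shape during typing):

let encoder = tract_onnx::onnx()
    .model_for_path("./encoder.onnx")
    .unwrap()
    // Cancel the exported output shape, which carries the conflicting
    // symbolic dimensions, and let tract infer it from scratch.
    .with_output_fact(0, InferenceFact::default())
    .unwrap()
    .into_optimized()
    .unwrap()
    .into_runnable()
    .unwrap();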

igor-yusupov commented 1 year ago

@kali thanks, that helped! But now I have an issue: the decoder produces a different (wrong) output😅. Everything is OK with the encoder. I can send you the decoder code if you want. Without into_optimized the decoder works correctly, but with it the output is different.

igor-yusupov commented 1 year ago

But version 0.19 works correctly: faster than before, but slower than 0.20 now)

kali commented 1 year ago

Did you manage to solve the decoder problem? Or do you need me to have a look? In that case I'll need the onnx file and some example inputs again.

igor-yusupov commented 1 year ago

@kali encoder and decoder weights: https://drive.google.com/file/d/19D_XOJVblUCmCGuRJ8J1czrhm827e5hq/view?usp=share_link

example code:

use tract_ndarray::Array;
use tract_onnx::prelude::*;

fn main() {
    // Load both models, cancelling the exported output fact so that
    // tract re-infers the output shape, then optimize.
    let encoder = tract_onnx::onnx()
        .model_for_path("./encoder.onnx")
        .unwrap()
        .with_output_fact(0, InferenceFact::default())
        .unwrap()
        .into_optimized()
        .unwrap()
        .into_runnable()
        .unwrap();

    let decoder = tract_onnx::onnx()
        .model_for_path("./decoder.onnx")
        .unwrap()
        .with_output_fact(0, InferenceFact::default())
        .unwrap()
        .into_optimized()
        .unwrap()
        .into_runnable()
        .unwrap();

    // Token ids, shape [8, 1].
    let src: Tensor = Array::from_shape_vec(
        (8, 1),
        vec![1i64, 3385, 1826, 4, 7468, 3287, 2638, 2],
    )
    .unwrap()
    .into();

    // All-false source mask, shape [8, 8].
    let src_len = src.shape()[0];
    let src_mask: Tensor = Array::<bool, _>::from_elem((src_len, src_len), false).into();
    let inputs = tvec!(src.into(), src_mask.into());
    let memory = encoder.run(inputs).unwrap()[0]
        .to_array_view::<f32>()
        .unwrap()
        .to_owned();

    // Start-of-sequence token, shape [1, 1].
    let ys = Array::from_elem((1, 1), 1i64);

    // Causal target mask: true above the diagonal.
    let tgt_mask_size = ys.shape()[0];
    let tgt_mask =
        Array::<bool, _>::from_shape_fn((tgt_mask_size, tgt_mask_size), |(i, j)| i < j);

    let inputs_decoder = tvec!(
        ys.into_tensor().into(),
        memory.into_tensor().into(),
        tgt_mask.into_tensor().into()
    );
    let out = decoder.run(inputs_decoder).unwrap()[0]
        .to_array_view::<f32>()
        .unwrap()
        .to_owned();

    println!("{:?}", out);
}

You can see that versions 0.20.4 and 0.19.15 produce different outputs. With version 0.19.15 the output is correct.

kali commented 1 year ago

Thanks for taking the time. I will have a look.

kali commented 1 year ago

This is what the network expects:

TSAR 24/05 10:21 ~/dev/sonos/tract/issue-1088% cargo run -p tract -- decoder.onnx --onnx-ignore-output-shapes dump --io-long | grep Source -B 0 -C 1
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `/home/kali/dev/sonos/tract/target/debug/tract decoder.onnx --onnx-ignore-output-shapes dump --io-long`
┏ 0 Source ys
┃   * output fact #0: sequence,1,I64 >1/0 MODEL INPUT #0
--
┃┃┏ 22 Source tgt_mask
┃┃┃   * output fact #0: sequence,sequence,Bool >23/0 MODEL INPUT #2
--
┃┃┃┏ 65 Source memory
┃┃┃┃   * output fact #0: sequence,1,512,F32 >67/0 >74/0 >183/0 >190/0 >299/0 >306/0 MODEL INPUT #1

So input 0 is seq,1; input 1 is seq,1,512; input 2 is seq,seq.

And you give it:

[src/main.rs:54] t.shape() = [
    1,
    1,
]
[src/main.rs:54] t.shape() = [
    8,
    1,
    512,
]
[src/main.rs:54] t.shape() = [
    1,
    1,
]

The second input's first dim should be 1, not 8: all three inputs share the same sequence symbol, and ys (shape [1, 1]) already pins sequence to 1.

igor-yusupov commented 1 year ago

@kali I found the error; it was due to the names of the dynamic axes. Now it works blazingly fast.) Thanks a lot for your help and for your work!