k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Output not matching after exporting updated Zipformer model to Onnx #1154

Open bhaswa opened 1 year ago

bhaswa commented 1 year ago

Hi, I have trained the latest streaming Zipformer model on a custom dataset and exported it to ONNX. When I compare the output of the original .pth model with the ONNX model, there is an accuracy gap of 5% in the exported ONNX model.

csukuangfj commented 1 year ago

an accuracy gap of 5% in the exported ONNX model

Could you identify the wave files that cause inconsistent recognition results?

If yes, could you use one of them to compute the encoder output and compare whether the encoder output is the same for icefall and sherpa-onnx?
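
A minimal way to do that comparison (a sketch, not taken from this thread; the variable name encoder_out and the file names are assumptions, and where exactly to add the dump depends on your decoding scripts) is to save the encoder output from both pipelines and diff them offline:

```python
# In the icefall (PyTorch) decoding script, right after the encoder forward pass
# (for a streaming model you may need to concatenate the per-chunk outputs):
import numpy as np

np.save("encoder_out_torch.npy", encoder_out.detach().cpu().numpy())

# In ./zipformer/onnx_pretrained-streaming.py, right after the ONNX encoder runs
# (onnxruntime already returns numpy arrays):
np.save("encoder_out_onnx.npy", encoder_out)
```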

bhaswa commented 1 year ago

Btw, I calculated the accuracy of the ONNX model using ./zipformer/onnx_pretrained-streaming.py, not sherpa-onnx.

csukuangfj commented 1 year ago

Btw, I calculated the accuracy of the ONNX model using ./zipformer/onnx_pretrained-streaming.py, not sherpa-onnx.

That is also OK. It is much easier to get the encoder output with ./zipformer/onnx_pretrained-streaming.py.

bhaswa commented 1 year ago

@csukuangfj The output from the encoder layer does not match. I checked two audio files: for one, the recognition result is the same; for the other, it is different. In both cases the encoder output does not match.

bhaswa commented 1 year ago

@csukuangfj Any update on this?

csukuangfj commented 1 year ago

The output from the encoder layer does not match

How large is the difference? If the input is the same, the encoder output should also be the same within some numeric tolerance.

bhaswa commented 1 year ago

I double-checked the output. The outputs from the encoder layer are completely different. In fact, the dimensions do not match.

Dimension for pth: 1 x 16 x 256

Dimension for onnx: 1 x 16 x 512

csukuangfj commented 1 year ago

I double-checked the output. The outputs from the encoder layer are completely different. In fact, the dimensions do not match.

Dimension for pth: 1 x 16 x 256

Dimension for onnx: 1 x 16 x 512

Please apply the joiner.encoder_proj layer to the one whose dim is 512.

The ONNX version invokes joiner.encoder_proj automatically.

csukuangfj commented 1 year ago

I double-checked the output. The outputs from the encoder layer are completely different. In fact, the dimensions do not match.

Dimension for pth: 1 x 16 x 256

Dimension for onnx: 1 x 16 x 512

Please apply the joiner.encoder_proj layer to the PyTorch output.

The ONNX version invokes joiner.encoder_proj automatically.
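
For reference, a minimal sketch of that projection on the PyTorch side (assuming model is the icefall transducer loaded from the .pth checkpoint and encoder_out is the raw encoder output of shape (N, T, encoder_dim)):

```python
import torch

with torch.no_grad():
    # joiner.encoder_proj is a linear layer mapping encoder_dim to joiner_dim;
    # the exported ONNX encoder applies it internally, which is why its output
    # already has the larger last dimension.
    encoder_out_proj = model.joiner.encoder_proj(encoder_out)

print(encoder_out_proj.shape)  # e.g. (1, 16, 512), matching the ONNX output above
```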

bhaswa commented 1 year ago

After applying the joiner.encoder_proj layer to the encoder output, the dimensions now match, but the values are still different.

csukuangfj commented 1 year ago

but the values are still different.

How large is the difference? You can use (a - b).abs().max() to get the max difference.
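
Along those lines, a small comparison sketch (assuming the two .npy files dumped earlier, with the PyTorch side saved after joiner.encoder_proj; the file names are hypothetical):

```python
import numpy as np
import torch

a = torch.from_numpy(np.load("encoder_out_torch.npy"))  # PyTorch output, after joiner.encoder_proj
b = torch.from_numpy(np.load("encoder_out_onnx.npy"))   # ONNX encoder output

assert a.shape == b.shape, (a.shape, b.shape)
print("max abs diff :", (a - b).abs().max().item())
print("mean abs diff:", (a - b).abs().mean().item())
```

A max difference on the order of 1e-3 or smaller is usually just floating-point tolerance; anything much larger suggests a real mismatch (different input features, different chunking, or a missing projection).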

sanjuktasr commented 1 year ago
1. The number of times the encoder is called in .pth inference is different from ONNX inference. FYI, all the code used is the streaming version.

2. For a 0.5 sec audio, .pth calls the encoder 2 times, whereas in ONNX it is called only 1 time.
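
For what it's worth, the call count mostly follows from how many feature frames each pipeline feeds per encoder call; a back-of-the-envelope sketch (all numbers below are assumptions, not taken from this thread):

```python
import math

frame_shift_s = 0.01                         # assumed 10 ms fbank frame shift
num_frames = round(0.5 / frame_shift_s)      # ~50 frames for 0.5 s of audio

# If the PyTorch streaming loop consumes ~32 frames per call while the ONNX
# wrapper buffers a larger chunk (e.g. ~64 frames including look-ahead),
# the same audio yields different call counts:
print(math.ceil(num_frames / 32))  # -> 2 calls
print(math.ceil(num_frames / 64))  # -> 1 call
```

So a difference in chunk size or look-ahead handling between the two inference paths, rather than the export itself, could explain the different call counts.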