k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

export a non-stream onnx model from a streaming pytorch model #1576

Closed: 1215thebqtic closed this issue 1 month ago

1215thebqtic commented 1 month ago

Hi,

I'm trying to export a non-streaming ONNX model from a streaming PyTorch zipformer2 model. Training a non-streaming zipformer2 model from scratch takes a long time, so I decided to use "--chunk-size -1 --left-context-frames -1" to run the model as a non-streaming one.

The streaming model was trained using causal=1.

The script I used to export the non-streaming ONNX model from the streaming PyTorch model:

./zipformer/export-onnx.py \
  --tokens $tokenfile \
  --use-averaged-model 0 \
  --epoch 99 \
  --avg 1 \
  --exp-dir zipformer/exp_L_causal_context_2 \
  --num-encoder-layers "2,2,3,4,3,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,768,1024,1536,1024,768" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,256,384,512,384,256" \
  --query-head-dim 32 \
  --value-head-dim 12 \
  --pos-head-dim 4 \
  --pos-dim 48 \
  --encoder-unmasked-dim "192,192,256,256,256,192" \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --causal True \
  --chunk-size -1 \
  --left-context-frames -1

When I use the following command to decode the ONNX model:

./zipformer/onnx_pretrained.py \
  --encoder-model-filename $repo/encoder-epoch-99-avg-1.onnx \
  --decoder-model-filename $repo/decoder-epoch-99-avg-1.onnx \
  --joiner-model-filename $repo/joiner-epoch-99-avg-1.onnx \
  --tokens $tokenfile \
  icefall-asr-zipformer-streaming-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav

An error occurred (screenshot: broadcasting_error.PNG).

The failing node in Netron (screenshot: onnx_node.PNG).

According to Netron and the zipformer code, I think it's caused by the broadcasting at https://github.com/k2-fsa/icefall/blob/6cbddaa8e32ec5bc5c2fcc60a6d2409c7f5c7b11/egs/librispeech/ASR/zipformer/scaling.py#L671: x_chunk's shape is (batch_size, num_channels, chunk_size) and chunk_scale's shape is (num_channels, chunk_size). I noticed that streaming_forward has the same code (https://github.com/k2-fsa/icefall/blob/6cbddaa8e32ec5bc5c2fcc60a6d2409c7f5c7b11/egs/librispeech/ASR/zipformer/scaling.py#L730), but there are no errors when exporting the streaming ONNX model.
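
For illustration, here is a minimal standalone sketch (not icefall code; shapes taken from the description above) of why the broadcast is fine in eager PyTorch but can break in the exported graph:

import torch

batch_size, num_channels, chunk_size = 2, 384, 16
x_chunk = torch.randn(batch_size, num_channels, chunk_size)
chunk_scale = torch.rand(num_channels, chunk_size)

# In eager mode, (B, C, T) * (C, T) broadcasts for whatever T happens to be.
y = x_chunk * chunk_scale
assert y.shape == (batch_size, num_channels, chunk_size)

# torch.onnx.export traces the model with one concrete dummy input, so the
# chunk_size baked into the graph is fixed at export time; a test wave that
# yields a different number of frames makes the Mul node's inputs
# non-broadcastable at runtime, which matches the error above.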

I deleted this line of code, and the waves can then be decoded successfully. The WERs on my test dataset differ a little: 5.89 (PyTorch) versus 5.61 (ONNX). (PyTorch decoding script: ./zipformer/pretrained.py; ONNX decoding script: ./zipformer/onnx_pretrained.py.)

My questions are:

  1. Why does the broadcasting lead to ONNX errors in non-streaming mode, but not in the streaming ONNX model?
  2. How do I change this line of code to avoid the error and keep the WER the same as the PyTorch one?

Thanks!

JinZr commented 1 month ago

Hi,

I'm not too familiar with the ONNX export scripts, but I believe they aren't designed to directly export a streaming model as a non-streaming one; perhaps the conversion breaks some of the attention masks. Anyway, if you need a non-streaming model, you should train a non-streaming model in the first place.

Best,
Jin


1215thebqtic commented 1 month ago

Hi Jin, thanks for your reply! More than 100k hours of data were used to train the model, so training would take about 20 days with our limited GPUs. On the streaming model, the non-streaming decoding options (--chunk-size=-1, --left-context-frames=-1) give a 20%-30% relative WER improvement over the streaming decoding options (--chunk-size=32, --left-context-frames=128), so I decided to export the non-streaming model.

MicKot commented 1 month ago

Setting chunk-size=-1 and left-context-frames=-1 does not mean 'non-streaming'; it just means the model gets full context. The model is still 'streaming', i.e. the convolutions are still causal. By exporting the model with the non-streaming script you give up the ability to use the cache, which is the whole point of training a streaming model. From my testing, setting chunk-size=512 and left-context-frames=512 gives the best WER (which is not surprising, given that context is king) if WER is what you care about, while still keeping the ability to 'stream' (just not in real time); see the sketch below.
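
For example, a sketch of such an export, assuming the streaming export script that ships in the same recipe (./zipformer/export-onnx-streaming.py; flag names may vary across icefall versions), with the model-architecture flags copied from the export command at the top of the thread:

./zipformer/export-onnx-streaming.py \
  --tokens $tokenfile \
  --use-averaged-model 0 \
  --epoch 99 \
  --avg 1 \
  --exp-dir zipformer/exp_L_causal_context_2 \
  --num-encoder-layers "2,2,3,4,3,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,768,1024,1536,1024,768" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,256,384,512,384,256" \
  --query-head-dim 32 \
  --value-head-dim 12 \
  --pos-head-dim 4 \
  --pos-dim 48 \
  --encoder-unmasked-dim "192,192,256,256,256,192" \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --causal True \
  --chunk-size 512 \
  --left-context-frames 512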

csukuangfj commented 1 month ago

I'm looking into this.

1215thebqtic commented 1 month ago

Hello, I found that the error is caused by the if/else branch inside this function: https://github.com/k2-fsa/icefall/blob/6cbddaa8e32ec5bc5c2fcc60a6d2409c7f5c7b11/egs/librispeech/ASR/zipformer/scaling.py#L681

When exporting to ONNX, the dummy input is the one at https://github.com/k2-fsa/icefall/blob/6cbddaa8e32ec5bc5c2fcc60a6d2409c7f5c7b11/egs/librispeech/ASR/zipformer/export-onnx.py#L297, so tracing takes the if branch of that if/else. The waves I used to test the exported model are all a dozen or so seconds long, so the dimension-mismatch error is raised. If I change the dummy input to x = torch.zeros(1, 1000, 80, dtype=torch.float32), tracing takes the else branch instead; then there is no error and recognition works normally. But now I don't know how to merge or split this if/else so that both short and long waves work.
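
To illustrate the pitfall with a toy sketch (not icefall code): torch.jit.trace records only the branch taken by the dummy input, while torch.jit.script keeps the whole if/else:

import torch

def f(x: torch.Tensor) -> torch.Tensor:
    # data-dependent control flow, like the if/else linked above
    if x.shape[1] > 100:
        return x * 2.0
    else:
        return x + 1.0

x_short = torch.zeros(1, 10, 80, dtype=torch.float32)
x_long = torch.zeros(1, 1000, 80, dtype=torch.float32)

traced = torch.jit.trace(f, x_short)  # bakes in the branch taken by x_short
print(torch.allclose(traced(x_long), f(x_long)))    # False: wrong branch

scripted = torch.jit.script(f)        # preserves both branches
print(torch.allclose(scripted(x_long), f(x_long)))  # True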

csukuangfj commented 1 month ago

Replied in the Next-gen Kaldi WeChat group.

The fix is:

diff --git a/egs/librispeech/ASR/zipformer/scaling_converter.py b/egs/librispeech/ASR/zipformer/scaling_converter.py
index 76622fa1..346db55e 100644
--- a/egs/librispeech/ASR/zipformer/scaling_converter.py
+++ b/egs/librispeech/ASR/zipformer/scaling_converter.py
@@ -36,7 +36,7 @@ from scaling import (
     SwooshROnnx,
     Whiten,
 )
-from zipformer import CompactRelPositionalEncoding
+from zipformer import CompactRelPositionalEncoding, ChunkCausalDepthwiseConv1d

 # Copied from https://pytorch.org/docs/1.9.0/_modules/torch/nn/modules/module.html#Module.get_submodule  # noqa
@@ -93,6 +93,10 @@ def convert_scaled_to_non_scaled(
             # the input changes, so we have to use torch.jit.script()
             # to replace torch.jit.trace()
             d[name] = torch.jit.script(m)
+        elif is_onnx and isinstance(m, ChunkCausalDepthwiseConv1d):
+            # to export a zipformer model that is trained with --causal=1
+            # but exported with --chunk-size=-1 and --left-context-frames=-1
+            d[name] = torch.jit.script(m)

     for k, v in d.items():
         if "." in k: