k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

Issue in first word in zipformer2 #348

Open · bhaswa opened this issue 11 months ago

bhaswa commented 11 months ago

I found that in many cases the first word of an audio file is not decoded properly. But when I use the .pth model in icefall instead of the .onnx model, the word is decoded correctly.

csukuangfj commented 11 months ago

Are you using the latest icefall to export the model, and the latest sherpa-onnx for testing?

bhaswa commented 11 months ago

Yes. I updated both icefall and sherpa-onnx. Still facing the same issue.

bhaswa commented 11 months ago

Any update on this?

csukuangfj commented 11 months ago

Are you able to share the test wave file?

bhaswa commented 11 months ago

I observed this in a model that I trained on custom data. The same audio might not behave the same way with another model (one trained on a different set of data).

bhaswa commented 11 months ago

Is there any pre-trained model (.pth) available? I will share a test wave file after testing it on that model.

csukuangfj commented 11 months ago

> Is there any pre-trained model (.pth) available? I will share a test wave file after testing it on that model.

Yes, please find the models in the RESULTS.md of each recipe in icefall, e.g., librispeech. For each experiment, there is a link to the huggingface repo containing pre-trained models.

bhaswa commented 11 months ago

I am attaching two audio files here.

audios.zip

1.wav
- pth output (from icefall): GOD'S THE OLD SCHOOL STROKE OUT
- onnx output (from icefall): GUIDABLE SCHOOL STROKE OUT
- onnx output (from sherpa-onnx): GO AS TO WALK SCHOOL STROKE OUT

(None of the outputs match.)

2.wav
- pth output (from icefall): SYSTEM MENU OR PIANGARA AT YOU WITH THEIR KEY
- onnx output (from icefall): SYSTEM MENU OR PIANGARA AT YOU WITH THEIR KEY
- onnx output (from sherpa-onnx): SISTER MENU OR PEN GIRL AT YOU WHERE THEIR KEY

(The pth and onnx outputs match in icefall, but the sherpa-onnx output is different.)

I used the streaming zipformer model (zipformer + pruned stateless transducer) from huggingface [https://huggingface.co/Zengwei/icefall-asr-librispeech-streaming-zipformer-2023-05-17/tree/main/exp] for testing these audios.

The command to convert the model to onnx is:

python3 ./zipformer/export-onnx-streaming.py \
  --exp-dir ./zipformer/exp \
  --tokens data/lang_bpe_500/tokens.txt \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --epoch 30 \
  --avg 1 \
  --use-averaged-model False
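For reference, the exported files can then be decoded with sherpa-onnx directly. The sketch below is illustrative only, not taken from the thread: it assumes the sherpa-onnx Python API (OnlineRecognizer.from_transducer exists in recent releases; older releases construct OnlineRecognizer directly) and the default file names produced by export-onnx-streaming.py, and all paths are placeholders.

```python
# Sketch only: decode one of the attached wavs with the exported streaming
# zipformer via the sherpa-onnx Python API. All file paths are placeholders.
import wave

import numpy as np
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="data/lang_bpe_500/tokens.txt",
    encoder="zipformer/exp/encoder-epoch-30-avg-1-chunk-16-left-128.onnx",
    decoder="zipformer/exp/decoder-epoch-30-avg-1-chunk-16-left-128.onnx",
    joiner="zipformer/exp/joiner-epoch-30-avg-1-chunk-16-left-128.onnx",
    num_threads=1,
    sample_rate=16000,
    feature_dim=80,
    decoding_method="greedy_search",
)

with wave.open("1.wav", "rb") as f:
    assert f.getframerate() == 16000, f.getframerate()
    assert f.getnchannels() == 1, f.getnchannels()
    assert f.getsampwidth() == 2, f.getsampwidth()  # 16-bit PCM
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0

stream = recognizer.create_stream()
stream.accept_waveform(16000, samples)
# Append some trailing silence so the final frames are flushed out.
stream.accept_waveform(16000, np.zeros(int(0.5 * 16000), dtype=np.float32))
stream.input_finished()

while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)

print(recognizer.get_result(stream))
```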

bhaswa commented 11 months ago

Any update on this issue?

kamirdin commented 9 months ago

Could the zero key/value cache in the encoder's initial states be causing this discrepancy between training and decoding?
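One way to sanity-check this hypothesis would be to list the state inputs of the exported streaming encoder with onnxruntime and see what the very first chunk actually receives. This is only a sketch of the idea, not a confirmed diagnosis; the encoder file name is a placeholder.

```python
# Sketch: list the cache/state inputs of the exported streaming encoder.
# For the very first chunk these states are freshly initialized (typically
# zeros), so if training never sees an all-zero left context, the first
# chunk, and hence the first word, may decode differently from the offline
# .pth decoding.
import onnxruntime as ort

sess = ort.InferenceSession(
    "zipformer/exp/encoder-epoch-30-avg-1-chunk-16-left-128.onnx",  # placeholder path
    providers=["CPUExecutionProvider"],
)

for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)
```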