k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker diarization, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0
3.68k stars · 427 forks

Decoding method 'modified_beam_search' gives letters/words on silence, while 'greedy_search' works well #845

Open · ChrystianKacki opened this issue 6 months ago

ChrystianKacki commented 6 months ago

When I set the decoding method to 'modified_beam_search', it periodically returns letters/words on silence after a short time. But with the default 'greedy_search' decoding method everything works well. As an example, I used the Python script for real-time speech recognition from a microphone with endpoint detection from here. I checked, and the same thing also happens with the Java API with the C++ dynamic libraries loaded. The model used for recognition was created by me using a modified icefall Common Voice recipe (with the MLS and VoxPopuli datasets, 491 min of audio) on the icefall docker image (torch2.2.2-cuda12.1), and it has a WER of 4.41%.
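For intuition about how the two methods can diverge on silence, here is a toy, self-contained sketch. It is not sherpa-onnx code: it is a simplified CTC-style frame-synchronous search with path merging, whereas the real modified_beam_search operates on transducer hypotheses. It illustrates one plausible mechanism: when blank is the per-frame argmax but not dominant, summing the probabilities of all paths that yield the same token sequence can let a non-blank hypothesis overtake the all-blank one, while greedy search stays silent.

```python
import math

# Toy per-frame posteriors over tokens {0: blank, 1: 'a', 2: 'b'} on
# near-silence audio: blank is the argmax on every frame, but not by much.
frames = [
    {0: 0.40, 1: 0.35, 2: 0.25},
    {0: 0.40, 1: 0.35, 2: 0.25},
    {0: 0.40, 1: 0.35, 2: 0.25},
]

def greedy(frames):
    # Greedy search: per frame, take the argmax token; emit non-blanks only.
    out = []
    for p in frames:
        tok = max(p, key=p.get)
        if tok != 0:
            out.append(tok)
    return out

def beam(frames, size=4):
    # Simplified frame-synchronous beam search with path merging:
    # probabilities of all paths yielding the same token sequence are
    # summed, and the `size` best partial hypotheses are kept.
    hyps = {(): 1.0}  # token sequence -> total probability
    for p in frames:
        new = {}
        for seq, score in hyps.items():
            for tok, prob in p.items():
                nseq = seq if tok == 0 else seq + (tok,)
                new[nseq] = new.get(nseq, 0.0) + score * prob
        hyps = dict(sorted(new.items(), key=lambda kv: -kv[1])[:size])
    return list(max(hyps, key=hyps.get))

print(greedy(frames))  # -> [] : blank wins every frame
print(beam(frames))    # -> [1]: merged paths containing 'a' outweigh all-blank
```

On these frames the all-blank path has probability 0.4³ ≈ 0.064, while the merged paths emitting a single 'a' sum to 0.168, so the beam search "hears" a token where greedy hears silence.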

csukuangfj commented 6 months ago

Have you tested your exported model in icefall?

Can you check that --context-size is the same in training and exporting?

ChrystianKacki commented 6 months ago

> Have you tested your exported model in icefall?

How can I test my exported model in icefall? So far I have only used the decode.py script from the icefall CV recipe to get the WER.

> Can you check that --context-size is the same in training and exporting?

I checked, and --context-size is the same in training and exporting; it has the default value of 2.
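For background on why that flag must match: the stateless transducer decoder in these recipes conditions only on the last --context-size emitted tokens, so a train/export mismatch changes what the decoder network sees. A hypothetical helper (illustration only, not icefall code) showing the decoder's input window:

```python
BLANK = 0  # token ID 0 is the blank symbol in these recipes

def decoder_input(history, context_size):
    """Return the token IDs fed to the stateless transducer decoder.

    Hypothetical helper: the decoder conditions only on the last
    `context_size` emitted tokens, left-padded with blanks at the start
    of decoding. The exported decoder's embedding expects exactly this
    many IDs, which is why training and export must agree on the value.
    """
    padded = [BLANK] * context_size + list(history)
    return padded[-context_size:]

print(decoder_input([], 2))         # -> [0, 0]
print(decoder_input([7, 8, 9], 2))  # -> [8, 9]
print(decoder_input([7, 8, 9], 1))  # -> [9]
```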

csukuangfj commented 6 months ago

> So far I have only used the decode.py script from the icefall CV recipe to get the WER.

Have you tried modified beam search with decode.py?

Please post the commands you use for training, decoding, and exporting.

ChrystianKacki commented 6 months ago

> Have you tried modified beam search with decode.py?

Yes, I used --decoding-method modified_beam_search with decode.py.

> Please post the commands you use for training, decoding, and exporting.

Training: python3 scripts/train.py --world-size 8 --num-epochs 100 --start-epoch 1 --use-fp16 true --max-duration 550 --enable-musan true --use-validated-set true --bpe-model $data_dir/lang_bpe_500/bpe.model --manifest-dir $data_dir/fbank --exp-dir $base_dir

Decoding: python3 scripts/decode.py --epoch 100 --avg 1 --max-duration 550 --decode-chunk-len 32 --decoding-method modified_beam_search --use-averaged-model false --bpe-model $lang_dir/bpe.model --lang-dir $lang_dir --manifest-dir $data_dir/fbank --exp-dir $base_dir

Exporting: python3 scripts/export-onnx.py --epoch 100 --avg 1 --use-averaged-model false --tokens $data_dir/lang_bpe_500/tokens.txt --exp-dir $base_dir

csukuangfj commented 6 months ago

What is scripts? Which model are you using? What changes have you made to icefall?

ChrystianKacki commented 6 months ago

scripts is my local folder, which contains all the files from the newest icefall Common Voice streaming zipformer transducer recipe: https://github.com/k2-fsa/icefall/tree/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming

The only change I made to icefall is adding the MLS and VoxPopuli datasets to the CV preparation script prepare.sh, which is from: https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/prepare.sh The MLS recipe is in https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR and the VoxPopuli recipe is in https://github.com/k2-fsa/icefall/tree/master/egs/voxpopuli/ASR

csukuangfj commented 6 months ago

Could you test your model with https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming/onnx_pretrained.py and https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming/jit_trace_pretrained.py and see if it works.
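A sketch of what such a test invocation might look like. The flag names below are assumptions and may differ across icefall versions (check ./onnx_pretrained.py --help to confirm), and the model/wave paths are placeholders for the files exported above:

```shell
# Hypothetical invocation of the onnx_pretrained.py sanity check;
# flag names and file names are assumptions, not verified against
# this exact icefall revision.
python3 ./pruned_transducer_stateless7_streaming/onnx_pretrained.py \
  --encoder-model-filename $base_dir/encoder-epoch-50-avg-1.onnx \
  --decoder-model-filename $base_dir/decoder-epoch-50-avg-1.onnx \
  --joiner-model-filename $base_dir/joiner-epoch-50-avg-1.onnx \
  --tokens $data_dir/lang_bpe_500/tokens.txt \
  /path/to/test.wav
```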

ChrystianKacki commented 6 months ago

> Could you test your model with https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming/onnx_pretrained.py

I tested it; it works perfectly, and the recognized text exactly matches the original.

> and https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming/jit_trace_pretrained.py

I tested it too; it also works perfectly, and the recognized text exactly matches the original.

csukuangfj commented 6 months ago

> Have you tried modified beam search with decode.py?
>
> Yes, I used --decoding-method modified_beam_search with decode.py.
>
> Please post the commands you use for training, decoding, and exporting.
>
> Training: python3 scripts/train.py --world-size 8 --num-epochs 100 --start-epoch 1 --use-fp16 true --max-duration 550 --enable-musan true --use-validated-set true --bpe-model $data_dir/lang_bpe_500/bpe.model --manifest-dir $data_dir/fbank --exp-dir $base_dir
>
> Decoding: python3 scripts/decode.py --epoch 100 --avg 1 --max-duration 550 --decode-chunk-len 32 --decoding-method modified_beam_search --use-averaged-model false --bpe-model $lang_dir/bpe.model --lang-dir $lang_dir --manifest-dir $data_dir/fbank --exp-dir $base_dir
>
> Exporting: python3 scripts/export-onnx.py --epoch 100 --avg 1 --use-averaged-model false --tokens $data_dir/lang_bpe_500/tokens.txt --exp-dir $base_dir

Could you also share the logs for the above 3 commands? (You can find them from the terminal output. Please post the first few lines of them where configuration arguments can be found.)

ChrystianKacki commented 6 months ago

> Could you also share the logs for the above 3 commands? (You can find them from the terminal output. Please post the first few lines of them where configuration arguments can be found.)

Logs: train_log.txt, decode_log.txt, export-onnx_log.txt. Training was run with --world-size 8, so I include only the cuda:0 log. Also, --num-epochs in train.py is 50, and --epoch in decode.py and export-onnx.py is also 50, not 100 as I posted before, because I noticed that I trained twice: from epoch 1 to 50 and then from 51 to 100.

csukuangfj commented 6 months ago

By the way, are you using the latest icefall and the latest sherpa-onnx?

ChrystianKacki commented 6 months ago

Yes, I used the docker image (torch2.2.2-cuda12.1) with icefall, and after training I tested it with sherpa-onnx built from the latest GitHub source.

ChrystianKacki commented 6 months ago

@csukuangfj Hello! Could you help me with this issue? I shared the logs you asked for in the post above. Thanks in advance.

csukuangfj commented 6 months ago

I don't see anything abnormal in your logs.

Sorry, I have no idea why greedy search works but modified_beam_search does not.

(Could you share your model files so that we can reproduce it and debug it locally?)

ChrystianKacki commented 6 months ago

> I don't see anything abnormal in your logs. Sorry, I have no idea why greedy search works but modified_beam_search does not.

Ah, I see. It's good that the logs look OK.

> Could you share your model files so that we can reproduce it and debug it locally?

Which model files should I share? Do you mean the exported encoder, decoder, and joiner with the .onnx extension, plus tokens.txt?

csukuangfj commented 6 months ago

> Do you mean the exported encoder, decoder, and joiner with the .onnx extension, plus tokens.txt?

Yes. Please also share a test wave file.

ChrystianKacki commented 6 months ago

> Could you share your model files so that we can reproduce it and debug it locally? Please also share a test wave file.

Please see my shared folder with model and test wave files: link here