ChrystianKacki opened this issue 6 months ago

When I set the decoding method to `modified_beam_search`, it periodically returns letters/words on silence after a short time. But when I use the default `greedy_search` decoding method, everything works well. As an example I used the Python script for real-time speech recognition from a microphone with endpoint detection from here. I checked, and the same thing also happens in the Java API with the C++ dynamic libraries loaded. The model used for recognition was created by me, using a modified icefall Common Voice recipe (with the MLS and VoxPopuli datasets added, 491 min of audio), on the icefall docker image (torch2.2.2-cuda12.1), and it has a WER of 4.41%.
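For reference, the invocation looked roughly like the sketch below, assuming the sherpa-onnx Python microphone example with endpoint detection; the model file names here are illustrative, not the exact paths used:

```
# sketch: streaming recognition from a microphone with endpoint detection
# (file names are placeholders for the exported model files)
python3 ./python-api-examples/speech-recognition-from-microphone-with-endpoint-detection.py \
  --tokens=./tokens.txt \
  --encoder=./encoder-epoch-50-avg-1.onnx \
  --decoder=./decoder-epoch-50-avg-1.onnx \
  --joiner=./joiner-epoch-50-avg-1.onnx \
  --decoding-method=modified_beam_search
```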
Have you tested your exported model in icefall?
Can you check that --context-size is the same in training and exporting?
> Have you tested your exported model in icefall?
How can I test my exported model in icefall? So far I have only used the decode.py script from the icefall CV recipe to get the WER.
> Can you check that --context-size is the same in training and exporting?
I checked, and `--context-size` is the same in training and exporting; it has the default value of 2.
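(For illustration, pinning the value explicitly in both stages would look like the sketch below; whether export-onnx.py exposes `--context-size` directly is an assumption based on the usual pattern in these recipes, and 2 is already the default:)

```
# sketch: pin --context-size explicitly in both stages (2 is the default)
python3 scripts/train.py --context-size 2 ...
python3 scripts/export-onnx.py --context-size 2 ...
```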
> So far I have only used the decode.py script from the icefall CV recipe to get the WER.
Have you tried modified beam search with decode.py?
Please post the commands you use for training, decoding, and exporting.
> Have you tried modified beam search with decode.py?
Yes, I used `--decoding-method modified_beam_search` with decode.py.
> Please post the commands you use for training, decoding, and exporting.
Training:
```
python3 scripts/train.py --world-size 8 --num-epochs 100 --start-epoch 1 --use-fp16 true --max-duration 550 --enable-musan true --use-validated-set true --bpe-model $data_dir/lang_bpe_500/bpe.model --manifest-dir $data_dir/fbank --exp-dir $base_dir
```
Decoding:
```
python3 scripts/decode.py --epoch 100 --avg 1 --max-duration 550 --decode-chunk-len 32 --decoding-method modified_beam_search --use-averaged-model false --bpe-model $lang_dir/bpe.model --lang-dir $lang_dir --manifest-dir $data_dir/fbank --exp-dir $base_dir
```
Exporting:
```
python3 scripts/export-onnx.py --epoch 100 --avg 1 --use-averaged-model false --tokens $data_dir/lang_bpe_500/tokens.txt --exp-dir $base_dir
```
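For what it's worth, with `--epoch 100 --avg 1` the export step typically writes files along these lines into `--exp-dir` (the exact names are an assumption based on the recipe's default naming convention):

```
$base_dir/encoder-epoch-100-avg-1.onnx
$base_dir/decoder-epoch-100-avg-1.onnx
$base_dir/joiner-epoch-100-avg-1.onnx
```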
What is `scripts`? Which model are you using?
What changes have you made to icefall?
`scripts` is my local folder which contains all the files from the newest icefall Common Voice streaming zipformer transducer recipe:
https://github.com/k2-fsa/icefall/tree/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming
The only change I made to icefall is adding the MLS and VoxPopuli datasets to the CV preparation script prepare.sh, which is from:
https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/prepare.sh
MLS is in https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR
VoxPopuli is in https://github.com/k2-fsa/icefall/tree/master/egs/voxpopuli/ASR
Could you test your model with https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming/onnx_pretrained.py and https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming/jit_trace_pretrained.py and see if it works.
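For example, along these lines (the flag names follow the usual icefall pretrained-script pattern, and the file names are assumptions based on the recipe's default export naming; adjust to your paths):

```
# sketch: decode a single wave file with the exported ONNX model
python3 scripts/onnx_pretrained.py \
  --encoder-model-filename $base_dir/encoder-epoch-100-avg-1.onnx \
  --decoder-model-filename $base_dir/decoder-epoch-100-avg-1.onnx \
  --joiner-model-filename $base_dir/joiner-epoch-100-avg-1.onnx \
  --tokens $data_dir/lang_bpe_500/tokens.txt \
  /path/to/test.wav
```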
> Could you test your model with https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming/onnx_pretrained.py

I tested it; it works perfectly, and the recognized text exactly matches the original.

> ... and https://github.com/k2-fsa/icefall/blob/master/egs/commonvoice/ASR/pruned_transducer_stateless7_streaming/jit_trace_pretrained.py

I tested it too; it works perfectly, and the recognized text exactly matches the original.
Could you also share the logs for the above 3 commands? (You can find them from the terminal output. Please post the first few lines of them where configuration arguments can be found.)
> Could you also share the logs for the above 3 commands? (You can find them from the terminal output. Please post the first few lines of them where configuration arguments can be found.)
Logs:
train_log.txt
decode_log.txt
export-onnx_log.txt
The training was run with `--world-size 8`, so I'm giving only the log for `cuda:0`. Also, `--num-epochs` in train.py is 50, and `--epoch` in decode.py and export-onnx.py is also 50, not 100 as I posted before, because I noticed that I trained in two runs: from epoch 1 to 50 and then from 51 to 100.
By the way, are you using the latest icefall and the latest sherpa-onnx?
Yes, I used the docker image (torch2.2.2-cuda12.1) with icefall, and after training I tested it with sherpa-onnx built from the latest GitHub source.
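For context, building from source follows the standard CMake flow described in the sherpa-onnx documentation; a sketch:

```
# sketch: build sherpa-onnx from the latest GitHub source
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j6
```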
@csukuangfj Hello! Could you help me with this issue? I shared the logs you asked for in the post above. Thanks in advance.
I don't see anything abnormal in your logs.
Sorry that I have no idea why greedy search works but modified_beam_search does not.
(Could you share your model files so that we can reproduce it and debug it locally?)
> I don't see anything abnormal in your logs. Sorry that I have no idea why greedy search works but modified_beam_search does not.
Aha, I see; it's great that the logs are OK.
> Could you share your model files so that we can reproduce it and debug it locally?
Which model files should I share? Do you mean the exported encoder, decoder, and joiner with the .onnx extension, plus tokens.txt?
> Do you mean the exported encoder, decoder, and joiner with the .onnx extension, plus tokens.txt?
Yes. Please also share a test wave file.
> Could you share your model files so that we can reproduce it and debug it locally? Please also share a test wave file.
Please see my shared folder with model and test wave files: link here