k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0
3.08k stars · 355 forks

Availability of different beam search methods as in icefall #465

Open bhaswa opened 9 months ago

bhaswa commented 9 months ago

Hi,

In icefall, there are multiple decoding methods available, e.g. greedy_search, beam_search, modified_beam_search, fast_beam_search, and fast_beam_search_nbest. There are also some decoding methods that use an LM (modified_beam_search_lm_shallow_fusion, modified_beam_search_LODR, modified_beam_search_lm_rescore, modified_beam_search_lm_rescore_LODR). But in sherpa-onnx, only two decoding methods are valid (greedy_search and modified_beam_search). Can we use the other icefall decoding methods in sherpa-onnx as well?

csukuangfj commented 9 months ago

I am afraid you cannot. We have implemented only greedy_search and modified_beam_search for transducer models.

fast_beam_search requires k2 but sherpa-onnx does not depend on k2.
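For readers unfamiliar with the two supported methods, the difference can be sketched in plain Python. Everything below (the tiny vocabulary, the made-up context-dependent scores, the beam size) is invented for illustration and is unrelated to the actual sherpa-onnx implementation:

```python
import math

VOCAB = 3

def step_log_probs(step, prev):
    """Toy context-dependent log-probs (invented numbers), chosen so that
    the greedy first choice leads to a worse continuation than the beam finds."""
    table = {
        # step 0 (no previous token): token 0 looks slightly better than token 1
        (0, None): [0.51, 0.49, 0.0],
        # step 1: after token 0 the model is unsure; after token 1 it is confident
        (1, 0): [0.4, 0.3, 0.3],
        (1, 1): [0.05, 0.05, 0.9],
        (1, 2): [1 / 3, 1 / 3, 1 / 3],
    }
    return [math.log(p) if p > 0 else float("-inf") for p in table[(step, prev)]]

def greedy_search(num_steps=2):
    """Pick the single best token at every step."""
    seq, prev = [], None
    for step in range(num_steps):
        lp = step_log_probs(step, prev)
        prev = max(range(VOCAB), key=lambda t: lp[t])
        seq.append(prev)
    return seq

def beam_search(num_steps=2, beam=2):
    """Keep the `beam` highest-scoring partial hypotheses at every step."""
    hyps = [((), 0.0)]  # (token sequence, total log-prob)
    for step in range(num_steps):
        expanded = []
        for seq, score in hyps:
            prev = seq[-1] if seq else None
            lp = step_log_probs(step, prev)
            for t in range(VOCAB):
                expanded.append((seq + (t,), score + lp[t]))
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam]
    return list(hyps[0][0])

print(greedy_search())  # commits to token 0 early and gets stuck: [0, 0]
print(beam_search())    # keeps token 1 alive and finds the better path: [1, 2]
```

The point of the toy: greedy search commits to one token per frame, while (modified) beam search keeps several partial hypotheses alive and can recover a globally better path.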

bhaswa commented 9 months ago

If an LM is used, does that mean LODR, rescoring, and shallow fusion also cannot be used in sherpa-onnx?

csukuangfj commented 9 months ago

No, you can use RNN LM rescoring with sherpa-onnx.

Please search for the PR that added RNN LM rescoring to sherpa-onnx. There are usage examples in the comments of that PR.
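For context, n-best rescoring conceptually looks like the toy sketch below. The hypotheses, scores, and the stand-in LM are all made up for illustration and have nothing to do with the real RNN LM used by sherpa-onnx:

```python
# Toy n-best rescoring: decode first, then let the LM re-rank the
# complete hypotheses. The transducer scores and the "LM" are invented.
nbest = [
    ("i saw the cat", -3.1),    # (hypothesis, transducer log-prob)
    ("eye saw the cat", -2.9),  # acoustically best, but unlikely English
    ("i saw a cat", -3.4),
]

def toy_lm_score(sentence):
    """Stand-in LM: rewards sentences that start with a common word."""
    return 0.0 if sentence.split()[0] in {"i", "the", "a"} else -2.0

def rescore(nbest, lm_scale=0.5):
    # Add the scaled LM score to each *complete* hypothesis, then re-rank.
    rescored = [
        (text, am_score + lm_scale * toy_lm_score(text))
        for text, am_score in nbest
    ]
    return max(rescored, key=lambda h: h[1])[0]

print(rescore(nbest))  # "i saw the cat" wins after rescoring
```

The key property: the LM never influences the beam search itself; it only re-ranks finished hypotheses afterwards.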

bhaswa commented 9 months ago

So by default, if I use --lm and --decoding-method=modified_beam_search, will it perform LM rescoring?

csukuangfj commented 9 months ago

You need to pass the RNN LM model.

bhaswa commented 9 months ago

Yes, the RNN LM needs to be provided.

bhaswa commented 9 months ago

https://github.com/k2-fsa/sherpa-onnx/pull/353

From the above pull request, it seems that shallow fusion is also implemented. Can you provide the usage for it?

csukuangfj commented 9 months ago

https://github.com/k2-fsa/sherpa-onnx/pull/147

Please search for shallow fusion in the related PR. You can find usage examples in the comments.
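For contrast with rescoring, shallow fusion mixes the LM score into the search itself, at every beam-expansion step rather than after decoding finishes. The sketch below is a toy illustration with invented numbers and a made-up two-token "LM", not the sherpa-onnx implementation:

```python
import math

# Per-step acoustic log-probs over a 2-token vocabulary (invented numbers).
am_log_probs = [
    [math.log(0.6), math.log(0.4)],
    [math.log(0.5), math.log(0.5)],
]

def toy_lm_log_prob(token, prev):
    """Stand-in LM: strongly prefers alternating tokens."""
    if prev is None:
        return math.log(0.5)
    return math.log(0.9) if token != prev else math.log(0.1)

def fused_beam_search(lm_scale=1.0, beam=2):
    """Beam search where each candidate token's acoustic score is fused
    with a scaled LM score at every step (shallow fusion)."""
    hyps = [((), 0.0)]  # (token sequence, total fused log-prob)
    for step_lp in am_log_probs:
        expanded = []
        for seq, score in hyps:
            prev = seq[-1] if seq else None
            for t, am in enumerate(step_lp):
                fused = am + lm_scale * toy_lm_log_prob(t, prev)
                expanded.append((seq + (t,), score + fused))
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam]
    return list(hyps[0][0])

print(fused_beam_search())  # the LM steers the search toward alternation: [0, 1]
```

Because the LM score is applied inside the search, it can steer which hypotheses survive the beam, whereas rescoring can only re-rank whatever the acoustic-only search happened to keep.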

bhaswa commented 9 months ago

https://github.com/k2-fsa/sherpa-onnx/pull/125 From the above PR I found the usage for LM rescore below:

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx \
  --lm-scale=0.5 \
  --num-threads=2 \
  --decoding-method=modified_beam_search \
  --max-active-paths=4 \
  ./2414-159411-0024.wav

https://github.com/k2-fsa/sherpa-onnx/pull/147 From this PR I found the usage of shallow fusion below:

./bin/sherpa-onnx \
  exp/data/lang_char_bpe/tokens.txt \
  exp/exp/encoder-epoch-99-avg-1.onnx \
  exp/exp/decoder-epoch-99-avg-1.onnx \
  exp/exp/joiner-epoch-99-avg-1.onnx \
  exp/test_wavs/BAC009S0764W0164.wav \
  2 \
  modified_beam_search \
  exp/exp/with-state-epoch-999-avg-1.onnx

Comparing the two commands, the only difference I found is the executable. I could not find any difference in the arguments that would differentiate rescoring from shallow fusion.

If I want to use the Python API, how can I choose between rescoring and shallow fusion?

csukuangfj commented 9 months ago

> From the above PR I found the usage for LM rescore as below:

Please take a look at the usage in the PR comment. You copied from the wrong place in the PR.

csukuangfj commented 9 months ago

[Screenshot of the PR comment showing the LM rescore usage, 2023-12-05]

bhaswa commented 9 months ago

My bad. I copied the wrong segment.

But I still cannot find any difference in the arguments between https://github.com/k2-fsa/sherpa-onnx/pull/125 (LM rescore) and https://github.com/k2-fsa/sherpa-onnx/pull/147 (shallow fusion).

I want to use the Python API. How can I differentiate between rescoring and shallow fusion?

csukuangfj commented 9 months ago

> between rescoring and shallow fusion

Could you explain the difference between rescoring and shallow fusion?

bhaswa commented 9 months ago

In Icefall, we can use LM with rescoring and shallow fusion.

The command for shallow fusion is:

./pruned_transducer_stateless7_streaming/decode.py \
  --epoch 99 \
  --avg 1 \
  --use-averaged-model False \
  --beam-size 4 \
  --exp-dir $exp_dir \
  --max-duration 600 \
  --decode-chunk-len 32 \
  --decoding-method modified_beam_search_lm_shallow_fusion \
  --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
  --use-shallow-fusion 1 \
  --lm-type rnn \
  --lm-exp-dir $lm_dir \
  --lm-epoch 99 \
  --lm-scale $lm_scale \
  --lm-avg 1 \
  --rnn-lm-embedding-dim 2048 \
  --rnn-lm-hidden-dim 2048 \
  --rnn-lm-num-layers 3 \
  --lm-vocab-size 500

The command for rescoring is:

./pruned_transducer_stateless7_streaming/decode.py \
  --epoch 99 \
  --avg 1 \
  --use-averaged-model False \
  --beam-size 4 \
  --exp-dir $exp_dir \
  --max-duration 600 \
  --decode-chunk-len 32 \
  --decoding-method modified_beam_search_lm_rescore \
  --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
  --use-shallow-fusion 0 \
  --lm-type rnn \
  --lm-exp-dir $lm_dir \
  --lm-epoch 99 \
  --lm-scale $lm_scale \
  --lm-avg 1 \
  --rnn-lm-embedding-dim 2048 \
  --rnn-lm-hidden-dim 2048 \
  --rnn-lm-num-layers 3 \
  --lm-vocab-size 500

In sherpa-onnx, how can I use the LM with these two different settings? Also, with the commands given in the sherpa-onnx pull requests (https://github.com/k2-fsa/sherpa-onnx/pull/125 and https://github.com/k2-fsa/sherpa-onnx/pull/147), will the LM run with rescoring or with shallow fusion?