k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker diarization, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, WebSocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, and Rust.
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

Availability of different beam search as icefall #465

Open bhaswa opened 11 months ago

bhaswa commented 11 months ago

Hi,

In icefall, there are multiple decoding methods available, e.g. greedy_search, beam_search, modified_beam_search, fast_beam_search, and fast_beam_search_nbest. There are also decoding methods that use an LM (modified_beam_search_lm_shallow_fusion, modified_beam_search_LODR, modified_beam_search_lm_rescore, modified_beam_search_lm_rescore_LODR). But in sherpa-onnx, there are only two valid decoding methods (greedy_search and modified_beam_search). Can we use the other decoding methods in sherpa-onnx, the same as in icefall?
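For readers unfamiliar with the terms, here is a toy sketch (made-up per-frame scores; not sherpa-onnx code) of how greedy search and beam search differ mechanically. Note the two agree here because the toy scores are independent per frame; they diverge once a history-dependent score (a transducer decoder state, or an external LM) enters.

```python
import math

# Hypothetical per-frame log-probs over a 3-token vocabulary.
frames = [
    [math.log(0.5), math.log(0.3), math.log(0.2)],
    [math.log(0.1), math.log(0.6), math.log(0.3)],
    [math.log(0.3), math.log(0.3), math.log(0.4)],
]

def greedy_search(frames):
    """Commit to the single best token at every frame."""
    return [max(range(len(f)), key=f.__getitem__) for f in frames]

def beam_search(frames, beam=2):
    """Keep the `beam` best partial hypotheses at every frame."""
    hyps = [((), 0.0)]  # (token sequence, accumulated log-prob)
    for f in frames:
        expanded = [
            (seq + (tok,), score + lp)
            for seq, score in hyps
            for tok, lp in enumerate(f)
        ]
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam]
    return list(hyps[0][0])

print(greedy_search(frames))  # [0, 1, 2]
print(beam_search(frames))    # [0, 1, 2]
```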

csukuangfj commented 11 months ago

I am afraid you cannot. We have implemented only greedy_search and modified_beam_search for transducer models.

fast_beam_search requires k2, but sherpa-onnx does not depend on k2.

bhaswa commented 11 months ago

Does that mean LODR, rescoring, and shallow fusion with an LM also cannot be used in sherpa-onnx?

csukuangfj commented 11 months ago

No, you can use RNN LM rescoring with sherpa-onnx.

Please search for the PR that added RNN LM rescoring to sherpa-onnx. There are usage examples in the comments of that PR.

bhaswa commented 11 months ago

So by default, if I use --lm and --decoding-method=modified_beam_search, will it perform LM rescoring?

csukuangfj commented 11 months ago

You need to pass the RNN LM model.

bhaswa commented 11 months ago

Yes, the RNN LM model needs to be provided.

bhaswa commented 11 months ago

https://github.com/k2-fsa/sherpa-onnx/pull/353

From the above pull request, it seems that shallow fusion is also implemented. Can you provide the usage for it?

csukuangfj commented 11 months ago

https://github.com/k2-fsa/sherpa-onnx/pull/147

Please search for shallow fusion in the related PR. You can find usages in the comments.

bhaswa commented 11 months ago

https://github.com/k2-fsa/sherpa-onnx/pull/125 From the above PR I found the usage for LM rescore as below:

    ./build/bin/sherpa-onnx-offline \
      --tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
      --encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx \
      --decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
      --joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx \
      --lm-scale=0.5 \
      --num-threads=2 \
      --decoding-method=modified_beam_search \
      --max-active-paths=4 \
      ./2414-159411-0024.wav

https://github.com/k2-fsa/sherpa-onnx/pull/147 From this PR I found the usage of shallow fusion as below:

    ./bin/sherpa-onnx \
      exp/data/lang_char_bpe/tokens.txt \
      exp/exp/encoder-epoch-99-avg-1.onnx \
      exp/exp/decoder-epoch-99-avg-1.onnx \
      exp/exp/joiner-epoch-99-avg-1.onnx \
      exp/test_wavs/BAC009S0764W0164.wav \
      2 \
      modified_beam_search \
      exp/exp/with-state-epoch-999-avg-1.onnx

From the above two commands, the only difference I found is the executable. I could not find any difference in the arguments that would distinguish rescoring from shallow fusion.

If I want to use the Python API, how can I differentiate between rescoring and shallow fusion?

csukuangfj commented 11 months ago

> From the above PR I found the usage for LM rescore as below:

Please take a look at the usage in the PR comment. You have found the wrong place in the PR.

csukuangfj commented 11 months ago

[Screenshot attachment: "Screenshot 2023-12-05 at 18 45 12"]

bhaswa commented 11 months ago

My bad. I copied the wrong segment.

But I still cannot find any difference in the arguments between https://github.com/k2-fsa/sherpa-onnx/pull/125 (LM rescoring) and https://github.com/k2-fsa/sherpa-onnx/pull/147 (shallow fusion).

I want to use the Python API. How can I differentiate between rescoring and shallow fusion?

csukuangfj commented 11 months ago

> between rescoring and shallow fusion

Could you explain the difference between rescoring and shallow fusion?

bhaswa commented 11 months ago

In icefall, we can use an LM with rescoring and with shallow fusion.

The command for shallow fusion is:

    ./pruned_transducer_stateless7_streaming/decode.py \
      --epoch 99 \
      --avg 1 \
      --use-averaged-model False \
      --beam-size 4 \
      --exp-dir $exp_dir \
      --max-duration 600 \
      --decode-chunk-len 32 \
      --decoding-method modified_beam_search_lm_shallow_fusion \
      --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
      --use-shallow-fusion 1 \
      --lm-type rnn \
      --lm-exp-dir $lm_dir \
      --lm-epoch 99 \
      --lm-scale $lm_scale \
      --lm-avg 1 \
      --rnn-lm-embedding-dim 2048 \
      --rnn-lm-hidden-dim 2048 \
      --rnn-lm-num-layers 3 \
      --lm-vocab-size 500

The command for rescoring is:

    ./pruned_transducer_stateless7_streaming/decode.py \
      --epoch 99 \
      --avg 1 \
      --use-averaged-model False \
      --beam-size 4 \
      --exp-dir $exp_dir \
      --max-duration 600 \
      --decode-chunk-len 32 \
      --decoding-method modified_beam_search_lm_rescore \
      --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
      --use-shallow-fusion 0 \
      --lm-type rnn \
      --lm-exp-dir $lm_dir \
      --lm-epoch 99 \
      --lm-scale $lm_scale \
      --lm-avg 1 \
      --rnn-lm-embedding-dim 2048 \
      --rnn-lm-hidden-dim 2048 \
      --rnn-lm-num-layers 3 \
      --lm-vocab-size 500

In sherpa-onnx, how can I use an LM with these two different settings? Also, with the commands given in the sherpa-onnx pull requests (https://github.com/k2-fsa/sherpa-onnx/pull/125 and https://github.com/k2-fsa/sherpa-onnx/pull/147), will the LM run with rescoring or with shallow fusion?
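For reference, the conceptual difference the thread is asking about can be sketched with toy numbers (hypothetical scores and a made-up LM; not sherpa-onnx or icefall code): shallow fusion mixes the LM score into every beam-search step, so the LM influences which hypotheses survive pruning, while rescoring applies the LM only to the finished n-best list.

```python
# Per-step acoustic log-probs over a two-token vocabulary {"a", "b"}.
am = [
    {"a": -0.2, "b": -0.3},
    {"a": -1.5, "b": -0.1},
]

# A made-up LM that strongly prefers the token "b".
def lm_logp(history, token):
    return -0.1 if token == "b" else -2.0

def decode(am, beam=1, lm_scale=0.0):
    """Beam search; lm_scale > 0 enables shallow fusion (LM added per step)."""
    hyps = [((), 0.0)]  # (token sequence, accumulated score)
    for step in am:
        expanded = [
            (seq + (tok,), score + lp + lm_scale * lm_logp(seq, tok))
            for seq, score in hyps
            for tok, lp in step.items()
        ]
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam]
    return hyps

def rescore(hyps, lm_scale):
    """Rescoring: apply the LM only to finished hypotheses, then re-rank."""
    def lm_total(seq):
        return sum(lm_logp(seq[:i], tok) for i, tok in enumerate(seq))
    return max(hyps, key=lambda h: h[1] + lm_scale * lm_total(h[0]))[0]

# With beam=1, acoustic-only search commits to ("a", "b") ...
print(decode(am, beam=1)[0][0])                   # ('a', 'b')
# ... and rescoring cannot change that: the alternative was already pruned.
print(rescore(decode(am, beam=1), lm_scale=1.0))  # ('a', 'b')
# Shallow fusion lets the LM steer the search itself.
print(decode(am, beam=1, lm_scale=1.0)[0][0])     # ('b', 'b')
```

The last two lines show why the two settings can give different results even with the same LM and scale: rescoring can only re-rank what the acoustic beam kept, whereas fusion changes the beam.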