k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Server arguments for Streaming Decoding Pruned_Transducer_Stateless5 #179

Closed · teowenshen closed this issue 2 years ago

teowenshen commented 2 years ago

I have trained (another) pruned_transducer_stateless5 model on a transcript without disfluencies, and the results in offline decoding (decode.py of Icefall) are better. I then moved the model to Sherpa to try streaming decoding of the same wav audio file, but the results are very different.

In Sherpa, I initially used the default decode-chunk-size=8 and decode-left-context=32. This yielded results comparable to decode.py of Icefall. I guess the remaining differences come from the segment boundaries being different, so the model convolves over different feature vectors.

Then, I set decode-left-context=64 and decode-chunk-size=16 to match train.py and decode.py. This time, the results were very bad, with a lot of deletions.

Are the left-context and chunk-size meant to be different in Sherpa? Also, what should the Sherpa server's decode-right-context argument be set to?
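
(For scale: assuming the usual icefall Conformer setup of a 4x subsampling factor and a 10 ms frame shift, the settings above translate to audio durations as in this back-of-envelope sketch. Both constants are assumptions, not values confirmed in this thread.)

# Back-of-envelope conversion of decode-chunk-size / decode-left-context
# into audio duration. Assumes a subsampling factor of 4 and a 10 ms
# frame shift; adjust if your front end differs.
SUBSAMPLING = 4       # feature frames per encoder frame
FRAME_SHIFT_MS = 10   # milliseconds per feature frame

def frames_to_ms(encoder_frames: int) -> int:
    return encoder_frames * SUBSAMPLING * FRAME_SHIFT_MS

for chunk, left in ((8, 32), (16, 64)):
    print(f"chunk={chunk} -> {frames_to_ms(chunk)} ms, "
          f"left-context={left} -> {frames_to_ms(left)} ms")
# chunk=8 -> 320 ms, left=1280 ms; chunk=16 -> 640 ms, left=2560 ms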

These are my training, decoding, and Sherpa server commands.

Training command

python pruned_transducer_stateless5/train.py \
  --exp-dir pruned_transducer_stateless5/exp_natural \
  --num-encoder-layers 18 \
  --dim-feedforward 2048 \
  --nhead 8 \
  --encoder-dim 512 \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --dynamic-chunk-training 1 \
  --causal-convolution 1 \
  --short-chunk-size 20 \
  --num-left-chunks 4 \
  --max-duration 125 \
  --world-size 8 \
  --start-epoch 31 \
  --num-epochs 10 \
  --transcript-mode natural \
  --context-size 3 \
  --telegram-cred misc.ini \
  --word-table lang_char_natural/words.txt \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
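
(Aside: --dynamic-chunk-training 1 with --short-chunk-size 20 trains the model over randomly sampled chunk sizes, WeNet-style, and --num-left-chunks 4 caps the attention mask at four chunks of history. A simplified sketch of the sampling, paraphrased rather than copied from icefall's Conformer:)

import random

def sample_chunk_size(max_len: int, short_chunk_size: int = 20) -> int:
    # Roughly half of the batches see full context; the rest see a
    # short chunk in [1, short_chunk_size].
    chunk_size = random.randint(1, max_len - 1)
    if chunk_size > max_len // 2:
        return max_len                            # full-context batch
    return chunk_size % short_chunk_size + 1      # short streaming chunk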

Decoding command

for method in fast_beam_search modified_beam_search; do
    for avg in 20 1 15 5 10 17 13 7; do
        ./pruned_transducer_stateless5/decode.py \
            --num-encoder-layers 18 \
            --dim-feedforward 2048 \
            --nhead 8 \
            --encoder-dim 512 \
            --decoder-dim 512 \
            --joiner-dim 512 \
            --context-size 3 \
            --simulate-streaming 1 \
            --decode-chunk-size 16 \
            --left-context 64 \
            --causal-convolution 1 \
            --epoch 40 \
            --avg $avg \
            --exp-dir pruned_transducer_stateless5/exp_natural \
            --max-sym-per-frame 1 \
            --max-duration 200 \
            --decoding-method $method \
            --beam-size 4  \
            --word-table lang_char_natural/words.txt \
            --transcript-mode natural
    done
done
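
(Aside on --epoch/--avg: decode.py averages the last --avg checkpoints up to --epoch before decoding. A simplified sketch of that averaging, assuming every state-dict entry is a float tensor:)

import torch

def average_checkpoints(filenames):
    # Average model weights across checkpoints, as --epoch/--avg do.
    # Simplified: assumes every state-dict entry is a float tensor.
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] += state[k]
    for k in avg:
        avg[k] /= len(filenames)
    return avg

# e.g. --epoch 40 --avg 5 averages epoch-36.pt ... epoch-40.pt
filenames = [f"pruned_transducer_stateless5/exp_natural/epoch-{i}.pt"
             for i in range(36, 41)]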

Sherpa server

./sherpa/bin/streaming_pruned_transducer_statelessX_tokenizer/streaming_server.py --port 6006 \
    --max-batch-size 50 --max-wait-ms 5 --nn-pool-size 1 \
    --nn-model-filename /mnt/host/icefall/egs/csj/ASR/pruned_transducer_stateless5/exp_natural/cpu_jit.pt \
    --token-filename /mnt/host/icefall/egs/csj/ASR/lang_char_natural/words.txt \
    --decoding-method modified_beam_search \
    --decode-chunk-size 16 \
    --decode-left-context 64 
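
(To sanity-check the server, a client can then stream the wav to it. Below is a hypothetical minimal sketch only; the real wire protocol is implemented in sherpa's streaming_client.py, so the chunking and per-chunk replies here are assumptions.)

import asyncio
import soundfile as sf
import websockets

async def run(wav: str, uri: str = "ws://localhost:6006"):
    samples, sr = sf.read(wav, dtype="float32")
    assert sr == 16000, "the model expects 16 kHz audio"
    async with websockets.connect(uri) as ws:
        step = 3200  # 0.2 s of 16 kHz audio per message
        for i in range(0, len(samples), step):
            await ws.send(samples[i:i + step].tobytes())
            print(await ws.recv())  # partial transcript (assumed reply)

asyncio.run(run("test.wav"))
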
csukuangfj commented 2 years ago

Could you please show the command used to generate cpu_jit.pt?

teowenshen commented 2 years ago

Yes, please find it below. What do you think might have gone wrong with the export command?

python pruned_transducer_stateless5/export.py \
    --exp-dir pruned_transducer_stateless5/exp_natural \
    --word-table lang_char_natural/words.txt --epoch 40 \
    --avg 5 --jit 1 --num-encoder-layers 18 --dim-feedforward 2048 \
    --nhead 8 --encoder-dim 512 --decoder-dim 512 --joiner-dim 512 \
    --context-size 3 

That reminds me: my version of Librispeech's pruned_transducer_stateless5/export.py did not have convert_scaled_to_non_scaled(model, inplace=True) yet, so exporting initially failed due to a random component. I should send in a pull request to add that.
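
(For context, the --jit path of export.py boils down to roughly the following paraphrased sketch; model and filenames are assumed to be built earlier in the script as usual.)

import torch
# convert_scaled_to_non_scaled swaps the Scaled* layers for plain torch.nn
# equivalents, which removes the random component mentioned above and
# makes the model scriptable. Paraphrase, not the verbatim export.py code.
from scaling_converter import convert_scaled_to_non_scaled

model.load_state_dict(average_checkpoints(filenames))  # --epoch / --avg
model.eval()
convert_scaled_to_non_scaled(model, inplace=True)
torch.jit.script(model).save("cpu_jit.pt")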

csukuangfj commented 2 years ago

By the way, you may also need to set --streaming-model to True; see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/export.py#L141
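
That is, the export command above re-run with the streaming flags, e.g. (together with --causal-convolution 1, since the model was trained with a causal convolution):

python pruned_transducer_stateless5/export.py \
    --exp-dir pruned_transducer_stateless5/exp_natural \
    --word-table lang_char_natural/words.txt --epoch 40 \
    --avg 5 --jit 1 --num-encoder-layers 18 --dim-feedforward 2048 \
    --nhead 8 --encoder-dim 512 --decoder-dim 512 --joiner-dim 512 \
    --context-size 3 --streaming-model 1 --causal-convolution 1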

teowenshen commented 2 years ago

Thanks! With --streaming-model and --causal-convolution added, the results are much closer to the offline decoding results.

Should I also align decode-left-context and decode-chunk-size with the training parameters?

csukuangfj commented 2 years ago

Should I also align decode-left-context and decode-chunk-size with the training parameters?

I think you can tune these parameters independently at decoding time; they don't affect how the model is exported.

teowenshen commented 2 years ago

I see. Thanks for the advice! Somehow, I find a chunk size of 8 to perform best so far. Adding lookahead chunks introduced more insertions in the form of duplicated terms; I am not sure why.

Anyhow, I will close this issue now. Thanks for your help!