Performance gap between icefall local streaming decoding and sherpa streaming decoding

k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi

https://k2-fsa.github.io/sherpa

Apache License 2.0

Performance gap between icefall local streaming decoding and sherpa streaming decoding #91

Open shaynemei opened 2 years ago

shaynemei commented 2 years ago

Using the same model (a streaming pruned_transducer_stateless5 trained on gigaspeech), we are seeing a performance gap between local icefall streaming decoding and sherpa server streaming decoding. WERs for both setups are calculated using the same function here: https://github.com/k2-fsa/icefall/blob/5149788cb2e0730d1537b9711dcfc5c4b11a0f4b/egs/librispeech/ASR/pruned_transducer_stateless5/decode.py#L597-L638

WER on tedlium_dev:
local batch decoding: 4.35
local streaming decoding: 4.72
sherpa server streaming decoding: 5.72

csukuangfj commented 2 years ago

Could you compare the decoded results among them? You can use vimdiff to compare the file recogs-xxx.txt.

Are there many <UNK>s in sherpa based decoding for TEDLIUM?

pkufool commented 2 years ago

@shaynemei Did you use decode-right-context=2 (the default value) in sherpa? If so, please try decode-right-context=0. We found that not all models benefit from right context.

pkufool commented 2 years ago

Also, can you show your decoding commands for local batch decoding and local streaming decoding? I think the WER difference between them is a little large. Thanks!

shaynemei commented 2 years ago

local batch decoding command:

./pruned_transducer_stateless5/decode.py \
  --epoch 4 \
  --avg 1 \
  --simulate-streaming False \
  --causal-convolution True \
  --use-averaged-model False

local streaming decoding command:

./pruned_transducer_stateless5/decode.py \
  --epoch 4 \
  --avg 1 \
  --simulate-streaming True \
  --causal-convolution True \
  --use-averaged-model False

shaynemei commented 2 years ago

Actually there aren't any <UNK>s in sherpa based decoding for TEDLIUM

shaynemei commented 2 years ago

the utts in the two recogs.txt aren't in the same order, so I couldn't use vimdiff
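One way around the ordering problem is to key both files by utterance ID and compare the hypotheses directly. The following is a minimal sketch, not part of icefall or sherpa. It assumes each recogs file contains lines of the form `<utt_id>:<tab>ref=...` and `<utt_id>:<tab>hyp=...`; the function names `parse_recogs` and `differing_hyps` are hypothetical, and the regular expression may need adjusting if your files look different.

```python
import re

# Assumed line format (adjust if your recogs files differ):
#   <utt_id>:\tref=[...]
#   <utt_id>:\thyp=[...]
LINE_RE = re.compile(r"^(?P<utt>.+?):\s+(?P<kind>ref|hyp)=(?P<text>.*)$")


def parse_recogs(lines):
    """Map utt_id -> {"ref": text, "hyp": text} from an iterable of lines."""
    table = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # skip blank or unrecognized lines
        table.setdefault(m["utt"], {})[m["kind"]] = m["text"]
    return table


def differing_hyps(a, b):
    """Return (utt_id, hyp_a, hyp_b), sorted by utt_id, for utterances
    present in both tables whose hypotheses differ."""
    out = []
    for utt in sorted(a.keys() & b.keys()):
        hyp_a, hyp_b = a[utt].get("hyp"), b[utt].get("hyp")
        if hyp_a != hyp_b:
            out.append((utt, hyp_a, hyp_b))
    return out
```

To use it, open each file, pass the file object to `parse_recogs`, and print the tuples returned by `differing_hyps`. Alternatively, writing both tables back out sorted by utterance ID gives two files that vimdiff can compare.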

shaynemei commented 2 years ago

> @shaynemei Did you use decode-right-context=2 (the default value) in sherpa. If so, please try decode-right-context=0. We found that not all models can benefit from right context.

@pkufool

I reran TEDLIUM_DEV with no right context and got WER: 5.00

Is this 0.28 gap with local streaming (WER 4.72) expected for sherpa?

shaynemei commented 2 years ago

@csukuangfj @danpovey @pkufool just following up on this issue. Is there anything else I should provide?

csukuangfj commented 2 years ago

Sorry, I have not looked into it yet. I need to reproduce it locally first.

shaynemei commented 1 year ago

Do you need any help or additional information to reproduce it?

csukuangfj commented 1 year ago

Sorry for the late reply. Will look into it during the holiday.

uni-sagar-raikar commented 1 year ago

@csukuangfj Do we have any update on this issue? I am seeing a lot of deletion errors with sherpa decoding of a streaming zipformer model.

-Sagar
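For anyone who wants to check whether a WER gap like the ones reported in this thread is dominated by deletions, the errors can be split by type with a word-level edit-distance alignment. This is a minimal sketch and not the icefall scoring function linked in the first comment; `error_counts` is a hypothetical helper, and when several minimum-cost alignments exist it reports the counts for only one of them.

```python
def error_counts(ref, hyp):
    """Return (substitutions, insertions, deletions) for one minimum-edit
    alignment between two lists of words."""
    # Each cell holds (total_errors, subs, ins, dels) for ref[:i] vs hyp[:j].
    # Row 0: an empty reference means every hyp word is an insertion.
    prev = [(j, 0, j, 0) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        # Column 0: an empty hypothesis means every ref word is a deletion.
        cur = [(i, 0, 0, i)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diag = prev[j - 1]  # match, no new error
            else:
                c, s, n, d = prev[j - 1]
                diag = (c + 1, s + 1, n, d)  # substitution
            c, s, n, d = prev[j]
            deletion = (c + 1, s, n, d + 1)  # ref word missing from hyp
            c, s, n, d = cur[j - 1]
            insertion = (c + 1, s, n + 1, d)  # extra word in hyp
            cur.append(min(diag, deletion, insertion, key=lambda t: t[0]))
        prev = cur
    _, subs, ins, dels = prev[-1]
    return subs, ins, dels


subs, ins, dels = error_counts("the cat sat on the mat".split(),
                               "the cat on mat".split())
print(subs, ins, dels)  # 0 0 2
```

Summing the three counts over a test set and dividing by the total number of reference words gives the WER, so comparing the deletion totals for local streaming decoding against sherpa decoding shows how much of the gap comes from dropped words.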