Closed teowenshen closed 2 years ago
Could you please show the command about the generation of cpu_jit.pt?
Yes. Please find it as below. What do you think might go wrong with the export command?
python pruned_transducer_stateless5/export.py \
--exp-dir pruned_transducer_stateless5/exp_natural \
--word-table lang_char_natural/words.txt --epoch 40 \
--avg 5 --jit 1 --num-encoder-layers 18 --dim-feedforward 2048 \
--nhead 8 --encoder-dim 512 --decoder-dim 512 --joiner-dim 512 \
--context-size 3
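If it helps to sanity-check the export, the --jit 1 path saves a TorchScript model that can be loaded back with torch.jit.load. A minimal, self-contained sketch of that round-trip, using a toy module as a stand-in (the real cpu_jit.pt of course holds the transducer, not this class):

```python
import tempfile

import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Stand-in for the exported model, just to demonstrate the round-trip."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2.0


# export.py with --jit 1 essentially does this: script the model and save it.
scripted = torch.jit.script(TinyEncoder())
with tempfile.NamedTemporaryFile(suffix=".pt") as f:
    scripted.save(f.name)

    # Load it back the way sherpa loads cpu_jit.pt.
    model = torch.jit.load(f.name, map_location="cpu")

x = torch.ones(3)
assert torch.equal(model(x), torch.full((3,), 2.0))
```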
That reminds me: my version of Librispeech's pruned_transducer_stateless5/export.py did not yet have convert_scaled_to_non_scaled(model, inplace=True), so exporting initially failed because of a random component.
I should send in a pull request to add that.
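For reference, the fix is a one-line call just before scripting. A rough sketch of where it would go in export.py (the surrounding lines are an approximation from the Librispeech recipe, not exact code):

```diff
 model.to("cpu")
 model.eval()

+# Replace the Scaled* layers (ScaledLinear, ScaledConv1d, ...) with plain
+# torch modules so that the exported graph is deterministic.
+convert_scaled_to_non_scaled(model, inplace=True)
+
 if params.jit:
     model = torch.jit.script(model)
```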
By the way, you may also need to set --streaming-model to True: https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/export.py#L141
Thanks! The results are much closer to the offline decoding results once --streaming-model and --causal-convolution are added.
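For completeness, the export command above with the streaming flags added would then look roughly like this (flag names taken from the Librispeech recipe; both must match how the model was trained):

```bash
python pruned_transducer_stateless5/export.py \
  --exp-dir pruned_transducer_stateless5/exp_natural \
  --word-table lang_char_natural/words.txt --epoch 40 \
  --avg 5 --jit 1 --num-encoder-layers 18 --dim-feedforward 2048 \
  --nhead 8 --encoder-dim 512 --decoder-dim 512 --joiner-dim 512 \
  --context-size 3 \
  --streaming-model 1 \
  --causal-convolution 1
```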
Should I also align decode-left-context and decode-chunk-size with the training parameters?
You can tune these parameters separately during decoding, I think. They won't affect how the model is exported.
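As a rough sense of scale when tuning: assuming the usual 10 ms feature frame shift and the Conformer's subsampling factor of 4 (both assumptions about this recipe, not stated in the thread), the chunk-size values translate to audio durations like this:

```python
FRAME_SHIFT_MS = 10  # assumed fbank frame shift
SUBSAMPLING = 4      # assumed Conformer subsampling factor


def chunk_ms(num_subsampled_frames: int) -> int:
    """Audio covered by the given number of subsampled frames, in ms."""
    return num_subsampled_frames * SUBSAMPLING * FRAME_SHIFT_MS


# decode-chunk-size=8 -> 320 ms per chunk; 16 -> 640 ms
assert chunk_ms(8) == 320
assert chunk_ms(16) == 640

# decode-left-context is in the same subsampled frames:
# 32 -> 1.28 s of left context, 64 -> 2.56 s
assert chunk_ms(32) == 1280
assert chunk_ms(64) == 2560
```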
I see. Thanks for sharing your advice! Somehow, a chunk size of 8 does best so far. Adding lookahead chunks introduced more insertions as duplicated terms; I am not sure why.
Anyhow, I will close this issue now. Thanks for your help!
I have trained (another) pruned_transducer_stateless5 model on a transcript without disfluencies, and the results in offline decoding are better (decode.py of Icefall). Then I moved it to Sherpa to try streaming decoding of the same wav audio file, but the results are very different.
In Sherpa, I initially used the defaults decode-chunk-size=8 and decode-left-context=32. This yielded results comparable to Icefall's decode.py. I guess it's because the segment boundaries are different, so the model had different feature vectors to convolve.

Then, I set decode-left-context=64 and decode-chunk-size=16 to match train.py and decode.py. This time, the results were very bad, with a lot of deletions.

Are the left-context and chunk-size meant to be different in Sherpa? Also, what should the decode-right-context argument in the Sherpa server be? These are my training, decoding, and Sherpa server commands.
Training command
Decoding command
Sherpa server