k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

transcription inconsistency in different runs #630

Closed · ziggy1209 closed 6 months ago

ziggy1209 commented 6 months ago

I built sherpa-onnx-offline successfully following the guide, but a problem occurs when I run inference: the transcription differs from run to run.

For example, in one run I obtained the following result:

./bin/sherpa-onnx-offline --whisper-encoder=./distil-medium.en-encoder.int8.onnx   --whisper-decoder=./distil-medium.en-decoder.int8.onnx   --tokens=./distil-medium.en-tokens.txt   --provider=cuda   ./0.wav   ./1.wav   ./8k.wav
/home/whisper/sherpa-onnx-master/sherpa-onnx/csrc/parse-options.cc:Read:361 ./bin/sherpa-onnx-offline --whisper-encoder=./distil-medium.en-encoder.int8.onnx --whisper-decoder=./distil-medium.en-decoder.int8.onnx --tokens=./distil-medium.en-tokens.txt --provider=cuda ./0.wav ./1.wav ./8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="./distil-medium.en-encoder.int8.onnx", decoder="./distil-medium.en-decoder.int8.onnx", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./distil-medium.en-tokens.txt", num_threads=2, debug=False, provider="cuda", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0)
Creating recognizer ...
Started
/home/whisper/sherpa-onnx-master/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:119 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./0.wav
{"text": " I'm", "timestamps": [], "tokens":[" I", "'m"]}
----
./1.wav
{"text": " God, as a direct consequence of the sin which man thus punished, had given her a lovely child whose place was on that same dishonored bosom to connect her parent forever with the race and descent of mortals, and to be finally a blessed soul in heaven.", "timestamps": [], "tokens":[" God", ",", " as", " a", " direct", " consequence", " of", " the", " sin", " which", " man", " thus", " punished", ",", " had", " given", " her", " a", " lovely", " child", " whose", " place", " was", " on", " that", " same", " dishon", "ored", " bos", "om", " to", " connect", " her", " parent", " forever", " with", " the", " race", " and", " descent", " of", " mortals", ",", " and", " to", " be", " finally", " a", " blessed", " soul", " in", " heaven", "."]}
----
./8k.wav
{"text": " I'm", "timestamps": [], "tokens":[" I", "'m"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 6.960 s
Real time factor (RTF): 6.960 / 28.165 = 0.247
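(For reference, the real-time factor reported in the log is simply elapsed decoding time divided by the total duration of the input audio; a minimal Python sketch using the numbers above:)

```python
# Real-time factor (RTF): decoding time divided by total audio duration.
# Numbers are taken from the log above; audio_seconds is the summed
# length of 0.wav, 1.wav, and 8k.wav as reported by sherpa-onnx.
elapsed_seconds = 6.960
audio_seconds = 28.165

rtf = elapsed_seconds / audio_seconds
print(f"RTF = {elapsed_seconds} / {audio_seconds} = {rtf:.3f}")  # → 0.247
```

An RTF below 1.0 means decoding is faster than real time.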

and in another run I obtained the following result:

./bin/sherpa-onnx-offline --whisper-encoder=./distil-medium.en-encoder.int8.onnx   --whisper-decoder=./distil-medium.en-decoder.int8.onnx   --tokens=./distil-medium.en-tokens.txt   --provider=cuda   ./0.wav   ./1.wav   ./8k.wav
/home/whisper/sherpa-onnx-master/sherpa-onnx/csrc/parse-options.cc:Read:361 ./bin/sherpa-onnx-offline --whisper-encoder=./distil-medium.en-encoder.int8.onnx --whisper-decoder=./distil-medium.en-decoder.int8.onnx --tokens=./distil-medium.en-tokens.txt --provider=cuda ./0.wav ./1.wav ./8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="./distil-medium.en-encoder.int8.onnx", decoder="./distil-medium.en-decoder.int8.onnx", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./distil-medium.en-tokens.txt", num_threads=2, debug=False, provider="cuda", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0)
Creating recognizer ...
Started
/home/whisper/sherpa-onnx-master/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:119 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./0.wav
{"text": " I'm", "timestamps": [], "tokens":[" I", "'m"]}
----
./1.wav
{"text": " I'm here.", "timestamps": [], "tokens":[" I", "'m", " here", "."]}
----
./8k.wav
{"text": " I'm", "timestamps": [], "tokens":[" I", "'m"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 6.606 s
Real time factor (RTF): 6.606 / 28.165 = 0.235

Any idea why this inconsistency (and mis-transcription) exists? Thanks in advance!

csukuangfj commented 6 months ago

Please use the latest master and try again.

csukuangfj commented 6 months ago

/home/whisper/sherpa-onnx-master/sherpa-onnx

Please re-download the master code.

ziggy1209 commented 6 months ago

/home/whisper/sherpa-onnx-master/sherpa-onnx

Please re-download the master code.

Thanks! A re-download has fixed the issue.
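(The fix above amounts to re-cloning and rebuilding from the current master branch; a minimal sketch, assuming a CUDA-enabled build to match the `--provider=cuda` flag used above. The CMake flag follows the sherpa-onnx build documentation; adjust paths and job count to your setup.)

```shell
# Re-download the current master branch and rebuild from scratch.
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build

# SHERPA_ONNX_ENABLE_GPU=ON enables the CUDA execution provider
# (needed for --provider=cuda); omit it for a CPU-only build.
cmake -DCMAKE_BUILD_TYPE=Release -DSHERPA_ONNX_ENABLE_GPU=ON ..
make -j6
```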