k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0
515 stars 103 forks

Inconsistent behavior of sherpa decoder setup #401

Open uni-sagar-raikar opened 1 year ago

uni-sagar-raikar commented 1 year ago

Hi @csukuangfj, we are trying to benchmark offline icefall native decoding with a ".pt" model against sherpa offline websocket server decoding with a TorchScript model. The decoding results are inconsistent between the two for the same audio files. The majority of the errors are deletions at the beginning of a segment.

Here are some samples:

sherpa decoding:

id: (a1)
Scores: (#C #S #D #I) 0 1 5 0
REF: YOU KNOW THAT KIND OF STUFF
HYP: * ** ** OK

id: (a2)
Scores: (#C #S #D #I) 0 1 3 0
REF: YEAH GO AHEAD BOB
HYP: ** ***** YE

id: (a3)
Scores: (#C #S #D #I) 0 1 5 0
REF: YEAH I CAN HEAR YOU NOW
HYP: * * * YES

icefall decoding:

id: (a1) HYP: you know that kind of stuff
id: (a2) HYP: go ahead mark
id: (a3) HYP: yeah i can hear you now
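For readers unfamiliar with the `(#C #S #D #I)` columns, they count correct words, substitutions, deletions and insertions in the best word-level alignment of HYP against REF. The following is a minimal sketch of that counting with a plain edit-distance alignment; the function name `align_counts` is hypothetical and this is not the scoring tool used to produce the numbers above.

```python
def align_counts(ref: str, hyp: str):
    """Return (correct, substitutions, deletions, insertions) for two transcripts."""
    r, h = ref.lower().split(), hyp.lower().split()
    # best[i][j] = (total_errors, C, S, D, I) for aligning r[:i] with h[:j]
    best = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    best[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        best[i][0] = (i, 0, 0, i, 0)  # every reference word deleted
    for j in range(1, len(h) + 1):
        best[0][j] = (j, 0, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            e, c, s, d, n = best[i - 1][j - 1]
            if r[i - 1] == h[j - 1]:
                diag = (e, c + 1, s, d, n)      # words match
            else:
                diag = (e + 1, c, s + 1, d, n)  # substitution
            e, c, s, d, n = best[i - 1][j]
            dele = (e + 1, c, s, d + 1, n)      # reference word dropped
            e, c, s, d, n = best[i][j - 1]
            ins = (e + 1, c, s, d, n + 1)       # extra hypothesis word
            best[i][j] = min(diag, dele, ins, key=lambda t: t[0])
    return best[len(r)][len(h)][1:]


print(align_counts("YOU KNOW THAT KIND OF STUFF", "OK"))  # (0, 1, 5, 0)
```

With a one-word hypothesis against a six-word reference, the cheapest alignment is one substitution plus five deletions, which is the pattern reported for sample a1.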

Is this due to some constraint on context in the sherpa decoder? Thanks in advance.

@csukuangfj Edited: I had posted the results incorrectly earlier. sherpa is worse than icefall. Apologies.

-Sagar

csukuangfj commented 1 year ago

Looks to me there is something wrong with the decoding in icefall.

Could you post the decoding commands with icefall?

uni-sagar-raikar commented 1 year ago

@csukuangfj I just updated the message above; in fact, sherpa is worse than icefall.

csukuangfj commented 1 year ago

OK, then there must be something wrong with sherpa.

Could you use ./bin/sherpa-online to decode the files and post the decoded results?

uni-sagar-raikar commented 1 year ago

Here is the offline websocket server command we are using:

sherpa-offline-websocket-server \
  --doc-root=/workspace/sherpa/sherpa/bin/web \
  --nn-model=/cpu_jit.pt \
  --tokens=/tokens.txt \
  --sample-frequency=8000 \
  --num-work-threads=10 \
  --max-batch-size=100 \
  --use-gpu=true \
  --port=6007

For the client, we are using: [decode_manifest.py](https://github.com/k2-fsa/sherpa/blob/master/sherpa/bin/pruned_transducer_statelessX/decode_manifest.py)

csukuangfj commented 1 year ago

--sample-frequency=8000

Is your model trained using 8kHz data?

uni-sagar-raikar commented 1 year ago

Yes, we have trained the models on 8 kHz data and have taken care of the hard-coded sample_rate values across the sherpa codebase.
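A quick way to confirm that a given test file really matches the 8 kHz rate passed via --sample-frequency is to read its WAV header. The sketch below uses only Python's standard `wave` module; the helper names `wav_summary` and `matches_model_rate` are hypothetical and are not part of sherpa or icefall.

```python
import wave


def wav_summary(path):
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as f:
        return {
            "channels": f.getnchannels(),
            "sample_rate": f.getframerate(),
            "bits_per_sample": 8 * f.getsampwidth(),
            "num_samples": f.getnframes(),
        }


def matches_model_rate(path, expected_rate=8000):
    """True if the file's sample rate equals the rate the model expects."""
    return wav_summary(path)["sample_rate"] == expected_rate
```

For example, `matches_model_rate("a3.wav")` would return True only for a file whose header reports 8000 Hz; this checks the header, not whether the audio was upsampled or downsampled beforehand.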

csukuangfj commented 1 year ago

OK, then there must be something wrong with sherpa.

Could you use ./bin/sherpa-online to decode the files and post the decoded results?

Ok, please try this one.

uni-sagar-raikar commented 1 year ago

How is sherpa-online going to help? We are looking for an offline decoder. The same deletion issue also persists with "sherpa-offline".

csukuangfj commented 1 year ago

How is sherpa-online going to help

Just want to check you did not make mistakes with the websocket server and client.

uni-sagar-raikar commented 1 year ago

I tried the sherpa-offline tool. It is a standalone offline decoder, and it shows the same issue (deletions) as the websocket server. Is there some difference between icefall and sherpa in feature extraction or decoding?

csukuangfj commented 1 year ago

Could you show the complete logs for invoking ./bin/sherpa-offline? Also, could you post the output of

soxi your_test.wav

uni-sagar-raikar commented 1 year ago

Here are the logs:

[I] /workspace/sherpa/sherpa/csrc/parse-options.cc:495:int sherpa::ParseOptions::Read(int, const char* const*) 2023-06-13 10:11:47.560 sherpa-offline --nn-model=/mnt/efs/sagar/icefall/e2e_gpu_batch_asr_model_en-global_v2024.1.1.0/cpu_jit.pt --tokens=/mnt/efs/sagar/icefall/e2e_gpu_batch_asr_model_en-global_v2024.1.1.0/tokens.txt --use-gpu=true --sample-frequency=8000 a3.wav

[I] /workspace/sherpa/sherpa/cpp_api/bin/offline-recognizer.cc:126:int main(int, char**) 2023-06-13 10:11:47.569 OfflineRecognizerConfig(ctc_decoder_config=OfflineCtcDecoderConfig(modified=True, hlg="", lm_scale=1, search_beam=20, output_beam=8, min_active_states=30, max_active_states=10000), feat_config=FeatureConfig(fbank_opts=FbankOptions(frame_opts=FrameExtractionOptions(samp_freq=8000, frame_shift_ms=10, frame_length_ms=25, dither=0, preemph_coeff=0.97, remove_dc_offset=True, window_type="povey", round_to_power_of_two=True, blackman_coeff=0.42, snip_edges=True, max_feature_vectors=-1), mel_opts=MelBanksOptions(num_bins=80, low_freq=20, high_freq=0, vtln_low=100, vtln_high=-500, debug_mel=False, htk_mode=False), use_energy=False, energy_floor=0, raw_energy=True, htk_compat=False, use_log_fbank=True, use_power=True, device="cpu"), normalize_samples=True, nemo_normalize=""), nn_model="/mnt/efs/sagar/icefall/e2e_gpu_batch_asr_model_en-global_v2024.1.1.0/cpu_jit.pt", tokens="/mnt/efs/sagar/icefall/e2e_gpu_batch_asr_model_en-global_v2024.1.1.0/tokens.txt", use_gpu=True, decoding_method="greedy_search", num_active_paths=4)
[I] /workspace/sherpa/sherpa/cpp_api/offline-recognizer-transducer-impl.h:126:void sherpa::OfflineRecognizerTransducerImpl::WarmUp() 2023-06-13 10:11:50.195 WarmUp begins
[I] /workspace/sherpa/sherpa/cpp_api/offline-recognizer-transducer-impl.h:139:void sherpa::OfflineRecognizerTransducerImpl::WarmUp() 2023-06-13 10:11:51.132 WarmUp ended

filename: a3.wav
text: yes
token IDs: yes
timestamps (after subsampling): 0

soxi:

Input File     : 'a3.wav'
Channels       : 1
Sample Rate    : 8000
Precision      : 16-bit
Duration       : 00:00:00.97 = 7760 samples ~ 72.75 CDDA sectors
File Size      : 15.6k
Bit Rate       : 128k
Sample Encoding: 16-bit Signed Integer PCM
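As a sanity check on how little audio this file contains, the frame settings in the log (samp_freq=8000, frame_length_ms=25, frame_shift_ms=10, snip_edges=True) can be combined with the 7760 samples reported by soxi. The sketch below applies the usual Kaldi-style frame count for snip_edges=True; the function name is hypothetical, and the model's subsampling factor is not stated in this thread, so only the pre-subsampling count is computed.

```python
def num_frames_snip_edges(num_samples, sample_rate=8000,
                          frame_length_ms=25, frame_shift_ms=10):
    """Frame count when only frames that fit entirely in the signal are kept."""
    frame_length = sample_rate * frame_length_ms // 1000  # 200 samples at 8 kHz
    frame_shift = sample_rate * frame_shift_ms // 1000    # 80 samples at 8 kHz
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


print(7760 / 8000)                  # 0.97 (seconds)
print(num_frames_snip_edges(7760))  # 95
```

So a3.wav yields 95 fbank frames before any subsampling in the encoder, under the assumptions above.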

csukuangfj commented 1 year ago

So this test wav is correctly decoded, right?

uni-sagar-raikar commented 1 year ago

This is correctly decoded in icefall but not in sherpa.

csukuangfj commented 1 year ago

What is the correct transcript of this test wav? It is decoded as "yes".

uni-sagar-raikar commented 1 year ago

Hi @csukuangfj ,

As per the updates in https://github.com/k2-fsa/icefall/issues/1006, have the discrepancies between icefall and sherpa decreased with the new zipformer? We are seeing some mismatch between the encoder outputs in sherpa and those in normal icefall decoding.
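One way to make "some mismatch in the encoder outputs" concrete is to dump the encoder output for the same utterance from both pipelines and report the largest element-wise difference and the frame where it occurs. The sketch below is a dependency-free illustration on nested lists; `max_abs_diff` is a hypothetical helper, and how the outputs are dumped from icefall or sherpa is left out because it is not described in this thread.

```python
def max_abs_diff(a, b):
    """Compare two [num_frames][dim] matrices.

    Returns (largest absolute difference, index of the frame where it occurs).
    The frame index is -1 when the matrices are identical.
    """
    if len(a) != len(b):
        raise ValueError(f"frame count differs: {len(a)} vs {len(b)}")
    worst, worst_frame = 0.0, -1
    for t, (fa, fb) in enumerate(zip(a, b)):
        if len(fa) != len(fb):
            raise ValueError(f"dimension differs at frame {t}")
        d = max((abs(x - y) for x, y in zip(fa, fb)), default=0.0)
        if d > worst:
            worst, worst_frame = d, t
    return worst, worst_frame
```

A frame-count mismatch raises an error instead of being silently truncated, since a differing number of frames is itself a sign that the two pipelines treat the input differently.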

csukuangfj commented 1 year ago

Are you using the latest icefall and sherpa? @uni-sagar-raikar

uni-sagar-raikar commented 1 year ago

Yes, but we are using old zipformer models and are seeing some discrepancies with them. We have yet to compare them with the new zipformer models.