k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Question about replicating normalization: decode.py vs streaming_decode.py (streaming zipformer) #1006

Open · AdolfVonKleist opened this issue 1 year ago

AdolfVonKleist commented 1 year ago

I'm looking for hints about how to correctly replicate the exact normalization process that is applied during simulated streaming with decode.py in streaming_decode.py for the streaming zipformer.

I have been getting some really great results with the streaming zipformer and large datasets lately. I have primarily been using the default decode.py with simulated streaming for evals because it is very fast, and the delta between simulated and true streaming on the RESULTS.md pages appears consistent and pretty small. However, I noticed a pretty big delta between what I get with decode.py and with sherpa-onnx; in particular, some utterances that decode with perfect or near-perfect accuracy in the decode.py eval later produce empty hypotheses in sherpa-onnx. I started debugging this thinking it was something I had done in sherpa (still a distinct possibility) and found that applying some volume normalization via ffmpeg to the input could have a significant impact on those same utterances.

Next I tried to go a little further back and run streaming_decode.py to compare any possible differences with the output I have been seeing from decode.py. Here I immediately ran into this audio.max assertion error when trying to decode my test set in icefall:

I'm using the exact same cutset that performs very well with simulated streaming (via decode.py), but when I try to run it with streaming_decode.py it raises this assertion error. If I comment out the assertion, decoding runs and there is actually not much impact on WER (4.98% for simulated vs. 5.04% for true streaming in streaming_decode.py); however, I'd like to plug this gap.

I spent some time reviewing the code but didn't find an obvious answer: how should I ensure that these same cuts are appropriately normalized for streaming_decode.py so as not to trip this assertion? I'm also wondering how/if this might play into the larger gap I'm seeing between these evals and the performance I see with sherpa-onnx (roughly 3% worse).
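
A quick way to find the offending cuts is to scan the cutset for peaks above 1.0, which is the condition the assertion checks. A minimal sketch, assuming a lhotse cutset on disk (the path is a placeholder):

import numpy as np
from lhotse import CutSet

cuts = CutSet.from_file("cuts.jsonl.gz")  # placeholder path
for cut in cuts:
    audio = cut.load_audio()  # float32 samples, expected in [-1, 1]
    peak = np.abs(audio).max()
    if peak > 1.0:
        print(f"{cut.id}: peak = {peak:.4f}")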

csukuangfj commented 1 year ago

> Here I immediately ran into this audio.max assertion error when trying to decode my test set in icefall:

Are you using resampling?

AdolfVonKleist commented 1 year ago

@csukuangfj Yes, my original training data (and test set) contain a mixture of different codecs and sample rates. In sherpa-onnx I am explicitly resampling all test data to 16 kHz as well. Do decode.py and streaming_decode.py behave differently in this regard?

csukuangfj commented 1 year ago

Both decode.py and streaming_decode.py use lhotse for resampling, which internally uses torchaudio.

sherpa-onnx uses its own resampler, LinearResampler from Kaldi.

Maybe the difference comes from how resampling is implemented.

To verify that, could you find a wave file that is not correctly recognized by sherpa-onnx but is correctly recognized in icefall, manually resample it using torchaudio, and then use sherpa-onnx to decode the resampled file?
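
Such a check might look like the following. A minimal sketch: the file names are placeholders, and 16 kHz is the target rate discussed in this thread:

import torchaudio
import torchaudio.functional as F

waveform, sample_rate = torchaudio.load("problem-utterance.wav")
# Resample with torchaudio, i.e., the same implementation lhotse uses.
resampled = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
torchaudio.save("problem-utterance-16k.wav", resampled, 16000)
# Now decode problem-utterance-16k.wav with sherpa-onnx and compare.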

AdolfVonKleist commented 1 year ago

@csukuangfj thank you for the hint. This resolved some, but not all, of the differences. I will continue to debug and see if I can find anything else. There seems to be quite a lot of sensitivity here: I get slightly different results with each resampler (sox, ffmpeg with the swr resampler, ffmpeg with the soxr resampler, and torchaudio, which I guess uses either the sox_io or the soundfile backend).
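
One way to quantify that sensitivity is to diff two resampler outputs sample by sample. A sketch with placeholder file names; note that output lengths can differ by a few samples between implementations:

import torchaudio

a, sr_a = torchaudio.load("resampled-soxr.wav")
b, sr_b = torchaudio.load("resampled-swr.wav")
assert sr_a == sr_b
n = min(a.shape[1], b.shape[1])  # trim to the common length
print("max abs sample difference:", (a[:, :n] - b[:, :n]).abs().max().item())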

brainbpe commented 1 year ago

I am encountering the same issue, and my training data is originally 16 kHz, with no resampling applied anywhere. When using simulated streaming (via decode.py with chunk-len=32) I get a 5% WER, but when I use sherpa-online or the sherpa-online-websocket-server the WER is 8%. That is a 3% absolute gap, and most of it comes from deletion errors. What could the reason be? @csukuangfj @pingfengluo Is there any progress on your debugging? @AdolfVonKleist

csukuangfj commented 1 year ago

Where are the deletion errors? Are they mostly at the end of the utterance?

brainbpe commented 1 year ago

Mostly at the start of the utterance.

brainbpe commented 1 year ago

And some short utterances also produce empty hypotheses.

ezerhouni commented 1 year ago

@brainbpe I think I am having the same issue as you (with even worse WER differences). Did you find out where it was coming from? cc @AdolfVonKleist

brainbpe commented 1 year ago

@ezerhouni No, I haven't had enough time to dig into this. BTW: I see this problem in sherpa, and you see it in sherpa-onnx, so I think this may be a bug in feature extraction or in model export. cc @csukuangfj

AdolfVonKleist commented 1 year ago

I stopped receiving messages from these threads for some reason. I am still seeing this issue and have not yet been able to resolve it. I see there are a couple of other mentions of similar issues with sherpa:

@csukuangfj the deletions seem to occur at the beginning and sometimes even in the middle. I also managed to somewhat consistently improve the results by playing with volume normalization, but this seems like an inappropriate workaround.

@brainbpe @ezerhouni

w11wo commented 1 year ago

Hi, I am also finding the same issues as discussed above, but in sherpa-ncnn. Similarities include:

Hoping to see a solution for this issue soon! :)

AdolfVonKleist commented 1 year ago

@w11wo have you gotten any further with this or come up with any other ideas? I see exactly the same, quite consistent behavior: incorporating an ffmpeg command into the pipeline that applies volume=6dB (or whatever setting) consistently improves but does not fully resolve this issue. I still think it may be a minor difference between the behavior of lhotse during training and the behavior of sherpa-onnx/ncnn/etc. during inference. I also noticed that when resampling is employed, there is often a significant difference in output just between swr and soxr (the latter reproduces sox's resampling, which I think is what lhotse is doing).

This, combined with the volume tweaking/normalization, seems to have a significant impact on the outcome. So far, however, I have still failed to 100% replicate the decode.py results in sherpa-onnx. The issue very consistently comes down to deletions and nothing else (as also reported in 3-4 other similar issues, e.g. by @ezerhouni and others). It would be great to find a resolution, because if I discount this issue, the performance I'm currently getting out of sherpa/icefall/k2 is really, really amazing.
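
For reference, the ffmpeg preprocessing described above amounts to something like the following, invoked here from Python. This is a sketch: it assumes an ffmpeg build with libsoxr, and the paths and the 6 dB gain are just examples:

import subprocess

subprocess.run(
    [
        "ffmpeg", "-y", "-i", "input.wav",
        # Apply the gain discussed above, then resample to 16 kHz with soxr.
        "-af", "volume=6dB,aresample=16000:resampler=soxr",
        "-ac", "1",
        "output.wav",
    ],
    check=True,
)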

w11wo commented 1 year ago

Hi @AdolfVonKleist.

Unfortunately, I have not been able to figure this out. I provided a somewhat reproducible example here, which is still awaiting a response from the icefall team.

I think the example doesn't do it much justice, though, since I've seen worse deletion issues with my own private models. But I do think the underlying issue is the same, regardless of the model used.

Moreover, testing on multiple mobile devices replicates the same issue. At times we have to speak very loudly near the microphone of e.g. an iPad to get it to recognize anything at all, whereas on other devices this isn't an issue.

yaozengwei commented 1 year ago

@AdolfVonKleist Did you get different results between decode.py and streaming_decode.py for the same audio?

yaozengwei commented 1 year ago

I suggest using the latest zipformer recipe instead. In the old recipe pruned_transducer_stateless7_streaming, there might be some issues when doing the chunk-wise forward for the first chunks, since we did not mask out the initial zero states.

AdolfVonKleist commented 1 year ago

@yaozengwei I only see the differences with sherpa-onnx (I previously observed similar issues with ncnn, but I moved away from it in the end since the sherpa-onnx bindings tend to produce better RTFs in my experiments).

> the latest zipformer recipe

I will do this and see if the issues disappear. Is there now support for the latest streaming zipformer in sherpa-onnx as well?

csukuangfj commented 1 year ago

Yes, there is.

Here are two pre-trained models of the latest streaming zipformer that you can play with in sherpa-onnx:

- Chinese: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#pkufool-icefall-asr-zipformer-streaming-wenetspeech-20230615-chinese
- English: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#csukuangfj-sherpa-onnx-streaming-zipformer-en-2023-06-26-english

danpovey commented 1 year ago

Perhaps it is about sox commands normalizing the max amplitude to 1, which cannot be done online?
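
For context, that kind of peak normalization has to see the whole utterance before it can emit any audio, which is why there is no streaming equivalent. A minimal sketch of the offline operation:

import numpy as np

def peak_normalize(audio: np.ndarray) -> np.ndarray:
    # Scale so the loudest sample sits at +/- 1.0. The peak is only known
    # after reading the entire utterance, hence offline-only.
    peak = np.abs(audio).max()
    return audio / peak if peak > 0 else audio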


AdolfVonKleist commented 1 year ago

I have finished re-training a large model with the new zipformer and can confirm the observation here: this resolves the imbalance between insertions and deletions. The deletions issue appears to be fixed by a combination of this update and taking some care with the padding for the streaming and streaming-ONNX models. The average accuracy is also improved. BTW, I continue to see streaming accuracy converge very closely to non-streaming accuracy when the chunk size (now chunk + left context in the new zipformer) is maxed out to 512-1024 for large corpora and long audio. It's a really simple alternative to chunking and realigning for long-audio processing and might be worth considering for some of the users looking into that.

I'm still not 100% satisfied that I've fully sussed out the normalization, but for now, upgrading to the new zipformer is more than enough. Thanks for all the feedback on this one, and for the great work as always.

LoganLiu66 commented 1 year ago

> I suggest using the latest zipformer recipe instead. In the old recipe pruned_transducer_stateless7_streaming, there might be some issues when doing the chunk-wise forward for the first chunks, since we did not mask out the initial zero states.

How can I fix this if I want to use pruned_transducer_stateless7_streaming for model export?

csukuangfj commented 1 year ago

> > I suggest using the latest zipformer recipe instead. In the old recipe pruned_transducer_stateless7_streaming, there might be some issues when doing the chunk-wise forward for the first chunks, since we did not mask out the initial zero states.
>
> How can I fix this if I want to use pruned_transducer_stateless7_streaming for model export?

Are you using sherpa or sherpa-ncnn or sherpa-onnx?

LoganLiu66 commented 1 year ago

> > I suggest using the latest zipformer recipe instead. In the old recipe pruned_transducer_stateless7_streaming, there might be some issues when doing the chunk-wise forward for the first chunks, since we did not mask out the initial zero states.
> >
> > How can I fix this if I want to use pruned_transducer_stateless7_streaming for model export?
>
> Are you using sherpa or sherpa-ncnn or sherpa-onnx?

No. When I try to use streaming_decode.py for decoding, I get worse results than with decode.py (about 2% absolute on my own data). I also tried exporting to ONNX using export-onnx.py and testing with onnx_pretrained.py; that gives about a 3% absolute degradation compared to decode.py.

csukuangfj commented 1 year ago

How do you invoke decode.py?

LoganLiu66 commented 1 year ago

python ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 999 \
--avg 1 \
--use-averaged-model 0 \
--beam-size 4 \
--exp-dir ${exp_dir} \
--lang-dir ${lang_dir} \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method greedy_search

csukuangfj commented 1 year ago

@yaozengwei

could you take a look?

Does decode.py by default run inference in a non-streaming way with a streaming model?

yaozengwei commented 1 year ago

> > I suggest using the latest zipformer recipe instead. In the old recipe pruned_transducer_stateless7_streaming, there might be some issues when doing the chunk-wise forward for the first chunks, since we did not mask out the initial zero states.
> >
> > How can I fix this if I want to use pruned_transducer_stateless7_streaming for model export?
> >
> > Are you using sherpa or sherpa-ncnn or sherpa-onnx?
>
> No. When I try to use streaming_decode.py for decoding, I get worse results than with decode.py (about 2% absolute on my own data). I also tried exporting to ONNX using export-onnx.py and testing with onnx_pretrained.py; that gives about a 3% absolute degradation compared to decode.py.

Is there any clear error pattern when using streaming_decode.py? If it has more tail deletions, you could try a larger tail_pad_len, e.g., double decode_chunk_len, in https://github.com/k2-fsa/icefall/blob/8fcadb68a7cde093069e89830832e1ac728338fe/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/streaming_decode.py#L353
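
Illustratively, the tweak would look something like the following; see the linked line for the actual code, and note that the best value is dataset-dependent:

# In streaming_decode.py (illustrative; the real line may differ slightly).
# Padding more trailing frames gives the decoder a chance to emit tokens
# for the final chunk; try doubling it if tail deletions dominate:
tail_pad_len = 2 * params.decode_chunk_len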

LoganLiu66 commented 1 year ago

It seems to be an overall deterioration in performance.

decode.py

%WER = 5.48
Errors: 46 insertions, 99 deletions, 231 substitutions, over 6862 reference words (6532 correct)

streaming_decode.py

%WER = 8.48
Errors: 101 insertions, 178 deletions, 303 substitutions, over 6862 reference words (6381 correct)

Moreover, I tested decode.py and streaming_decode.py on LibriSpeech and got the same WER. But when I switch to my own dataset, I get the results above.

I have changed https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/decode_stream.py#L85 to

self.hyp = [-1] * (params.context_size - 1) + [params.blank_id]

because I got 40+% WER when it was initialized with

self.hyp = [params.blank_id] * params.context_size

csukuangfj commented 1 year ago

please have a look at the errs-* file and see if there are any error patterns.

LoganLiu66 commented 1 year ago

> please have a look at the errs-* file and see if there are any error patterns.

There doesn't seem to be any clear error pattern.

danpovey commented 1 year ago

How old is your k2? I think we may have fixed some errors in the last few months.

LoganLiu66 commented 1 year ago

The k2 version is

k2 version: 1.24.3
Build type: Release
Git SHA1: b835546b6005d243865e0acc3d29bd9c51670b1e
Git date: Wed Jul 26 11:29:59 2023
Cuda used to build k2: 11.3
cuDNN used to build k2: 8.2.0
Python version used to build k2: 3.7
OS used to build k2: 
CMake version: 3.18.0
GCC version: 7.5.0
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_80,code=sm_80 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 1.12.1
PyTorch is using Cuda: 11.3
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /opt/conda/lib/python3.7/site-packages/k2-1.24.3.dev20230802+cuda11.3.torch1.12.1-py3.7-linux-x86_64.egg/k2/version/version.py
_k2.__file__: /opt/conda/lib/python3.7/site-packages/k2-1.24.3.dev20230802+cuda11.3.torch1.12.1-py3.7-linux-x86_64.egg/_k2.cpython-37m-x86_64-linux-gnu.so
danpovey commented 1 year ago

Lots of deletions? Is the audio quiet? Guys, do we have relevant fixes since July 26?


csukuangfj commented 1 year ago

No, we haven't had new changes for k2 since July 26.
