k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Errors noticed after extensively testing Zipformer model #1465

Open kafan1986 opened 8 months ago

kafan1986 commented 8 months ago

I have extensively used the Zipformer model (both the streaming and non-streaming variants) and I have noticed the following errors. The tests were done with greedy search as well as with higher beam sizes, but no LM. The errors below are for the non-streaming variant, which should reach the highest accuracy.

a) Sometimes there are incorrect predictions at the start as well as at the end of the audio segment.
b) Sometimes clearly audible words get deleted (not predicted at all). This is the most critical error and its occurrence count is significant.
c) Two separate words sometimes get conjoined, either with their correct spelling ("good morning" => "goodmorning") or with some weird spelling ("good morning" => "goodorning").

Overall accuracy is slightly ahead of the "conformer" ASR model. The main advantage of Zipformer is the training speed-up: there is at least a 5x reduction in training time compared to Conformer. Also, the Zipformer ASR model seems to be more phonetic and does a better job when predicting out-of-vocabulary words.

If errors a), b), and c) reported above can be improved, especially b), then the Zipformer model can be state of the art for its size.

danpovey commented 8 months ago

Did you train this on your own data, and do you have Conformer-based baselines that don't show these issues, or at least that the issues are significantly less frequent in? It would be interesting to see some kind of numerical comparison.
My first suspicion would be the training data if I saw these types of problems.

Also are you comparing across just using different types of encoder (Conformer vs Zipformer), or were there changes in the loss function and decoding method?

kafan1986 commented 8 months ago

Did you train this on your own data, and do you have Conformer-based baselines that don't show these issues, or at least that the issues are significantly less frequent in? It would be interesting to see some kind of numerical comparison. My first suspicion would be the training data if I saw these types of problems.

Also are you comparing across just using different types of encoder (Conformer vs Zipformer), or were there changes in the loss function and decoding method?

Yes. Both the Conformer and Zipformer models are trained on the same dataset (my own), which is inherently quite noisy and challenging. And yes, the Conformer does not suffer from any of the 3 mentioned categories of errors, but in spite of these errors the overall accuracy of the Zipformer is slightly better than the Conformer's. Category (a) errors are the least frequent and category (c) errors happen often, but category (b) errors are the most critical and puzzling, as those segments sometimes have quite decent and audible pronunciation of the word(s) and the model simply does not decode one or more of them. A rarer occurrence is when an entire audio segment (say around 4-5 seconds long) generates no transcript at all.

csukuangfj commented 8 months ago

Also are you comparing across just using different types of encoder (Conformer vs Zipformer), or were there changes in the loss function and decoding method?

@kafan1986 Could you please also answer this question?

ezerhouni commented 8 months ago

@csukuangfj @kafan1986 I am experiencing the same thing, especially the b) part. In my case, yes, everything is the same.

joazoa commented 8 months ago

I also mostly see excessive deletions in both the offline and streaming models; not just words, but complete phrases or parts of phrases are getting ignored. The streaming version very often first correctly recognizes what is being said, then deletes it. Also common: the models are not able to handle more than 2 consecutive small words (as in single-token words).

csukuangfj commented 8 months ago

For the deletion errors, have you guys tried the fix in https://github.com/k2-fsa/icefall/pull/1447?

joazoa commented 8 months ago

@csukuangfj I am using the sherpa-onnx online websockets for my tests, latest master, which I think is not affected?

danpovey commented 8 months ago

If you could take examples that have problematic deletions and shift the input by 1, 2, or 4 frames, and see whether the deletions still appear, that would be interesting. I'm wondering whether the lack of complete invariance to frame-shifting could be part of the issue.

joazoa commented 8 months ago

I did a quick test by removing a second from the (longer) file with Audacity, but the same phrase was still deleted (with 32-256); the missing phrase is 9 seconds long.

I then tried something else: when I use a smaller context (128), a different (smaller) phrase is missing. When I use a smaller chunk size (16) and 128 context, all phrases are present.

(This was just a quick test on 1 sample where I know I had a problem.)

nshmyrev commented 8 months ago

Similar here: in streaming mode I see many deletions with the 64-256 Zipformer; the 16-64 Zipformer is much better. It also helps to delete silence chunks between phrases. My models are trained with MUSAN augmentation; it seems that the search never leaves blank.

joazoa commented 8 months ago

I also have the impression that sometimes things get worse after a period of silence (when testing live on the web demo). I forgot to mention, I am using modified beam search.

danpovey commented 8 months ago

@nshmyrev these are just differences in decoding settings you are comparing, right, not different models?

danpovey commented 8 months ago

It would be good to vary the chunk size & context independently to see whether one or the other is more responsible for the differences.

joazoa commented 8 months ago

@danpovey Correct, the model is the same, just different ONNX exports with different chunk and context settings. The chunk size seems to have the bigger impact for me at first glance, although both make a difference. I will do some more tests a bit later.

nshmyrev commented 8 months ago

@nshmyrev these are just differences in decoding settings you are comparing, right, not different models?

Right, same model, just different chunk sizes after export to onnx.

I'll prepare a more detailed test a bit later; I want to convert LibriSpeech into a streaming test to showcase this.

danpovey commented 8 months ago

I suspect that in your training sets, some of the data had largish deletions in the transcript; and the model learned: "if the left-context seems to be wrong, continue to output nothing until we get to a silence."
I have suspected in the past that some kind of randomization, with small probability, of the left-context symbols (the symbols used in the decoder/predictor) might resolve this kind of issue, as it will force the model to output something even if the left-context seems wrong. Possibly randomizing the left-context in chunks at least as long as the decoder/predictor history length might work best: for example, take a random position somewhere within each training batch, and randomize or permute the (say) 2 symbols starting at that position, within the batch, so that 2 symbols get moved from one batch element to the other. It would be interesting to see whether this might help, anyway.
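
For concreteness, here is a minimal sketch of that idea, assuming the decoder (predictor) input is a padded LongTensor of label-context symbols of shape (batch, U); the function name and hyper-parameters are hypothetical, not existing icefall code.

```python
import torch


def permute_context_spans(y: torch.Tensor, prob: float = 0.05, span: int = 2) -> torch.Tensor:
    """With probability `prob`, swap a short span of label-context symbols
    between batch elements at a random position, so the predictor sometimes
    sees a "wrong" left context and still has to emit the correct symbol."""
    if torch.rand(1).item() > prob or y.size(0) < 2 or y.size(1) <= span:
        return y
    y = y.clone()
    perm = torch.randperm(y.size(0))  # re-pair batch elements
    start = int(torch.randint(0, y.size(1) - span, (1,)))
    y[:, start:start + span] = y[perm, start:start + span]
    return y
```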

joazoa commented 8 months ago

@danpovey I think that might indeed be the issue. I have found such data before and tried to filter it out, but I'm sure there are such cases left. Unfortunately I won't be able to implement that change, but I could test it if somebody else can.

videodanchik commented 8 months ago

If you could take examples that have problematic deletions and shift the input by 1, 2, or 4 frames, and see whether the deletions still appear, that would be interesting. I'm wondering whether the lack of complete invariance to frame-shifting could be part of the issue.

For me, yes: a 1-2 frame shift can suddenly improve or worsen WER by ~0.1-0.3% absolute on my test dataset, where I have ~10.0% WER, so I'd consider this beyond random noise. This is with exactly the same model/decoder/everything; I just pad the features with zero frames (log zero, so a -20.0 pad value in reality).
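
For anyone who wants to reproduce this, a tiny sketch of the shift experiment (a hypothetical helper, not icefall code): prepend a few "log-zero" frames to the fbank matrix before decoding and compare the transcripts.

```python
import torch


def shift_features(feats: torch.Tensor, num_frames: int, pad_value: float = -20.0) -> torch.Tensor:
    """feats: (T, num_mel_bins) log-mel features; returns (num_frames + T, num_mel_bins)
    with `num_frames` padding frames (roughly log of zero energy) prepended."""
    pad = torch.full((num_frames, feats.size(1)), pad_value, dtype=feats.dtype)
    return torch.cat([pad, feats], dim=0)


# Decode shift_features(feats, 1), shift_features(feats, 2), shift_features(feats, 4)
# and diff the transcripts against the unshifted result.
```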

I also found that you can rebalance deletions and insertions to make deletions rarer by using this blank_penalty https://github.com/k2-fsa/icefall/blob/7bdde9174c7c95a32a10d6dcbc3764ecb4873b1d/egs/librispeech/ASR/zipformer/streaming_beam_search.py#L75, which just subtracts some number from the log-prob that corresponds to blank. But this introduces another hyper-parameter that you'll have to tune against your domain, and it looks more like a hack than a cure for the root cause. I wonder if anyone else has tried to tweak this blank_penalty?
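
Roughly, the penalty works like the sketch below (a simplified reconstruction of the idea, not the exact icefall code; see the link above for the real implementation): before choosing the next symbol, a constant is subtracted from the log-probability of blank, which makes the search less eager to emit blank and therefore reduces deletions.

```python
import torch


def greedy_step(log_probs: torch.Tensor, blank_id: int = 0, blank_penalty: float = 0.0) -> int:
    """log_probs: (vocab_size,) log-probabilities from the joiner for one frame."""
    if blank_penalty != 0.0:
        log_probs = log_probs.clone()
        log_probs[blank_id] -= blank_penalty  # make blank less attractive
    return int(log_probs.argmax())
```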

ezerhouni commented 8 months ago

If you could take examples that have problematic deletions and shift the input by 1, 2, or 4 frames, and see whether the deletions still appear, that would be interesting. I'm wondering whether the lack of complete invariance to frame-shifting could be part of the issue.

For me, yes: a 1-2 frame shift can suddenly improve or worsen WER by ~0.1-0.3% absolute on my test dataset, where I have ~10.0% WER, so I'd consider this beyond random noise. This is with exactly the same model/decoder/everything; I just pad the features with zero frames (log zero, so a -20.0 pad value in reality).

I also found that you can rebalance deletions and insertions to make deletions rarer by using this blank_penalty

https://github.com/k2-fsa/icefall/blob/7bdde9174c7c95a32a10d6dcbc3764ecb4873b1d/egs/librispeech/ASR/zipformer/streaming_beam_search.py#L75

that just subtracts some number from the log-prob that corresponds to blank. But this introduces another hyper-parameter that you'll have to tune against your domain, and it looks more like a hack than a cure for the root cause. I wonder if anyone else has tried to tweak this blank_penalty?

I haven't tried shifting the frames, but the blank_penalty does improve the WER quite a bit.

danpovey commented 8 months ago

We could maybe try a heuristic of increasing the blank penalty as the number of successive blanks rises. It's a bit ugly but might help address deletions after silences. Maybe the model never sees long silences in training, so it gets confused. We could perhaps also try occasionally adding segments with lots of silence in training.
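
A minimal sketch of that heuristic, assuming a greedy-search loop that tracks how many blanks were emitted in a row (nothing below exists in icefall; the names and default values are made up):

```python
import torch


def penalized_argmax(log_probs: torch.Tensor,
                     num_consecutive_blanks: int,
                     blank_id: int = 0,
                     base_penalty: float = 0.0,
                     growth: float = 0.05,
                     max_penalty: float = 2.0) -> int:
    """Grow the blank penalty with the number of blanks already emitted in a row,
    so the search is nudged out of a long "output nothing" run after silence."""
    penalty = min(base_penalty + growth * num_consecutive_blanks, max_penalty)
    log_probs = log_probs.clone()
    log_probs[blank_id] -= penalty
    return int(log_probs.argmax())


# The caller would reset num_consecutive_blanks to 0 whenever a non-blank symbol
# is emitted and increment it after every emitted blank.
```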

danpovey commented 8 months ago

Also it would be nice to check that the issue is not specific to Sherpa/onnx.

joazoa commented 8 months ago

@danpovey I've tried with a dataset with long silences within the transcripts, but it didn't resolve the issue.

kafan1986 commented 8 months ago

I have used Efficient Conformer (https://github.com/burchim/EfficientConformer) for testing a Conformer ASR model, and it gives better accuracy than the vanilla Conformer implementation. The loss used was RNN-T for the Efficient Conformer model (~10.7M params). For the Zipformer I used the default small configuration for training and decoding (~23M params).

kafan1986 commented 8 months ago

If you could take examples that have problematic deletions and shift the input by 1, 2, or 4 frames, and see whether the deletions still appear, that would be interesting. I'm wondering whether the lack of complete invariance to frame-shifting could be part of the issue.

For me, yes: a 1-2 frame shift can suddenly improve or worsen WER by ~0.1-0.3% absolute on my test dataset, where I have ~10.0% WER, so I'd consider this beyond random noise. This is with exactly the same model/decoder/everything; I just pad the features with zero frames (log zero, so a -20.0 pad value in reality). I also found that you can rebalance deletions and insertions to make deletions rarer by using this blank_penalty https://github.com/k2-fsa/icefall/blob/7bdde9174c7c95a32a10d6dcbc3764ecb4873b1d/egs/librispeech/ASR/zipformer/streaming_beam_search.py#L75

that just subtracts some number from the log-prob that corresponds to blank. But this introduces another hyper-parameter that you'll have to tune against your domain, and it looks more like a hack than a cure for the root cause. I wonder if anyone else has tried to tweak this blank_penalty?

I haven't tried shifting the frames, but the blank_penalty does improve the WER quite a bit.

What value of blank_penalty did you use? And what was the WER with and without blank_penalty in your testing? As you mentioned, it gave a significant improvement.

joazoa commented 8 months ago

@danpovey The main issue I am seeing on the live web demo is this: as long as I talk fluently (like reading an article), words get accurately transcribed, but when I just say random phrases into the microphone, with longer pauses, the words get properly transcribed and then removed (whole phrases at once). Once this happens, it is very likely to keep going for a while, either not transcribing anything or transcribing correctly and then removing it.

My datasets contain a lot of silence before, after, and in the middle of the transcripts; could it be that the model learnt that if one part is silence, the rest is most likely silence too?

I intentionally trained on data with a lot of silence to keep the model from outputting very common short words during silence (like "yes" or "hi").

danpovey commented 8 months ago

@joazoa I'd be more concerned about whether there were chunks of speech without a corresponding transcription in your dataset.

Regarding this: "when I just say random phrases into the microphone, with longer pauses, the words get properly transcribed and then removed (whole phrases at once)", that looks to me like an issue with the beam-search procedure. Why not just use greedy search?

joazoa commented 8 months ago

I confirm that greedy search works a lot better, thank you @danpovey! I no longer have the excessive corrections. I do still see words getting ignored, and I seem to be able to trigger it by switching to a language that the model is not trained on, mumbling, or moving my mic further away. When I switch back to proper audio, very often no words are recognized for a while.

yuyun2000 commented 8 months ago

(Originally posted in Chinese.) I've also encountered these issues before, especially: 1) after a long silence, when recognition suddenly starts, the first few words are almost always recognized incorrectly; 2) some words get omitted. However, I understand the reasons behind these problems quite well. The original Zipformer had pooling layers; in my experiments, removing the pooling layers made these two phenomena appear, and adding the pooling layers back made them disappear. The issue is that the updated Zipformer differs slightly from the original structure; in particular, the new Zipformer natively has no pooling layers. I'm not sure whether this is the reason you are experiencing these issues.

nshmyrev commented 8 months ago

A relevant paper from Google on high deletion rates with RNN-T:

"RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions" https://arxiv.org/pdf/2005.03271.pdf

nshmyrev commented 8 months ago

Another relevant issue, from the Nvidia TDT paper https://arxiv.org/abs/2304.06795:

WER for repeated tokens goes as high as 60%:

We notice that RNN-T models often suffer serious performance degradation when the text sequence has repetitions of the same tokens (repetition on the subword level to be exact if using subword tokenizations). Our investigation shows that more training data will not solve this issue, and this is an intrinsic issue of RNN-Ts.

danpovey commented 8 months ago

We're looking into the feasibility of switching over from RNN-T to TDT.

daniel-dona commented 4 months ago

I'm facing similar issues after training Zipformer2 on Common Voice (including the "other" cuts) for Spanish. I got a WER below 5% after 50 epochs, and in general it works pretty well. But sometimes it seems "confused" and produces no output for >30 s (not sure how to explain it; I'll try to record a video).

I tried using greedy_search and modified_beam_search; not much of a difference, I would say. I also tried setting the blank-penalty parameter, but not much luck with that either...

I'm now trying to train Zipformer with a CTC attention head; is there anything else I can try?

danpovey commented 4 months ago

My guess is that, if it's not giving outputs for >30s input, this may be a generalization issue due to the training having very short utterances, i.e. nothing approaching 30 seconds. Perhaps you could try concatenating some CommonVoice utterances and their transcripts together, or including some longer utterances from some other source.
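
As a rough illustration of that suggestion (a hypothetical helper, not part of icefall; in practice this could also be done at the lhotse CutSet level during data preparation): splice a few short clips into one longer training utterance and join their transcripts.

```python
import random

import torch
import torchaudio


def concat_utterances(items, k=4, gap_sec=0.5, sample_rate=16000):
    """items: list of (wav_path, transcript) pairs.
    Returns (waveform of shape (1, T), joined transcript)."""
    chosen = random.sample(items, k)
    gap = torch.zeros(1, int(gap_sec * sample_rate))  # short silence between clips
    pieces, texts = [], []
    for path, text in chosen:
        wav, sr = torchaudio.load(path)
        assert sr == sample_rate, "resample first if needed"
        wav = wav.mean(dim=0, keepdim=True)  # force mono
        pieces += [wav, gap]
        texts.append(text)
    return torch.cat(pieces, dim=1), " ".join(texts)
```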

By the way, I think we found that TDT used too much memory to work well with our training setups, since our pruned RNN-T is quite optimized for memory usage and we use large batch sizes.

kafan1986 commented 4 months ago

@danpovey I have the opposite experience. My training data has all utterances under 20 seconds. During inference I have used segments of up to 57 seconds, and the accuracy is quite good, if not better than for shorter segments (less than 5-6 seconds).

Regarding your feedback on TDT, does it help with OOV words and deletion errors?

joazoa commented 4 months ago

@kafan1986 When you say >30s, do you mean samples over 30s, or are you using streaming on long-form audio?

kafan1986 commented 4 months ago

The non-streaming variant at inference time, with the entire audio segment processed in one go.