k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Zipformer performs well on short audios but terribly on long audios with silences #1680

duhtapioca closed this issue 2 months ago

duhtapioca commented 3 months ago

Hi,

We trained a Zipformer model on approximately 20k hours of Hindi audio, with utterances ranging from 2 to 14 seconds. The test data consists of longer audio files with extended periods of silence: the recordings come from splitting conversational audio by channel, so each file contains a single speaker's voice.

We tried decoding one of these longer files with the base model: it produced only 2-3 words from a 6-minute audio file that contains a lot of speech between long silences. When we ran VAD first and decoded only the voiced segments, the transcription was much better. We also decoded a similar but only 1-minute-long audio and its transcription was good.

Is there a way to improve performance on files like these, other than applying VAD at inference time or fine-tuning the base model on data similar to the test set?

Thank you!

csukuangfj commented 3 months ago

it was only decoding 2-3 words from a 6-minute audio file containing a lot of speech between long silences.

Are you using a streaming zipformer or a non-streaming one?

If you are using a non-streaming zipformer, a wave of 6 minutes or 1 minute is too long. I recommend using a streaming zipformer or limiting your input wave to less than 20 seconds for a non-streaming model.

duhtapioca commented 3 months ago

We're using a non-streaming zipformer. Can you give a high-level idea of why a streaming model would be suitable for this case?

We were planning to fine-tune the model we trained on 100 hours of long audios containing silence (i.e. target-domain audios like the 1-minute and 6-minute ones mentioned above). Can this fine-tuning help overcome the non-streaming model's inability to decode long audios, or should we first train a streaming base model and fine-tune that for best results?

csukuangfj commented 3 months ago

I assume you know that whisper is a non-streaming model and it limits its input to 30 seconds.

A streaming model processes an input wave chunk by chunk. It does not limit how long the input wave is.
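
Roughly, the idea looks like the sketch below; the model methods and chunk size are illustrative placeholders, not the actual icefall API. The encoder carries its state across chunks, so memory usage stays constant no matter how long the recording is.

```python
# Minimal sketch of chunk-by-chunk streaming decoding; `init_states`,
# `decode_chunk`, and the chunk size are placeholders, not icefall code.
import torch

def decode_streaming(model, wave: torch.Tensor, chunk_samples: int = 16000):
    states = model.init_states()            # encoder caches carried across chunks
    hyp = []
    for start in range(0, wave.numel(), chunk_samples):
        chunk = wave[start : start + chunk_samples]
        # Only `chunk_samples` samples are processed at a time, so memory
        # does not grow with the total length of the recording.
        tokens, states = model.decode_chunk(chunk, states)
        hyp.extend(tokens)
    return hyp
```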

danpovey commented 3 months ago

I would expect that including data with long silences in the training data would make it more robust to long silences. (Also having longer utterances in the training data would make it more robust to longer test utterances.)

duhtapioca commented 3 months ago

@csukuangfj I understand that a streaming/causal model implies real-time/online decoding. Is a streaming zipformer better suited for longer audios even for offline decoding, i.e. when the complete wav file is available? If so, what's the intuition behind it?

@danpovey, once I have a base model (I will train a streaming one as recommended), I plan to fine-tune on 100 hours of data. Is that a good start?

danpovey commented 3 months ago

Sure, fine-tuning on a smaller amount of data should be fine, just use a not-too-large learning rate. I'm not super confident that a streaming model would solve the robustness issues if the model doesn't deal well with long silences. Personally for long audios I would normally not decode more than 30 second chunks. There may be an example script in our text_search repository somewhere.

duhtapioca commented 3 months ago

Personally for long audios I would normally not decode more than 30 second chunks. There may be an example script in our text_search repository somewhere.

@danpovey, I will look through text_search for anything relevant, but how do you suggest implementing "would normally not decode more than 30 second chunks"? Should I split my audio files with VAD at inference time and then decode?

Do you think that's the more widely used setup? If so, instead of fine-tuning on the 100-hour target domain directly, I should probably extract the voiced segments from that dataset, fine-tune the base model on those, and at inference time trim silences before decoding.

danpovey commented 3 months ago

Normally we just split into 30-second slightly overlapping segments and decode all of them; it avoids the need for a separate VAD.
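
For illustration, a minimal sketch of that splitting step (the window and overlap values are illustrative, and the function is not from icefall):

```python
# Sketch: cut a long waveform into 30 s windows with a small overlap before
# decoding each window separately; the overlapping hypotheses still need to
# be merged afterwards.
import torch

def split_into_windows(wave: torch.Tensor, sample_rate: int = 16000,
                       window_s: float = 30.0, overlap_s: float = 2.0):
    win = int(window_s * sample_rate)
    hop = int((window_s - overlap_s) * sample_rate)
    chunks, start = [], 0
    while start < wave.numel():
        chunks.append(wave[start : start + win])
        if start + win >= wave.numel():
            break
        start += hop
    return chunks
```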

duhtapioca commented 2 months ago

Normally we just split into 30-second slightly overlapping segments and decode all of them

@danpovey, is there a script available that does that? If we decode with these 30-second overlapping chunks, then when fine-tuning our base model, should we fine-tune on complete audios (which are up to many minutes long) or on voiced segments?

@csukuangfj can you share your perspective too?

csukuangfj commented 2 months ago

is there a script available that does that?

Please see https://github.com/k2-fsa/icefall/pull/980


should we finetune it on complete audios

As Dan suggested, please don't use very long audio as input for your non-streaming model. This holds not only for inference but also for training and fine-tuning.

If you use a 6-minute-long audio during training or fine-tuning, it is very likely that you will get OOM errors.

duhtapioca commented 2 months ago

@csukuangfj, with our current setup we are able to go up to a max duration of 3000 s. A 6-minute file is well under 3000 s, so can you give a high-level explanation of what could cause OOM when long audios are included in the training data?

pzelasko commented 2 months ago

Roughly speaking, scalability in batch dimension is ~linear, but scalability in sequence length dimension is ~quadratic.
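
A back-of-the-envelope illustration of what that means for self-attention, assuming a nominal 25 frames per second after subsampling (an assumed figure, not the exact Zipformer frame rate):

```python
# Attention score matrices are (T, T), so one long cut is far more expensive
# than several short cuts of the same total duration. 25 fps is an assumed
# post-subsampling frame rate, used only to make the numbers concrete.
frames_per_sec = 25
for seconds in (30, 360):                    # one 30 s cut vs. one 6 min cut
    t = seconds * frames_per_sec
    print(f"{seconds:4d} s -> {t} frames, {t * t:,} attention entries")
# Twelve 30 s cuts: 12 * 750^2 ≈ 6.8M entries per head/layer.
# One 360 s cut:        9000^2 = 81M entries, roughly 12x more for the
# same total amount of audio, hence the OOM risk with long utterances.
```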

duhtapioca commented 2 months ago

@pzelasko, we are able to fine-tune on the dataset containing files ranging from 15 s to 6 min without any OOMs at a max duration of 500 s (streaming). Can we expect good results if we keep training like this, or do we need to split these audios into smaller chunks?

Are long audios generally problematic for fine-tuning, beyond their potential to cause OOMs? Also, a few more general questions:

  1. In Kaldi there are special symbols for silence (SIL); how is that represented in icefall?
  2. Is CTC better than RNNT for our use case if we want to fine-tune a base streaming model (trained on shorter utterances without silence) on longer audios with silence, to make it robust to silences?

pzelasko commented 2 months ago

I never trained on utterances longer than ~40 s, so I can only share my suspicions. I'd expect that with longer examples it may be more difficult for the model to find a good alignment under either the CTC or RNNT objective. But since you are fine-tuning, the base model may be able to handle it.

Regarding silence tokens: a typical CTC/RNNT setup doesn't model them explicitly, and instead just predicts blanks.
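
A toy illustration of why no explicit SIL symbol is needed: during silence the highest-scoring output tends to be the blank, and greedy CTC decoding simply drops it (a generic sketch, not icefall's decoding code):

```python
# Generic CTC greedy decoding: collapse repeated ids, then drop blanks.
# Frames covering silence tend to emit blank, so silence never appears
# in the final token sequence.
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """log_probs: (T, vocab) per-frame log-probabilities."""
    ids = log_probs.argmax(dim=-1).tolist()
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out
```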

There's also the non-trivial effect of the batch size. If you have to sacrifice the batch size in order to train on longer utterances, you may need to train for more steps / use gradient accumulation / tweak the LR for optimal results.
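
For the gradient-accumulation option, a generic PyTorch sketch (not the icefall training loop; `compute_loss` and the other arguments are placeholders):

```python
# Generic gradient accumulation: simulate a larger batch when long utterances
# force a small per-step batch. `model`, `optimizer`, `dataloader`, and
# `compute_loss` are placeholders, not icefall code.
def train_with_accumulation(model, optimizer, dataloader, compute_loss,
                            accum_steps: int = 4):
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = compute_loss(model, batch) / accum_steps  # scale so grads average
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()                 # one update per `accum_steps` batches
            optimizer.zero_grad()
```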

duhtapioca commented 2 months ago

Hi @pzelasko,

We are planning to segment the long-audio dataset (4-15 min files) into smaller chunks to fine-tune the base streaming zipformer, so that it works better on long audio files with silences. We have accurate forced alignments and VAD to leverage. Earlier we looked into fine-tuning on the long audios directly, but we were losing a lot of data (everything over 6 minutes) because our setup could only handle files up to 6 minutes before going OOM. Can we get your advice on a few questions?

  1. For splitting the long audio files into chunks, which of the following is the better option?
     a) Take the voiced chunks and expand them to include as much surrounding silence as possible while staying under 30-40 seconds. Outcome: utterances ranging from a few seconds to 40 seconds, with silence in between.
     b) Cut 30-second chunks with slight overlaps. Outcome: uniform 30-second chunks.

  2. For the streaming model, is plain decode.py fine for inference, or should we opt for the long_file_recog logic of 30 s chunks with 1-2 s overlap (which helped the non-streaming model perform better)? In that case, option b) of chunking the training data into 30 s could be better, because the inference data would also consist of 30 s chunks.

pzelasko commented 2 months ago

I would probably go with your strategy of fixed 30 s chunks in both training and inference - consistent and simple.
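
If the fine-tuning data is in Lhotse manifests (as icefall recipes use), something like the sketch below could produce the fixed 30 s training cuts; the manifest paths are placeholders and the `cut_into_windows` arguments are worth double-checking against your Lhotse version.

```python
# Sketch: turn long cuts into fixed ~30 s windows with Lhotse. Paths are
# placeholders; setting hop < duration would give overlapping windows.
from lhotse import CutSet

cuts = CutSet.from_file("data/fbank/long_cuts.jsonl.gz")
windowed = cuts.cut_into_windows(duration=30.0)   # non-overlapping 30 s cuts
windowed.to_file("data/fbank/long_cuts_30s.jsonl.gz")
```

Getting correct transcripts for each window would rely on the accurate forced alignments mentioned earlier in the thread.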