huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Problems in concatenate_dataset #129

Closed George0828Zhang closed 2 weeks ago

George0828Zhang commented 1 month ago

In concatenate_dataset(): https://github.com/huggingface/distil-whisper/blob/66ac8dd94963d08c28b868d6e1eeb328aab57c8b/training/run_pseudo_labelling.py#L644-L671

From my understanding, the loop accumulates utterances into an (audio_sample, text_sample) pair and appends the accumulated pair to the output whenever the speaker changes or the concatenation would exceed the 30s budget. Since the appended (concatenated) sample does not contain the current utterance:

  1. The appended speaker should be previous_speaker rather than speaker.
  2. condition_on_prev signifies continuity at the start of the current utterance, so it should be shifted to the right by 1 (e.g. initialized as condition_on_prev = [0]).

Meanwhile, it seems that the very last accumulated sample in each batch never gets appended: when the for loop exits, there is still a (audio_sample, text_sample) pair of <= 30s that should have been appended but wasn't.
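
To make the three points concrete, here is a rough sketch of how the loop could look with all three fixes applied. This is paraphrased rather than a verbatim patch: the variable names (audio, text, speaker_id, input_lengths, max_input_length) are meant to follow the linked function, and I am collapsing the two flush branches into one for brevity.

```python
import numpy as np

# Assumed inputs, following the linked function:
#   audio:            list of np.ndarray waveforms, one per utterance
#   text:             list of transcripts, one per utterance
#   speaker_id:       list of speaker ids, one per utterance
#   input_lengths:    list of waveform lengths in samples
#   max_input_length: number of samples in 30s of audio

concatenated_audio, concatenated_text, concatenated_speaker = [], [], []
condition_on_prev = [0]  # fix 2: the first concatenated sample has no previous context

audio_sample, text_sample = audio[0], text[0]

for idx in range(1, len(audio)):
    previous_speaker = speaker_id[idx - 1]
    speaker = speaker_id[idx]
    fits = len(audio_sample) + input_lengths[idx] < max_input_length

    if fits and speaker == previous_speaker:
        # same speaker and still within the 30s budget: keep accumulating
        audio_sample = np.append(audio_sample, audio[idx])
        text_sample += " " + text[idx]
    else:
        # Flush the accumulated sample. It ends at utterance idx - 1, so it
        # belongs to previous_speaker (fix 1). The *next* sample starts at
        # utterance idx and is a continuation only if the speaker is
        # unchanged, hence the shifted condition_on_prev (fix 2).
        concatenated_audio.append(audio_sample)
        concatenated_text.append(text_sample)
        concatenated_speaker.append(previous_speaker)
        condition_on_prev.append(int(speaker == previous_speaker))
        audio_sample, text_sample = audio[idx], text[idx]

# fix 3: flush the final accumulated (<= 30s) sample once the loop exits
concatenated_audio.append(audio_sample)
concatenated_text.append(text_sample)
concatenated_speaker.append(speaker_id[-1])
```

With the shift, condition_on_prev[i] says whether concatenated sample i continues the previous one, and len(condition_on_prev) stays equal to the number of concatenated samples, since the initial [0] accounts for the final flush.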

These may not seem significant, but when fine-tuning on a custom dataset with diverse speakers, where condition_on_prev is expected to be true much of the time, they will produce incorrect training signals.