In `concatenate_dataset()`: https://github.com/huggingface/distil-whisper/blob/66ac8dd94963d08c28b868d6e1eeb328aab57c8b/training/run_pseudo_labelling.py#L644-L671

From my understanding, the logic in the for loop is:

If either
- adding the current utterance to `audio_sample` would exceed 30s, or
- the current `speaker` is different from the previous one (`prev_speaker`),

then save the concatenation up to the previous utterance (`audio_sample`), i.e. excluding the current utterance.
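For reference, a minimal sketch of that flush condition as I read it (the names `should_flush`, `MAX_INPUT_LENGTH`, etc. are mine for illustration, not the repo's):

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_INPUT_LENGTH = 30 * SAMPLE_RATE  # 30 s of audio at 16 kHz

def should_flush(audio_sample: np.ndarray, next_audio: np.ndarray,
                 prev_speaker: str, speaker: str) -> bool:
    """Flush the accumulated sample (save it and start a new one) before adding
    the next utterance if it would exceed 30 s or the speaker has changed."""
    too_long = len(audio_sample) + len(next_audio) > MAX_INPUT_LENGTH
    return too_long or speaker != prev_speaker
```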
Since the concatenated sample does not contain the current utterance:
- the appended speaker should be `prev_speaker` rather than `speaker`;
- `condition_on_prev` signifies continuity at the start of the current utterance, so it should be shifted to the right by 1 (e.g. initialized as `condition_on_prev = [0]`).
Meanwhile, it seems that the very last accumulated sample in each batch never gets appended: when the for loop exits, there is still an (`audio_sample`, `text_sample`) pair of <= 30s that should have been appended but wasn't.
These may not seem significant, but when fine-tuning on a custom dataset with diverse speakers, where `condition_on_prev` is expected to be true a lot of the time, they will produce wrong training signals.
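To make the three points concrete, here is a minimal sketch of the accumulate-and-flush loop with all three fixes applied. The function name, argument names, and the `speaker == prev_speaker` continuity check are my own assumptions for illustration, not the actual code in `run_pseudo_labelling.py`:

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_INPUT_LENGTH = 30 * SAMPLE_RATE  # 30 s of audio at 16 kHz

def concatenate_with_fixes(audio, text, speaker_ids):
    """audio: list of np.ndarray, text: list of str, speaker_ids: list of str.
    Assumes at least one utterance per batch."""
    concat_audio, concat_text, concat_speaker = [], [], []
    # Fix 2: the first saved sample can never be conditioned on a previous one,
    # so start with a 0; every flag appended below then describes the *next* sample.
    condition_on_prev = [0]

    audio_sample, text_sample = audio[0], text[0]

    for idx in range(1, len(audio)):
        speaker, prev_speaker = speaker_ids[idx], speaker_ids[idx - 1]
        too_long = len(audio_sample) + len(audio[idx]) > MAX_INPUT_LENGTH

        if too_long or speaker != prev_speaker:
            # Fix 1: the flushed sample ends at the *previous* utterance,
            # so it belongs to prev_speaker, not the current speaker.
            concat_audio.append(audio_sample)
            concat_text.append(text_sample)
            concat_speaker.append(prev_speaker)
            # Fix 2 (continued): this flag is for the sample that *starts* at the
            # current utterance; it is a continuation only if the speaker is unchanged.
            condition_on_prev.append(int(speaker == prev_speaker))
            audio_sample, text_sample = audio[idx], text[idx]
        else:
            audio_sample = np.concatenate([audio_sample, audio[idx]])
            text_sample += " " + text[idx]

    # Fix 3: flush whatever is still accumulated when the loop exits, otherwise
    # the last (<= 30 s) sample of every batch is silently dropped.
    concat_audio.append(audio_sample)
    concat_text.append(text_sample)
    concat_speaker.append(speaker_ids[-1])
    # condition_on_prev already leads with one 0, so all four lists stay aligned.

    return concat_audio, concat_text, concat_speaker, condition_on_prev
```

With this, `condition_on_prev[i]` is 1 exactly when sample `i` starts right after sample `i - 1` with the same speaker, which is the signal that conditioning on previous text is meant to capture.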