In `concatenate_dataset()`: https://github.com/huggingface/distil-whisper/blob/66ac8dd94963d08c28b868d6e1eeb328aab57c8b/training/run_pseudo_labelling.py#L644-L671

From my understanding, the logic in the for loop is:

If either
- adding the current utterance to `audio_sample` would exceed 30s, or
- the current `speaker` is different from the previous one (`prev_speaker`),

then save the concatenation up to the previous utterance (`audio_sample`), i.e. excluding the current utterance.
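For reference, a minimal sketch of that flush condition as I read it (the names `should_flush`, `MAX_INPUT_LENGTH`, etc. are mine for illustration, not the repo's):

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_INPUT_LENGTH = 30 * SAMPLE_RATE  # 30 s of audio at 16 kHz

def should_flush(audio_sample: np.ndarray, next_audio: np.ndarray,
                 prev_speaker: str, speaker: str) -> bool:
    """Flush the accumulated sample (save it and start a new one) before adding
    the next utterance if it would exceed 30 s or the speaker has changed."""
    too_long = len(audio_sample) + len(next_audio) > MAX_INPUT_LENGTH
    return too_long or speaker != prev_speaker
```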
Since the concatenated sample does not contain the current utterance:
- the appended speaker should be `prev_speaker` rather than `speaker`;
- `condition_on_prev` signifies continuity at the start of the current utterance, so it should be shifted to the right by 1 (e.g. initialized as `condition_on_prev = [0]`).
Meanwhile, it seems that the very last accumulated sample in each batch never gets appended: when the for loop exits, there is still an (`audio_sample`, `text_sample`) pair of <= 30s that should have been appended but wasn't.
These may not seem significant, but when fine-tuning on a custom dataset with diverse speakers, where `condition_on_prev` is expected to be true a lot of the time, they will produce wrong training signals.
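To make the three points concrete, here is a minimal sketch of the accumulate-and-flush loop with all three fixes applied. The function name, argument names, and the `speaker == prev_speaker` continuity check are my own assumptions for illustration, not the actual code in `run_pseudo_labelling.py`:

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_INPUT_LENGTH = 30 * SAMPLE_RATE  # 30 s of audio at 16 kHz

def concatenate_with_fixes(audio, text, speaker_ids):
    """audio: list of np.ndarray, text: list of str, speaker_ids: list of str.
    Assumes at least one utterance per batch."""
    concat_audio, concat_text, concat_speaker = [], [], []
    # Fix 2: the first saved sample can never be conditioned on a previous one,
    # so start with a 0; every flag appended below then describes the *next* sample.
    condition_on_prev = [0]

    audio_sample, text_sample = audio[0], text[0]

    for idx in range(1, len(audio)):
        speaker, prev_speaker = speaker_ids[idx], speaker_ids[idx - 1]
        too_long = len(audio_sample) + len(audio[idx]) > MAX_INPUT_LENGTH

        if too_long or speaker != prev_speaker:
            # Fix 1: the flushed sample ends at the *previous* utterance,
            # so it belongs to prev_speaker, not the current speaker.
            concat_audio.append(audio_sample)
            concat_text.append(text_sample)
            concat_speaker.append(prev_speaker)
            # Fix 2 (continued): this flag is for the sample that *starts* at the
            # current utterance; it is a continuation only if the speaker is unchanged.
            condition_on_prev.append(int(speaker == prev_speaker))
            audio_sample, text_sample = audio[idx], text[idx]
        else:
            audio_sample = np.concatenate([audio_sample, audio[idx]])
            text_sample += " " + text[idx]

    # Fix 3: flush whatever is still accumulated when the loop exits, otherwise
    # the last (<= 30 s) sample of every batch is silently dropped.
    concat_audio.append(audio_sample)
    concat_text.append(text_sample)
    concat_speaker.append(speaker_ids[-1])
    # condition_on_prev already leads with one 0, so all four lists stay aligned.

    return concat_audio, concat_text, concat_speaker, condition_on_prev
```

With this, `condition_on_prev[i]` is 1 exactly when sample `i` starts right after sample `i - 1` with the same speaker, which is the signal that conditioning on previous text is meant to capture.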