[pseudo-labelling] fix concatenate datasets

huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

MIT License


[pseudo-labelling] fix concatenate datasets #138

Closed eustlb closed 2 weeks ago

eustlb commented 3 weeks ago

This PR fixes the behavior of the --concatenate_audio flag. In particular, the function concatenate_dataset now handles edge cases that previously led to skipping the last sample and to associating the wrong speaker and the wrong condition_on_prev value with the first sample.
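
To illustrate the edge cases described above, here is a minimal, simplified sketch. It is not the repository's concatenate_dataset. The function name concatenate_samples, the sample layout (audio, text, speaker_id), the max_len limit, and the reading of condition_on_prev as "the previous emitted chunk has the same speaker" are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def concatenate_samples(samples: List[Dict[str, Any]], max_len: int) -> List[Dict[str, Any]]:
    """Greedily merge consecutive same-speaker samples up to max_len audio values.

    Hypothetical helper: a simplified stand-in, not the distil-whisper code.
    """
    merged: List[Dict[str, Any]] = []
    current = None       # chunk currently being built
    prev_speaker = None  # speaker of the last emitted chunk (None before the first)

    for sample in samples:
        same_speaker = current is not None and sample["speaker_id"] == current["speaker_id"]
        fits = current is not None and len(current["audio"]) + len(sample["audio"]) <= max_len

        if same_speaker and fits:
            # Extend the open chunk instead of starting a new one.
            current["audio"].extend(sample["audio"])
            current["text"] += " " + sample["text"]
            continue

        if current is not None:
            merged.append(current)
            prev_speaker = current["speaker_id"]

        # Edge case: for the very first chunk prev_speaker is None, so
        # condition_on_prev is False and the speaker comes from this sample.
        current = {
            "audio": list(sample["audio"]),
            "text": sample["text"],
            "speaker_id": sample["speaker_id"],
            "condition_on_prev": prev_speaker == sample["speaker_id"],
        }

    # Edge case: flush the chunk still open after the loop, otherwise the
    # last sample would be silently dropped.
    if current is not None:
        merged.append(current)
    return merged


samples = [
    {"audio": [0.1, 0.2], "text": "hello", "speaker_id": "a"},
    {"audio": [0.3], "text": "world", "speaker_id": "a"},
    {"audio": [0.4], "text": "hi", "speaker_id": "b"},
    {"audio": [0.5, 0.6, 0.7], "text": "there", "speaker_id": "b"},
]
chunks = concatenate_samples(samples, max_len=3)
```

In this sketch the final flush after the loop keeps the last sample, and the first chunk takes its speaker from its own sample with condition_on_prev set to False.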