lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0
935 stars, 214 forks

Add GigaSpeech 2 recipe #1365

Open yfyeung opened 3 months ago

yfyeung commented 3 months ago

This PR adds a recipe for GigaSpeech 2. The raw GigaSpeech 2 set comprises about 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese. The refined GigaSpeech 2 set consists of 10,000 hours of Thai and 6,000 hours each of Indonesian and Vietnamese. The GigaSpeech 2 test sets are intended to reflect realistic speech recognition scenarios and to mirror the real-world performance of an ASR system on low-resource languages.

For more details, please visit:
Dataset: https://huggingface.co/datasets/speechcolab/gigaspeech2
Preprint paper: https://arxiv.org/pdf/2406.11546

yfyeung commented 3 months ago

Thanks!! The recipe looks good to me, although I have one suggestion. If you could re-use the streaming manifest writing mechanism from the GigaSpeech 1 recipe, users would be able to prepare this dataset with minimal memory usage. As-is, the recipe needs a lot of CPU memory because it holds the entire manifest in memory before writing it to disk. See:

https://github.com/lhotse-speech/lhotse/blob/da4d70d7affc477eb8dc3a51f9b13d387817059a/lhotse/recipes/gigaspeech.py#L96-L129
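To illustrate the idea behind the suggestion, here is a minimal standard-library sketch of streaming manifest writing: each item is serialized to a gzipped JSONL file as soon as it is produced, so only one item is held in memory at a time. This is not the code from the linked recipe. `iter_segments`, `write_manifest_streaming`, `read_manifest`, and the item fields are hypothetical names chosen for illustration; in Lhotse itself the equivalent role is played, as far as I recall, by the sequential writers returned from `RecordingSet.open_writer` and `SupervisionSet.open_writer`.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Iterator


def iter_segments(n: int) -> Iterator[Dict]:
    """Hypothetical stand-in for lazily parsing dataset metadata."""
    for i in range(n):
        # Items are yielded one by one and never collected into a list.
        yield {"id": f"seg-{i}", "text": f"utterance {i}", "duration": 1.0 + i}


def write_manifest_streaming(items: Iterable[Dict], path: Path) -> int:
    """Write each item as one JSON line as soon as it arrives; return the count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_manifest(path: Path) -> Iterator[Dict]:
    """Lazily read the manifest back, one item per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        manifest = Path(tmp) / "supervisions.jsonl.gz"
        written = write_manifest_streaming(iter_segments(1000), manifest)
        print(f"wrote {written} items to {manifest.name}")
```

The trade-off is that the writer cannot sort or deduplicate items globally, since it never sees the full manifest; peak memory stays roughly constant regardless of dataset size.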

Sure, I will implement this later.