Open yfyeung opened 3 months ago
Thanks!! The recipe looks good to me, although I have one suggestion. If you could re-use the streaming manifest writing mechanism from the GigaSpeech 1 recipe, it would allow users to prepare this dataset with minimal memory usage. As-is, the entire manifest has to be held in CPU memory before it is written to disk, which will take a lot of RAM for a dataset of this size. See:
Sure, I will implement this later.
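For readers unfamiliar with the idea, the streaming approach suggested above can be sketched roughly as follows. This is a minimal standard-library illustration, not the GigaSpeech 1 recipe's actual code: the function names (`iter_records`, `write_manifest_streaming`, `read_manifest_streaming`) and the record fields are hypothetical, and a gzipped JSONL file is assumed as the manifest format. The point is that each entry is written as soon as it is produced, so memory usage stays roughly constant regardless of dataset size.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator


def iter_records(num_utterances: int) -> Iterator[Dict]:
    """Hypothetical stand-in for corpus parsing: yields one entry at a time."""
    for i in range(num_utterances):
        yield {
            "id": f"utt-{i:06d}",
            "start": 0.0,
            "duration": 1.5,
            "text": f"transcript {i}",
        }


def write_manifest_streaming(records: Iterable[Dict], path: str) -> int:
    """Write entries to a gzipped JSONL file one line at a time.

    Only the current entry is held in memory; returns the number written.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Thai/Vietnamese text readable on disk.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_manifest_streaming(path: str) -> Iterator[Dict]:
    """Lazily read the manifest back, again one entry at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

Because `iter_records` is a generator, something like `write_manifest_streaming(iter_records(n), "manifest.jsonl.gz")` never materializes the full list of entries, in contrast to building the whole manifest first and serializing it at the end.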
This PR adds a recipe for GigaSpeech 2. GigaSpeech 2 raw comprises about 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese. GigaSpeech 2 refined consists of 10,000 hours of Thai and 6,000 hours each of Indonesian and Vietnamese. The GigaSpeech 2 test sets more realistically reflect speech recognition scenarios and mirror the real performance of an ASR system for low-resource languages.
For more details, please visit:
Dataset: https://huggingface.co/datasets/speechcolab/gigaspeech2
Preprint paper: https://arxiv.org/pdf/2406.11546