Add preprocessing scripts for the librilight datasets

HarryHe11 commented 9 months ago

✨ Description

This update introduces preprocessing scripts for the Libri-Light datasets, enhancing their usability and compatibility with our processing workflows.

🚧 Related Issues

No related issues.

👨‍💻 Changes Proposed

[x] Implemented the main workflow preprocessors/librilight.py for preprocessing Libri-Light datasets.
[x] Developed a utils/cut_by_vad.py script to segment audio files using multiprocessing (Step 1: Segmentation).
[x] Created a utils/mfa_prepare.py script to convert audio files to 16 kHz, 16-bit PCM and to filter out longer audio files (Steps 2 & 3: Filter and Preprocess).
[x] Added utils/whisper_transcription.py for audio transcription using Distil-Whisper, a more efficient variant of Whisper, and included text preprocessing functions for these transcriptions (Steps 4 & 5: Transcription & Text-Preprocess).
[x] Integrated an MFA alignment function specifically tailored for Libri-Light in preprocessors/librilight.py (Step 6: Alignment).
[x] Enabled data splitting into train/dev/eval sets, along with statistics calculation and metadata construction for Libri-Light in preprocessors/librilight.py (Steps 7-9).
[x] Provided support for different subsets of Libri-Light, including "tiny", "small", "medium", and "large".
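
For reviewers who want a mental model of the filter, split, and statistics steps (Steps 2 and 7-9), here is a minimal sketch. It is not the code in this PR: the function names (`filter_by_duration`, `split_dataset`, `summarize`), the 20-second threshold, the 10% dev/eval ratios, and the utterance dictionary layout are all assumptions made for illustration.

```python
import random


def filter_by_duration(utterances, max_seconds=20.0):
    """Keep only utterances no longer than max_seconds (assumed threshold)."""
    return [u for u in utterances if u["duration"] <= max_seconds]


def split_dataset(utterances, dev_ratio=0.1, eval_ratio=0.1, seed=1234):
    """Shuffle deterministically, then carve out dev and eval; the rest is train.

    The ratios and the utterance-level (not speaker-level) split are assumptions.
    """
    # Sort first so the result does not depend on the input order.
    items = sorted(utterances, key=lambda u: u["uid"])
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_ratio)
    n_eval = int(len(items) * eval_ratio)
    return {
        "dev": items[:n_dev],
        "eval": items[n_dev:n_dev + n_eval],
        "train": items[n_dev + n_eval:],
    }


def summarize(splits):
    """Per-split utterance count and total duration in hours."""
    return {
        name: {
            "utterances": len(items),
            "hours": sum(u["duration"] for u in items) / 3600.0,
        }
        for name, items in splits.items()
    }


if __name__ == "__main__":
    # Hypothetical metadata: 100 segments lasting 1, 2, ..., 100 seconds.
    demo = [{"uid": f"utt_{i:03d}", "duration": float(i + 1)} for i in range(100)]
    kept = filter_by_duration(demo)
    print(summarize(split_dataset(kept)))
```

Seeding a private `random.Random` instance keeps the split reproducible across runs without touching the global random state, which matters when the same subset is preprocessed more than once.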

🧑‍🤝‍🧑 Who Can Review?

@lmxue
@RMSnow
@HeCheng0625

🛠 TODO

[x] Test on Libri-Light-tiny (custom split from Libri-Light-small).
[ ] Test on Libri-Light-small.
[ ] Test on Libri-Light-medium.
[ ] Test on Libri-Light-large.

✅ Checklist

[x] Code has been reviewed
[x] Code complies with the project's code standards and best practices
[x] Code has passed all tests
[x] Code does not affect the normal use of existing features
[x] Code has been commented properly
[x] Documentation has been updated (if applicable)
[x] Demo/checkpoint has been attached (if applicable)

HarryHe11 commented 9 months ago

In this comment, I provide screenshots from testing the implemented scripts on Libri-Light-tiny (a custom split from Libri-Light-small).

The Running Process

preprocessors/librilight.py

Part 1 of 5: [screenshot]

Part 2 of 5: [screenshot]

Part 3 of 5: [screenshot]

Part 4 of 5: [screenshot]

Part 5 of 5: [screenshot]

The Outcome:

Processed Data: [screenshot]

Metadata: [screenshot]

HarryHe11 commented 9 months ago

> Thanks for your efforts. Please check out the comments

Thank you for reviewing my PR. I have addressed your concerns; please see my most recent commits.

lmxue commented 9 months ago

LGTM. P.S. This PR has been tested on Libri-Light-tiny only. The three other subsets listed in the TODO still need to be tested; you may need to test them when the datasets are ready.

HarryHe11 commented 9 months ago

Sure, I will test them then.

open-mmlab / Amphion