Closed HarryHe11 closed 9 months ago
In this comment, I provide screenshots from testing the implemented scripts on Libri-Light-tiny (a custom split from Libri-Light-small).
The Running Process
preprocessors/librilight.py: [screenshots, Parts 1 to 5 of 5]
The Outcome
Processed Data: [screenshot]
Metadata: [screenshot]
Thanks for your efforts. Please check out the comments.
Thank you so much for reading my PR. I have addressed your concerns; please see my most recent commits.
LGTM.
P.S. This PR has been tested on Libri-Light-tiny. However, three other subdatasets need to be tested as listed in TODO. You may need to test them when the dataset is ready.
Sure, I will test them then.
Description
This update introduces preprocessing scripts for the Libri-Light datasets, enhancing their usability and compatibility with our processing workflows.
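As a rough illustration of one stage in this workflow, the sketch below mirrors what the change list describes for `utils/mfa_prepare.py` (Steps 2 & 3): converting audio to 16 kHz, 16-bit PCM and dropping clips that are too long. It is not the code from this PR. It uses only the Python standard library, a naive linear-interpolation resampler stands in for whatever resampling the real script performs, and the names `resample_linear`, `to_pcm16`, `write_wav16`, `keep_clip`, and the 30-second cutoff `MAX_SECONDS` are all hypothetical.

```python
import array
import sys
import wave

TARGET_RATE = 16_000  # 16 kHz, as stated in the PR
MAX_SECONDS = 30.0    # hypothetical cutoff; the PR does not state the threshold


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample a mono float sequence by linear interpolation (illustrative only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def to_pcm16(samples):
    """Clip floats to [-1, 1] and scale them to signed 16-bit integers."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return array.array("h", (int(s * 32767) for s in clipped))


def write_wav16(path, samples, rate=TARGET_RATE):
    """Write mono float samples as a 16-bit PCM WAV file."""
    pcm = to_pcm16(samples)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(rate)
        wf.writeframes(pcm.tobytes())


def keep_clip(num_samples, rate, max_seconds=MAX_SECONDS):
    """Return True when the clip is no longer than the cutoff."""
    return num_samples / rate <= max_seconds
```

In practice a dedicated audio library with a band-limited resampler would be the better choice; linear interpolation is used here only to keep the sketch dependency-free.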
Related Issues
No related issues.
Changes Proposed
- `preprocessors/librilight.py` for preprocessing Libri-Light datasets.
- `utils/cut_by_vad.py` script to segment audio files using multiprocessing (Step 1: Segmentation).
- `utils/mfa_prepare.py` script to convert audio files to 16 kHz, 16-bit PCM and to filter out longer audio files (Steps 2 & 3: Filter and Preprocess).
- `utils/whisper_transcription.py` for audio transcriptions using distilled-whisper, a more efficient variant of Whisper, with text preprocessing functions for these transcriptions (Steps 4 & 5: Transcription & Text-Preprocess).
- `preprocessors/librilight.py` (Step 6: Alignment).
- `preprocessors/librilight.py` (Steps 7-9).
Who Can Review?
TODO
Checklist