lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0
944 stars 216 forks

Lhotse Manifest Preparation Stuck and Incomplete for MLS English Train Set #1403

Open mubtasimahasan opened 1 week ago

mubtasimahasan commented 1 week ago

I am attempting to prepare the Multilingual LibriSpeech (MLS) dataset using the lhotse.recipes.mls recipe:

lhotse prepare mls $corpus_dir $output_dir --flac --num-jobs 40

After running this command for more than 72 hours, the process seems to be stuck. I can see the following files in the $output_dir:

However, the following files are missing:

The output log shows the process stuck at:

Scanning audio files (*.flac): 10807259it [15:50, 7377.79it/s]

The output has looked like this since the very beginning, and there doesn't seem to be any further progress.

Questions:

  1. How can I resolve this issue?
    The command appears to be hanging when scanning the train set. Could this be a bug or an issue with handling large datasets?

  2. Is my use of an HDD causing slow processing?
    I am using an HDD for storage, and the train set of the mls_english subset is 2.4 TB in size. Could the HDD's performance be causing the extreme slowness?

  3. Is there a way to speed up manifest preparation for large datasets?
    Are there optimizations or alternative approaches I could try to handle the manifest preparation more efficiently for such a large dataset?

Any guidance on these issues would be greatly appreciated! Thank you for your help.

pzelasko commented 6 days ago

The MLS recipe was the first one we added for very large datasets, and it's implemented less efficiently than others. You'd need to modify it to use incremental manifest writers so it avoids blowing up CPU memory. See how it's done in the GigaSpeech recipe, for example: https://github.com/lhotse-speech/lhotse/blob/a30720b8329676a92ced850d941d45a352df5bb7/lhotse/recipes/gigaspeech.py#L96-L102
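To illustrate the incremental-writer idea without requiring lhotse installed, here is a minimal sketch in plain Python: each manifest entry is streamed straight to a gzipped JSONL file as it is produced, so peak memory stays roughly constant no matter how large the corpus is (lhotse manifests expose an `open_writer()` helper that serves the same purpose; the entry dicts and paths below are toy stand-ins, not the real MLS schema).

```python
import gzip
import json

def write_manifest_incrementally(entries, out_path):
    """Stream each manifest entry to a gzipped JSONL file as soon as it is
    produced, instead of accumulating everything in a Python list first."""
    with gzip.open(out_path, "wt", encoding="utf-8") as writer:
        for entry in entries:
            writer.write(json.dumps(entry) + "\n")

# Toy usage: pretend each dict is one recording's manifest entry.
entries = ({"id": f"utt-{i}", "duration": 1.0} for i in range(3))
write_manifest_incrementally(entries, "recordings.jsonl.gz")
```

The key point is that `entries` can be a generator: nothing forces the ten-million-file MLS train set to be materialized in memory before any output is written.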

pzelasko commented 6 days ago

But yeah, generally expect it to take a while, as English MLS is quite sizeable. It may be possible to implement it differently to accommodate distributed compute environments and speed things up, e.g. by processing one directory per worker and writing to chunks instead of a single manifest.
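The chunked, directory-per-worker idea above can be sketched as follows; this is a hypothetical outline, not the actual recipe code, and a real implementation would also read audio metadata (duration, sampling rate) for each file. The loop is written sequentially here, but each iteration is independent, so on a cluster every directory could be dispatched to a separate worker (e.g. one job-array task per directory), after which the chunks can be concatenated or loaded lazily.

```python
import gzip
import json
from pathlib import Path

def process_directory(dir_path, chunk_path):
    """One worker's share: scan a single directory and write its entries
    to a worker-local manifest chunk instead of one giant shared file."""
    with gzip.open(chunk_path, "wt", encoding="utf-8") as writer:
        for flac in sorted(Path(dir_path).glob("*.flac")):
            # A real recipe would also read duration / sampling rate here.
            writer.write(json.dumps({"id": flac.stem}) + "\n")

def prepare_in_chunks(corpus_dir, out_dir):
    """Split manifest preparation into one independent chunk per directory."""
    corpus_dir, out_dir = Path(corpus_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    for i, d in enumerate(sorted(p for p in corpus_dir.iterdir() if p.is_dir())):
        chunk = out_dir / f"recordings.{i:06d}.jsonl.gz"
        process_directory(d, chunk)  # independent: could run on any worker
        chunks.append(chunk)
    return chunks
```

Besides parallelism, this also gives you resumability: if the job dies after 60 hours, already-written chunks can be skipped on restart instead of rescanning all 2.4 TB from the beginning.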