lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0
944 stars 216 forks

Lhotse Manifest Preparation Stuck and Incomplete for MLS English Train Set #1403

Open mubtasimahasan opened 1 week ago

mubtasimahasan commented 1 week ago

I am attempting to prepare the Multilingual LibriSpeech (MLS) dataset using the lhotse.recipes.mls recipe:

lhotse prepare mls $corpus_dir $output_dir --flac --num-jobs 40

After running this command for more than 72 hours, the process seems to be stuck. I can see the following files in the $output_dir:

However, the following files are missing:

The output log shows the process stuck at:

Scanning audio files (*.flac): 10807259it [15:50, 7377.79it/s]

The output has looked like this since the very beginning, and there doesn't seem to be any further progress.

Questions:

  1. How can I resolve this issue?
    The command appears to be hanging when scanning the train set. Could this be a bug or an issue with handling large datasets?

  2. Is my use of an HDD causing slow processing?
    I am using an HDD for storage, and the train set of the mls_english subset is 2.4 TB in size. Could the HDD's performance be causing the extreme slowness?

  3. Is there a way to speed up manifest preparation for large datasets?
    Are there optimizations or alternative approaches I could try to handle the manifest preparation more efficiently for such a large dataset?

Any guidance on these issues would be greatly appreciated! Thank you for your help.

pzelasko commented 6 days ago

The MLS recipe was the first one we added for very large datasets, and it's implemented less efficiently than others. You'd need to modify it to use incremental manifest writers so it avoids blowing up CPU memory. See how it's done in the GigaSpeech recipe, for example: https://github.com/lhotse-speech/lhotse/blob/a30720b8329676a92ced850d941d45a352df5bb7/lhotse/recipes/gigaspeech.py#L96-L102
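To illustrate the incremental-writer idea without requiring lhotse installed, here is a minimal sketch in plain Python: each manifest entry is streamed straight to a gzipped JSONL file as it is produced, so peak memory stays roughly constant no matter how large the corpus is (lhotse manifests expose an `open_writer()` helper that serves the same purpose; the entry dicts and paths below are toy stand-ins, not the real MLS schema).

```python
import gzip
import json

def write_manifest_incrementally(entries, out_path):
    """Stream each manifest entry to a gzipped JSONL file as soon as it is
    produced, instead of accumulating everything in a Python list first."""
    with gzip.open(out_path, "wt", encoding="utf-8") as writer:
        for entry in entries:
            writer.write(json.dumps(entry) + "\n")

# Toy usage: pretend each dict is one recording's manifest entry.
entries = ({"id": f"utt-{i}", "duration": 1.0} for i in range(3))
write_manifest_incrementally(entries, "recordings.jsonl.gz")
```

The key point is that `entries` can be a generator: nothing forces the ten-million-file MLS train set to be materialized in memory before any output is written.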

pzelasko commented 6 days ago

But yeah, generally expect it to take a while, as English MLS is quite sizeable. It may be possible to implement it differently to accommodate distributed compute environments and speed things up, e.g. by processing one directory per worker and writing to chunks instead of a single manifest.
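The chunked, directory-per-worker idea above can be sketched as follows; this is a hypothetical outline, not the actual recipe code, and a real implementation would also read audio metadata (duration, sampling rate) for each file. The loop is written sequentially here, but each iteration is independent, so on a cluster every directory could be dispatched to a separate worker (e.g. one job-array task per directory), after which the chunks can be concatenated or loaded lazily.

```python
import gzip
import json
from pathlib import Path

def process_directory(dir_path, chunk_path):
    """One worker's share: scan a single directory and write its entries
    to a worker-local manifest chunk instead of one giant shared file."""
    with gzip.open(chunk_path, "wt", encoding="utf-8") as writer:
        for flac in sorted(Path(dir_path).glob("*.flac")):
            # A real recipe would also read duration / sampling rate here.
            writer.write(json.dumps({"id": flac.stem}) + "\n")

def prepare_in_chunks(corpus_dir, out_dir):
    """Split manifest preparation into one independent chunk per directory."""
    corpus_dir, out_dir = Path(corpus_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    for i, d in enumerate(sorted(p for p in corpus_dir.iterdir() if p.is_dir())):
        chunk = out_dir / f"recordings.{i:06d}.jsonl.gz"
        process_directory(d, chunk)  # independent: could run on any worker
        chunks.append(chunk)
    return chunks
```

Besides parallelism, this also gives you resumability: if the job dies after 60 hours, already-written chunks can be skipped on restart instead of rescanning all 2.4 TB from the beginning.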