lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Feature extraction is slow because of slow job submittal #1215

Closed RuABraun closed 8 months ago

RuABraun commented 8 months ago

I'm calling compute_and_store_features() with a SLURM executor, and by default it runs very slowly because the jobs take a long time (many minutes) to get submitted.

If I change this line to

cut_sets = self.split(num_jobs)

the job submission is instant.

I would expect the existing implementation to be slower, since it iterates over the entire original cutset num_jobs times (rather than just once), but not orders of magnitude slower. I'm wondering if there's something I'm missing, and whether we could update the code to make it faster (I'm willing to make a PR, and I'm open to another approach).
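
To make the cost difference concrete, here is a minimal pure-Python sketch. It does not use lhotse; strided_shard and split_once are hypothetical stand-ins that assume the slow path gives each job a lazy, strided view over the full input, while the fast path cuts the input into chunks once. The counter shows how many items get visited when every job walks the whole sequence.

```python
def strided_shard(items, k, n, counter):
    """Yield every n-th item starting at offset k.

    Hypothetical stand-in for a lazy per-job slice: it has to walk
    the whole input to find its own items.
    """
    for idx, item in enumerate(items):
        counter[0] += 1  # count every item this job had to look at
        if idx % n == k:
            yield item


def split_once(items, n):
    """Read the input once and cut it into n contiguous chunks."""
    items = list(items)
    size, rem = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


cuts = list(range(10))  # placeholder for the items of a cut set
num_jobs = 4

visited = [0]
shards = [list(strided_shard(cuts, k, num_jobs, visited)) for k in range(num_jobs)]
chunks = split_once(cuts, num_jobs)
```

In this sketch the strided version visits len(cuts) * num_jobs items in total, while split_once reads the input a single time. That accounts for a factor of num_jobs, so a slowdown far beyond that would have to come from somewhere else, for example the per-item cost of reading a lazy manifest.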

pzelasko commented 8 months ago

I remember it used split() in the past, but there was some issue with it; I just can't remember what it was. I think the best approach might be to split the cutset first, run a SLURM job array that processes each chunk separately, and then re-combine the results.
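
A rough sketch of that split, job array, re-combine flow, in plain Python rather than lhotse calls. The helper names (write_chunks, run_array_task, combine), the chunk_N.jsonl and feats_N.jsonl file names, and the extract placeholder are all assumptions made for illustration. The only SLURM-specific piece is the SLURM_ARRAY_TASK_ID environment variable, which SLURM sets for each task of a job array submitted with sbatch --array.

```python
import os
from pathlib import Path


def write_chunks(lines, num_jobs, out_dir):
    """Split manifest lines into num_jobs contiguous chunk files, once, up front."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    size, rem = divmod(len(lines), num_jobs)
    start = 0
    for job in range(num_jobs):
        end = start + size + (1 if job < rem else 0)
        text = "".join(f"{line}\n" for line in lines[start:end])
        (out_dir / f"chunk_{job}.jsonl").write_text(text)
        start = end


def run_array_task(out_dir, extract=str.upper):
    """Process only the chunk selected by SLURM_ARRAY_TASK_ID.

    `extract` is a placeholder for the real feature extraction step.
    """
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    out_dir = Path(out_dir)
    lines = (out_dir / f"chunk_{task_id}.jsonl").read_text().splitlines()
    text = "".join(f"{extract(line)}\n" for line in lines)
    (out_dir / f"feats_{task_id}.jsonl").write_text(text)


def combine(out_dir, num_jobs):
    """Concatenate the per-task outputs in task order."""
    out_dir = Path(out_dir)
    result = []
    for job in range(num_jobs):
        result.extend((out_dir / f"feats_{job}.jsonl").read_text().splitlines())
    return result
```

With this layout, write_chunks would run once before submission, each array task would call run_array_task, and combine would run after all tasks finish. Each task only ever reads its own chunk file, so nothing iterates over the full manifest per job.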

RuABraun commented 8 months ago

Yeah, fair. It's nicer to run a job array, so I'll do that.

One issue I noticed with split() is that if you have more jobs than files it will crash, whereas LazySlicer didn't have that issue.
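
One way to sidestep that crash on the caller's side is to cap the number of jobs at the number of items before splitting. A minimal sketch, where safe_num_jobs is a hypothetical helper and not part of lhotse:

```python
def safe_num_jobs(num_items, num_jobs):
    """Cap the job count so every job gets at least one item.

    Hypothetical helper: call it before splitting when num_jobs
    may exceed the number of items.
    """
    if num_items < 1:
        raise ValueError("nothing to split")
    return max(1, min(num_jobs, num_items))


# 3 files but 8 requested jobs: only 3 jobs are worth submitting
jobs_for_small_set = safe_num_jobs(3, 8)
# 10 files and 4 requested jobs: the request is kept as-is
jobs_for_large_set = safe_num_jobs(10, 4)
```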