lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Feature calculation process crashing with large dataset #1364

Open duhtapioca opened 1 week ago

duhtapioca commented 1 week ago

Hi,

We're trying to compute features for a ~20k-hour dataset using the KaldifeatFbank extractor, and the process keeps crashing whenever num_workers is higher than 0. This is the code we're trying to run:

from lhotse.features.kaldifeat import KaldifeatFbank, KaldifeatFbankConfig, KaldifeatFrameOptions

extractor = KaldifeatFbank(
    KaldifeatFbankConfig(device="cuda", frame_opts=KaldifeatFrameOptions(sampling_rate=8000))
)

# cuts_train is a CutSet that was loaded earlier
cuts_train = cuts_train.compute_and_store_features_batch(
    extractor=extractor,
    storage_path="/temp_ssd/icefall/egs/librispeech/ASR/data/features_kaldifeatfbank/",
    num_workers=46,
)

The initial error was RuntimeError: received 0 items of ancdata. After going through the issues https://github.com/k2-fsa/icefall/issues/515 and https://github.com/pytorch/pytorch/issues/973, we tried two of the suggested solutions, neither of which worked:

  1. Increasing the ulimit soft and hard limits to 1024000, which got rid of the initial error, but the process later crashed with RuntimeError: unable to mmap 111360 bytes from file <filename not specified>: Cannot allocate memory (12).

  2. Increasing the ulimit and setting torch.multiprocessing.set_sharing_strategy('file_system'), which also crashed with a similar error: RuntimeError: unable to mmap 202944 bytes from file </torch_57144_2565496321_1982>: Cannot allocate memory (12). (Roughly what we did is sketched below.)
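
For reference, the second attempt looked roughly like this (a sketch only; the limit values were also raised at the shell level and via limits.conf):

import resource
import torch.multiprocessing

# Raise the soft open-file limit up to the hard limit for this process
# (the hard limit itself was raised separately at the OS level).
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard_limit, hard_limit))

# Share tensors through the filesystem instead of passing file descriptors
# over sockets, as suggested in the linked issues.
torch.multiprocessing.set_sharing_strategy("file_system")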

These errors suggest that memory is leaking somewhere, leading to steadily growing RAM usage and the eventual crash.

With num_workers=0, it took 18 hours to compute features for 14 million files with no memory leaks; with num_workers=46, it took 8 hours to compute features for 12 million files, but with memory leaks.

How do we avoid this memory growth and the crash? Changing batch_duration doesn't seem to affect the VRAM usage. According to this comment, we are supposed to set the sharing strategy for each worker via a worker_init_fn - how do we go about doing that in this case? Also, this comment seems to point to the root cause of the issue. Is there something obvious we're missing in our approach to large datasets?
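
To be concrete, this is the kind of worker_init_fn we mean; what we don't know is how to attach it to the DataLoader that compute_and_store_features_batch creates internally:

import torch.multiprocessing

def set_sharing_strategy(worker_id: int) -> None:
    # Each DataLoader worker is a separate process, so the sharing strategy
    # has to be set again inside the worker, not just in the parent process.
    torch.multiprocessing.set_sharing_strategy("file_system")

# Plain-PyTorch illustration of where the hook would normally go:
# loader = torch.utils.data.DataLoader(
#     dataset, num_workers=46, worker_init_fn=set_sharing_strategy
# )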

Any suggestions or guidance regarding this would be of great help.

Thank you!

pzelasko commented 1 week ago

I can't see any obvious part of the code that would cause memory leaks. We are not storing the objects returned from the dataloader - the results are processed and submitted to a file writing thread. Perhaps with so many workers, the file writing thread queue is growing indefinitely?

You could try to add a hack which replaces

                futures.append(executor.submit(_save_worker, cuts, features))

with

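                # Throttle: block until the writer thread's backlog drains below
                # ~1000 pending items. Note this peeks at ThreadPoolExecutor's
                # private _work_queue attribute and needs `time` imported there.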
                while executor._work_queue.qsize() > 1000:
                    time.sleep(0.1)
                futures.append(executor.submit(_save_worker, cuts, features))

to let the writer thread "catch up". If this helps, we can either:

1) refactor to implement a proper sized queue,
2) remove the writer thread altogether and accept the inefficiency, or
3) refactor to use a process pool executor for writing and assign more workers to it.
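
For option 1, a rough sketch could be to bound the number of in-flight writes with a semaphore (MAX_IN_FLIGHT and _submit_save are placeholder names; _save_worker, cuts, and features are from the snippet above):

import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 1000  # placeholder bound on queued writer tasks

executor = ThreadPoolExecutor(max_workers=1)  # the single writer thread
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def _submit_save(cuts, features):
    # Block here (instead of letting the queue grow) when the writer falls behind.
    _slots.acquire()
    future = executor.submit(_save_worker, cuts, features)
    # Free the slot once the write completes, whether it succeeded or failed.
    future.add_done_callback(lambda _: _slots.release())
    return future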

If this doesn't help, another option is to go into paranoia mode and explicitly delete the cuts, audio, and feature tensors with del after they are processed.
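
Purely as an illustration (the loop shape and process_batches are invented, not the actual lhotse internals; the submit line is from the snippet above):

import gc

for cuts, audio, features in process_batches():  # invented iterator
    futures.append(executor.submit(_save_worker, cuts, features))
    # Drop our local references right after handing off to the writer,
    # so the shared-memory tensors behind them can be reclaimed sooner.
    del cuts, audio, features
    gc.collect()  # heavy-handed; only worth keeping if plain del isn't enough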