lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Experiencing memory leakage with MixedCut.load_audio #1333

Closed hereismohsen closed 1 month ago

hereismohsen commented 1 month ago

For debugging purposes, I'm working with a very small dataset (180 training samples and 30 validation samples, all of which are below 30 s). I load my manifest lazily and want to do processing on the fly. During training, my RAM (68.7 GB) fills up until the program hits an OOM error; it cannot even fully train on a single batch. I tried to find the source of this issue and it seems to be related to the `MixedCut.load_audio()` function (just loading the cut, without extracting features) or any function that uses it. When I comment out the `MixedCut.load_audio()` calls in `__getitem__()` of my dataset, the problem goes away. I tried different samplers (`DynamicBucketingSampler` and `SimpleCutSampler`) and the issue was still there. Decreasing `num_workers` reduces the memory growth rate a little, but given the size of my dataset, the memory growth (at least 20 to 30 GB) should not occur at all.

hereismohsen commented 1 month ago

I think I found out what the problem was. I'd written a custom transform class like `CutReverb`. In this class I'd put this line, and my memory profile showed the leak:

`temp_cut = temp_cut.perturb_speed(rng.uniform(0.9, 1.1))`

When I changed the line to the following, the problem was solved:

`temp_cut = temp_cut.perturb_speed(rng.choice([0.9, 1.1]))`

I don't know why perturbing a cut's speed with arbitrary high-precision floats (as opposed to a few fixed values like 1.1) takes this much memory, or whether it is fixable.
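The growth pattern can be reproduced with a minimal sketch: a hypothetical per-factor cache (a stand-in for lhotse's real kernel cache, not its actual code) keyed by the speed factor. With `rng.uniform()` practically every draw is a new key, so the cache grows without bound; with `rng.choice()` it stays at two entries.

```python
import random

# Hypothetical per-factor cache, standing in for the real kernel cache.
kernel_cache = {}

def get_kernel(factor):
    # Every distinct float becomes a new cache key that is never evicted;
    # the list stands in for the actual resampling coefficients.
    if factor not in kernel_cache:
        kernel_cache[factor] = [0.0] * 1000
    return kernel_cache[factor]

rng = random.Random(0)

# Continuous factors: practically every draw is a new key.
for _ in range(10_000):
    get_kernel(rng.uniform(0.9, 1.1))
print(len(kernel_cache))  # grows with nearly every call

kernel_cache.clear()

# Discrete factors: the cache never exceeds two entries.
for _ in range(10_000):
    get_kernel(rng.choice([0.9, 1.1]))
print(len(kernel_cache))  # at most 2 entries
```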

pzelasko commented 1 month ago

Speed perturbation uses resampling. To optimize resampling, we cache the resampling kernel coefficients so they can be re-used across invocations. If you provide a different speed-perturb factor every time, a new resampling kernel has to be computed for each one. https://github.com/lhotse-speech/lhotse/blob/4f014b13202c724d484e0471343053a261487b8a/lhotse/augmentation/torchaudio.py#L69-L83
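The caching pattern described above can be sketched with `functools.lru_cache` (this is an illustration of the idea, not lhotse's actual implementation): memoizing the kernel computation per factor means repeated factors are cheap, but an unbounded cache keeps one entry per distinct float forever.

```python
from functools import lru_cache

# Sketch of per-factor kernel caching (hypothetical, not lhotse's code).
@lru_cache(maxsize=None)  # unbounded, like a plain dict cache
def resampling_kernel(factor: float):
    # Stand-in for the real kernel-coefficient computation.
    return tuple(round(factor * i, 6) for i in range(16))

# Repeated calls with the same factor reuse the cached kernel...
resampling_kernel(1.1)
resampling_kernel(1.1)
print(resampling_kernel.cache_info().currsize)  # → 1

# ...but every new float adds an entry that is never evicted.
resampling_kernel(1.0999999)
print(resampling_kernel.cache_info().currsize)  # → 2
```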

If you want to keep a randomized speed-perturb factor, you'd have to disable the caching (there is no option to disable it right now).
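Since the caching cannot be disabled, one workaround is to keep the randomness but snap the sampled factor to a small fixed grid, so the cache only ever sees a handful of distinct values. A minimal sketch (`GRID` and `random_speed_factor` are hypothetical names, not lhotse API):

```python
import random

# Hypothetical workaround: quantize the random speed factor to a small grid
# so the per-factor kernel cache stays bounded.
GRID = [round(0.90 + 0.02 * i, 2) for i in range(11)]  # 0.90, 0.92, ..., 1.10

rng = random.Random()

def random_speed_factor():
    # Draw continuously, then pick the closest grid point.
    draw = rng.uniform(0.9, 1.1)
    return min(GRID, key=lambda g: abs(g - draw))

# Usage in a transform, e.g.:
# temp_cut = temp_cut.perturb_speed(random_speed_factor())
```

With 11 grid points the cache holds at most 11 kernels, while the augmentation still varies from sample to sample.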

As a side note, resampling with very large ratios between the source and target sampling rates is difficult, as the kernels may become very large.