kwatcharasupat / bandit

BandIt: Cinematic Audio Source Separation
Apache License 2.0

Memory usage #5

Open · gadese opened this issue 4 months ago

gadese commented 4 months ago

Hi! First off, thanks a lot for your work; your model is great at separating voices from music and ambient noise.

I was wondering how you measured the peak memory usage for your metrics. What kind of audio did you use? I'm mostly asking because I'm seeing much higher GPU memory consumption than you report.

If I run the dnr-3s-mus64-l1snd-plus model on a relatively small audio file (less than 25 MB), I get a CUDA OOM error on a GPU with 12 GB of VRAM.

I've also tried running a 0.5 GB audio file on a cloud machine with a 24 GB GPU and 64 GB of RAM, but I got this error:

RuntimeError: [enforce fail at alloc_cpu.cpp:83] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 11637108000 bytes. Error code 12 (Cannot allocate memory)

which seems a bit crazy to me considering the peak memory usage you showcase.

kwatcharasupat commented 4 months ago

Could you check the sample rate of your file? If it's not 44.1 kHz, this might have triggered a resampler. That's okay for a short file, but it's probably better to resample with ffmpeg beforehand if your file is that big.
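For reference, a one-off resample with ffmpeg would be something like this (file names are placeholders):

```bash
ffmpeg -i input.wav -ar 44100 input_44k.wav
```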

kwatcharasupat commented 4 months ago

For the shorter audio file, you might want to adjust the "batch size" value. Individual chunks don't use that much memory, so that config file was set to maximize how much I could fit into a 24 GB GPU. Halving it would probably do it.

gadese commented 4 months ago

Changing the inference batch_size did help lower the amount of GPU memory used, thanks.

For the CPU issue: my file is 44.1 kHz, so resampling shouldn't be the problem. From what I can tell, memory usage spikes when the data is preprocessed in the BaseFader.forward() method.

I assume this is because all the preprocessed audio chunks are kept in memory rather than passed one batch at a time via a generator. With a 0.5 GB input file (around 1 h 30 min), I'm seeing memory consumption of up to 55 GB, so this seems to be a big limitation of the current implementation.

I'm not sure I'll have time to open an MR for this soon, but I'll let you know if I do (unless you update it yourself).

EDIT:

This code spikes CPU memory from 4 GB to 25 GB:

```python
chunks_in = [
    unfolded_input[
        b * self.batch_size:(b + 1) * self.batch_size, ...
    ].clone()
    for b in range(n_batch)
]
```

and then this code spikes it from 25 GB to 60 GB:

```python
chunks_out = model_fn(cin.to(original_device))
del cin
for s, c in chunks_out.items():
    all_chunks_out[s][
        b * self.batch_size:(b + 1) * self.batch_size, ...
    ] = c.cpu()
```
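For reference, a rough sketch of the generator-style approach I mentioned (names are taken from the snippets above; not tested against the actual repo):

```python
def iter_chunks(unfolded_input, batch_size, n_batch):
    # Yield one batch of chunk views at a time instead of cloning every chunk up front.
    for b in range(n_batch):
        yield b, unfolded_input[b * batch_size:(b + 1) * batch_size, ...]


for b, cin in iter_chunks(unfolded_input, self.batch_size, n_batch):
    chunks_out = model_fn(cin.to(original_device))
    for s, c in chunks_out.items():
        all_chunks_out[s][b * self.batch_size:(b + 1) * self.batch_size, ...] = c.cpu()
    del chunks_out
```

This would avoid the first full copy of the chunked input; the per-stem output buffers in all_chunks_out would still be held in RAM, though.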

kwatcharasupat commented 4 months ago

Yep, you're right. It's a big limitation / bad practice that I hacked together and didn't have time to fix.

Because of the overlap-add (OLA), we probably have to look into offloading things into temporary files.

It would be great if I could read the file in chunks, but that's a bit of a pain to do in PyTorch, especially if the files are not uncompressed. I guess memmapping (https://pytorch.org/tensordict/stable/reference/generated/tensordict.MemoryMappedTensor.html#tensordict.MemoryMappedTensor) is an option. That's still one entire audio file's worth of temp storage on the input side. We could discard the earlier parts as they're processed, but during the initial load we probably need to read the entire file (mostly for the developers' sanity). It's possible to be more memory-efficient, but I would rather leave audio decoding alone as much as possible.

On the output side, we are going to need another temp buffer per stem, each the same size as the uncompressed input, to do the OLA.

In total, that's at least num_stem + 1 uncompressed inputs' worth of temp storage for any given file. I guess that's fine for most modern machines. With some optimization we could probably get down closer to num_stem by dropping the input temp.
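Very roughly, the output side could look something like this, sketched with plain numpy memmaps instead of the tensordict MemoryMappedTensor linked above (all shapes, paths, and names are made up for illustration; this is not the current Bandit code):

```python
import numpy as np

# Hypothetical setup: stereo output, ~1.5 h at 44.1 kHz, three stems.
n_channels = 2
n_samples = 44100 * 60 * 90
stems = ["speech", "music", "effects"]

# One disk-backed accumulator per stem, plus a shared weight buffer for OLA normalization.
out_buffers = {
    s: np.memmap(f"/tmp/bandit_{s}.f32", dtype=np.float32, mode="w+",
                 shape=(n_channels, n_samples))
    for s in stems
}
weight = np.memmap("/tmp/bandit_ola_weight.f32", dtype=np.float32, mode="w+",
                   shape=(n_samples,))


def ola_add(chunk_out, window, start):
    """Accumulate one windowed chunk of model output at sample offset `start`.

    `chunk_out` maps stem name -> torch tensor of shape (n_channels, chunk_len);
    `window` is a torch tensor of shape (chunk_len,).
    """
    end = start + window.shape[-1]
    win = window.cpu().numpy()
    for s, c in chunk_out.items():
        out_buffers[s][:, start:end] += c.cpu().numpy() * win
    weight[start:end] += win

# After all chunks are written, divide each stem by the accumulated window weight
# before encoding to disk, e.g.:
#   out_buffers[s][:] /= np.maximum(weight, 1e-8)
```

That would keep peak RAM at roughly one batch of chunks, with the num_stem + 1 full-length buffers living on disk instead of in memory.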

Please feel free to make a PR or suggestions on this. I'm contemplating whether this is something that should be developed semi-independently of Bandit, since it would be useful for any source separation model that does chunked inference...

kwatcharasupat commented 4 months ago

@cwu307 @ruohoruotsi please feel free to suggest if you know of a better way too! It's basically batched OLA processing, but with a much more complicated centerpiece than in a usual DSP system.