OleguerCanal opened 2 years ago
When the audio input length is long, the memory seems to explode.
@sooftware so is it designed in such a way that the input length increases within an epoch?
That's not it. Perhaps every time memory is allocated on the GPU, the amount held in the cache increases, and it keeps growing until the memory explodes. (I think)
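If that's the suspicion, one way to check it is to compare what PyTorch has actually allocated against what its caching allocator is holding. A minimal sketch using plain PyTorch (nothing from this repo):

```python
import torch

def log_gpu_memory(step):
    # Memory actively used by live tensors
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    # Memory held by PyTorch's caching allocator (live tensors + cache)
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"step {step}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

# Call this every few training steps. If `reserved` keeps growing while
# `allocated` stays flat, the caching allocator is holding on to cache,
# not leaking live tensors. torch.cuda.empty_cache() releases that cache
# back to the driver (it does not free live tensors).
```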
Any ideas on how to solve it?
same problem encountered, any ideas to solve it?
Hi, not sure if it's really a memory leak, as the audio batches can have different lengths during training. In my case the GPU memory usage was increasing and then decreasing during training.
Have you tried decreasing the batch size in order to leave some room for batches with longer sequences?
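Another option is to bucket utterances of similar length so that the padded size of each batch stays predictable. A rough sketch of such a sampler (my own, not something this repo or Lightning ships):

```python
import random
from torch.utils.data import Sampler

class BucketBatchSampler(Sampler):
    """Groups samples of similar length so padded batch sizes stay predictable."""
    def __init__(self, lengths, batch_size):
        # Sort indices by audio length, then cut into contiguous batches
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches = [order[i:i + batch_size]
                        for i in range(0, len(order), batch_size)]

    def __iter__(self):
        random.shuffle(self.batches)  # shuffle batches, not individual samples
        yield from self.batches

    def __len__(self):
        return len(self.batches)

# Usage (with your own `lengths` list and padding collate function):
# loader = DataLoader(dataset, batch_sampler=BucketBatchSampler(lengths, 8),
#                     collate_fn=pad_collate)
```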
I think @virgile-blg is right, although it is a bit weird that it keeps increasing over the first epochs and only then becomes more stable.
One option is to disable PyTorch Lightning's auto_scale_batch_size. When it is set to False, there is no OOM error during the first epoch. My guess is that the batch-size scaler picks the batch size without probing the longest sequences in the training set, so the batch size it settles on is too large for the longest batches.
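For reference, this is roughly how the flag looked in the (pre-2.0) Lightning API I was using; treat it as a sketch, not the exact config of this repo:

```python
import pytorch_lightning as pl

# Batch-size finder off: use whatever batch size the DataLoader provides.
trainer = pl.Trainer(auto_scale_batch_size=False)

# With it enabled, trainer.tune() searches for the largest batch size that
# fits, but the probe batches may not contain the dataset's longest
# sequences, so the result can still OOM later in the epoch:
# trainer = pl.Trainer(auto_scale_batch_size="binsearch")
# trainer.tune(model)
```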
❓ Questions & Help
I am training using the random sampler and the ddp_sharded strategy; however, after some training steps I get a CUDA out-of-memory error.
Details
I am training on a SLURM-managed cluster using 2 nodes with 2 Tesla M60 (8 GB) GPUs each. As I understand it, if the model doesn't fit in a single GPU, pytorch-lightning will automatically shard it across the others.
As expected, the larger I make the batch size, the sooner the error occurs. What I find strange is that it takes a few iterations to crash; if the model and the batch didn't fit, wouldn't it crash immediately?
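A sanity check I could run (a sketch with hypothetical names, assuming `training_step` returns the loss) is to push the longest batch through first, so an OOM would show up immediately rather than mid-epoch:

```python
import torch
from torch.utils.data import DataLoader, Subset

# Hypothetical names: `dataset`, `lengths` (per-sample audio length),
# `pad_collate` (padding collate fn), `model` (LightningModule), `batch_size`.
worst = sorted(range(len(dataset)), key=lambda i: -lengths[i])[:batch_size]
loader = DataLoader(Subset(dataset, worst), batch_size=batch_size,
                    collate_fn=pad_collate)

loss = model.training_step(next(iter(loader)), 0)  # worst-case forward pass
loss.backward()                                    # and backward
print(f"peak: {torch.cuda.max_memory_allocated() / 1024 ** 2:.0f} MiB")
```

If the worst-case batch fits, later (shorter) batches should too, modulo allocator fragmentation.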
Here I attach a picture of the memory usage:
And these are the parameters I'm using:
Thanks a lot, guys!