mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Increasing training VRAM doesn't allow for bigger segmentation models #535

Closed: lamaeldo closed this issue 11 months ago

lamaeldo commented 11 months ago

Until recently I was training segmentation models on my personal machine (32 GB RAM, 12 GB VRAM) and had to adapt my architectures and batch size to avoid OOM errors. I've recently gotten access to an HPC cluster (nodes have 120 GB RAM, 80 GB VRAM) and assumed that this would allow me to scale up my training parameters and architectures.

My base architecture (which fits within my personal machine's 12 GB of VRAM) is:

`[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32 Cr3,3,32 Gn32 Cr3,3,64 Gn32]`

Increasing the input size to 1500, removing the first dilation, or increasing my batch size from 1 to 4+ all cause OOM during the first epoch. I'm not very clear on the training process, but I assume that multiplying my VRAM by 6 should allow me to do this without even coming close to saturating the GPU's memory. Is that the case? If so, any chance this could be a memory leak in Kraken? Or did I misconfigure my training VM?

I have tweaked and tested with many different package versions, to no avail. Some versions:

- Kraken 4.3.13
- Torch 2.0.1+cu11.7
- Torchvision 0.13.1
- CUDA 12.0 (unfortunately I can't change this to match Torch, but Kraken works fine with 12.0 on my personal machine)

Any opinion or tips?
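To make the question concrete, here is a minimal sketch of how one could measure what a single step actually allocates (plain PyTorch calls; `run_one_step` is a hypothetical stand-in for one forward/backward pass, not a kraken function):

```python
import torch

# Rough sketch (not kraken's API): measure how much CUDA memory a single
# training step actually needs. `run_one_step` is a hypothetical callable
# standing in for one forward/backward pass of the segmentation trainer.
def peak_memory_of_step(run_one_step, device: int = 0) -> float:
    torch.cuda.reset_peak_memory_stats(device)
    run_one_step()
    torch.cuda.synchronize(device)
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
    total_gib = torch.cuda.get_device_properties(device).total_memory / 2**30
    print(f"peak allocation for this step: {peak_gib:.2f} GiB of {total_gib:.1f} GiB")
    return peak_gib
```

Comparing that peak between the two machines would show how much headroom is really left before the OOM appears.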

mittagessen commented 11 months ago

Are you getting CUDA, main memory, or shared memory OOMs? By and large you should be able to scale up the architecture until all the VRAM is gobbled up, but sometimes there are other system limits that would prevent that (especially on an HPC cluster).

lamaeldo commented 11 months ago

Sorry, I'm getting CUDA OOMs. In some cases the issue is caused by Torch trying to grab obscene amounts of memory at once (25 GB when it already has 60). I tried to look around to see how to reduce the increments of memory it reserves at once, but with no success. In any case, this also occurs when it tries to request 1 GB on top of the 79 GB it already has, so it is surely not the only issue at play. Could you expand on the "other system limits" you are referring to?
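(For reference, the allocator knob in question appears to be `PYTORCH_CUDA_ALLOC_CONF`; a rough sketch of setting it from Python, with an arbitrary example value:)

```python
import os

# Sketch: configure the PyTorch caching allocator before anything touches the
# GPU. max_split_size_mb prevents the allocator from splitting cached blocks
# larger than this size, which can help with fragmentation-driven OOMs.
# The value 128 is an arbitrary example, not a recommendation. The variable
# must be set before the first CUDA allocation (or exported in the shell
# before launching training).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  (imported after setting the env var on purpose)
```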

mittagessen commented 11 months ago

Those would be whatever the cluster admins set up. Most frequently those are shared memory, CPU cycles, storage, things like that.
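If you want to check what the admins actually set for your jobs, the standard-library `resource` module shows the per-process limits (Linux-only; just a sketch, and the limits listed are only examples):

```python
import resource

# Rough sketch: print a few per-process limits from inside a cluster job.
# The selection of limits here is illustrative, not exhaustive.
for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_MEMLOCK", "RLIMIT_NPROC"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f"{name}: soft={soft} hard={hard}")
```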

The pytorch memory allocator is a bit simplistic and doesn't deal well with fragmentation of its memory pool, instead just increasing it willy-nilly if something doesn't fit in an existing free area. I guess it runs for a couple of iterations and then crashes? Or does it just try to immediately allocate all this space and gets killed?
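To see how bloated the pool gets before the crash, something like this (just a sketch with plain PyTorch calls) run every few iterations shows the gap between what the allocator has reserved and what live tensors actually use:

```python
import torch

# Sketch of a crude fragmentation check: the gap between what the caching
# allocator has reserved from the driver and what live tensors occupy is pool
# slack; torch.cuda.memory_summary() breaks the pool down in more detail.
def pool_report(device: int = 0) -> None:
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"held by live tensors : {allocated / 2**30:.2f} GiB")
    print(f"reserved by allocator: {reserved / 2**30:.2f} GiB")
    print(f"pool slack           : {(reserved - allocated) / 2**30:.2f} GiB")
    print(torch.cuda.memory_summary(device, abbreviated=True))
```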

lamaeldo commented 11 months ago

I will investigate potential resource limits on the nodes, thanks for the advice! Yes, training produces an OOM usually during the first, sometimes second epoch (it doesn't crash at the very beginning of training). I had no clue that pytorch would manage memory so randomly, but even then, that it can request an additional 25 GB at once seems crazy to me.

lamaeldo commented 11 months ago

By re-adding a dilation that I had removed from my first convolutional layer (#527), I can scale my architecture up much more (I can more than double the input size or the batch size), although not in the proportion I would expect going from an RTX 3060 to a P100. I suppose this could be caused by other system limits, and expecting a linear scale-up simply on the basis of GPU memory is unrealistic. As this solves my current problem I will close this issue, but if you think there could be a real issue behind this and want me to run any experiments, do let me know. Thanks for the help!
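For anyone landing here later, a rough sketch of one way to probe how far the batch size can be pushed before the allocator gives up (`try_one_step` is a hypothetical stand-in for building the model and running one forward/backward pass at a given batch size; it is not a kraken API):

```python
import torch

# Rough sketch: double the batch size until the caching allocator raises an
# OOM, and report the largest size that still fit. `try_one_step(batch_size)`
# is a hypothetical callable, not part of kraken.
def largest_fitting_batch(try_one_step, start: int = 1, limit: int = 64) -> int:
    best, bs = 0, start
    while bs <= limit:
        try:
            try_one_step(bs)
            best = bs
            bs *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks so later runs start clean
            break
    return best
```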