lfranke / TRIPS

https://lfranke.github.io/trips/

Training on a GPU <= 8 GB #10

Closed · anslex closed this issue 4 months ago

anslex commented 4 months ago

Hello,

Thank you for your project. I have an 8 GB RTX 4070. By any chance, do you know how to limit memory usage during training at the checkpoint save?

RuntimeError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 7.53 GiB total capacity; 5.31 GiB already allocated; 355.88 MiB free; 5.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[1]    7367 IOT instruction (core dumped)  ./build/bin/train --config configs/train_normalnet.ini  tt_train    1  1  256
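
For context, the error message itself points at the PYTORCH_CUDA_ALLOC_CONF environment variable. In the log above, reserved memory (5.45 GiB) is only slightly above allocated memory (5.31 GiB), so fragmentation may not be the main problem here, but for completeness, here is a minimal sketch of how that knob could be applied from the C++ entry point. It assumes the call runs before libtorch's first CUDA allocation; exporting the variable in the shell before launching ./build/bin/train does the same thing, and the value 128 is only an example.

```cpp
#include <cstdlib>

int main(int argc, char** argv)
{
    // Sketch only: apply the allocator setting suggested by the error message.
    // This must run before the first CUDA allocation; setting the variable in the
    // shell before launching the binary is the more common route. 128 MiB is an
    // arbitrary example value, not a tested recommendation for TRIPS.
    setenv("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128", /*overwrite=*/1);

    // ... the rest of the original train entry point would follow here ...
    return 0;
}
```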
lfranke commented 4 months ago

Hi, just to make sure: did you start training with ./build/bin/train --config configs/train_normalnet.ini --TrainParams.scene_names tt_train --TrainParams.batch_size 1 --TrainParams.inner_batch_size 1 --TrainParams.train_crop_size 256? (Just asking because the error output is missing the argument names.) And are you using the tt_train scene from our supplemental material?

anslex commented 4 months ago

Hi Linus,

Yes, exactly like that. I have also tried not saving images at every checkpoint save and further reducing the crop size, but without success.

I am going to dive into the code to understand why an additional ~508 MiB is being allocated at the checkpoint save.

anslex commented 4 months ago

Hello,

I have found that it is caused by lt_vgg = loss_vgg->forward(x, target); so I have forced want_eval = false; to skip the eval and test epochs.
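
A minimal sketch of where this work-around sits, assuming a training loop roughly shaped like the one in TRIPS; want_eval, loss_vgg, x, and target are the names from this thread, everything else (the function, the NoGradGuard, the VggLoss template parameter) is illustrative:

```cpp
#include <torch/torch.h>
#include <iostream>

// Sketch of the work-around described above, not a verbatim copy of the TRIPS
// sources. VggLoss stands in for the real VGG perceptual loss module.
template <typename VggLoss>
void maybe_run_eval(VggLoss& loss_vgg, const torch::Tensor& x, const torch::Tensor& target)
{
    // Forcing this to false skips the eval/test epochs, which is where the extra
    // ~500 MiB is requested by the VGG forward pass at checkpoint time.
    bool want_eval = false;

    if (want_eval)
    {
        torch::NoGradGuard no_grad;  // evaluation needs no gradients
        torch::Tensor lt_vgg = loss_vgg->forward(x, target);
        std::cout << "eval VGG loss: " << lt_vgg.item<float>() << std::endl;
    }
}
```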

lfranke commented 4 months ago

Hi, good workaround :) This might also apply to you: https://github.com/lfranke/TRIPS/issues/26#issuecomment-1956377052
Not using the VGG loss reduces VRAM usage; however, it will also impact quality.
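
For anyone else on an 8 GB card, a hedged sketch of what dropping the VGG term could look like in the loss computation; only loss_vgg comes from this thread, while lambda_vgg and the L1 term are hypothetical illustration, and the actual TRIPS loss is more involved:

```cpp
#include <torch/torch.h>

// Sketch: skip the memory-hungry VGG perceptual term when its weight is zero.
// lambda_vgg is a hypothetical weight, not an actual TRIPS config parameter.
template <typename VggLoss>
torch::Tensor compute_loss(VggLoss& loss_vgg, const torch::Tensor& x,
                           const torch::Tensor& target, double lambda_vgg)
{
    torch::Tensor loss = torch::l1_loss(x, target);  // cheap photometric term

    if (lambda_vgg > 0.0)
    {
        // Running the VGG forward pass is what allocates the large activation
        // buffers, so skipping it here is where the VRAM saving comes from.
        loss = loss + loss_vgg->forward(x, target) * lambda_vgg;
    }
    return loss;
}
```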