rasmushaugaard / surfemb

SurfEmb (CVPR 2022)
https://surfemb.github.io/
MIT License

2080ti one gpu gives out of memory error while training #19

Closed · smoothumut closed 2 years ago

smoothumut commented 2 years ago

Hi, thanks for your great work. I hope I can get it working and use it on my custom dataset. My problem is this: I have a single 2080 Ti and I am trying to train on the T-LESS PBR dataset, but I get a "CUDA out of memory" error. I have reduced the batch size to 8 and decreased the number of workers to 0, but it keeps giving the error (it now occurs later than before, but it still occurs).

Training only works if I reduce the number of scenes in the train_pbr folder from 50 to 1; otherwise there is no chance.

Is this normal behavior with a single GPU, or am I missing something?

Thanks in advance.

[Screenshot from 2022-06-13 11-00-50]

rasmushaugaard commented 2 years ago

Hi

It shouldn't use that much memory. I'm training on a single RTX 2080 with 8 GB of RAM. Also, I don't see why changing the number of scenes would affect GPU memory usage, so I'm not sure what's going on there. Let me know if you find out.

BR, Rasmus

smoothumut commented 2 years ago

Hi, thanks for the quick response.

I have noticed that there is a memory leak in surface_embedding.py. I suspect it is the logging during training, which increases memory usage every 4-5 seconds. I checked this by watching nvidia-smi during training.
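
For reference, a rough way to confirm this from inside the training loop (just a sketch on my side, not something in the repo) is to print the allocated CUDA memory every few steps:

```python
import torch

# Rough sketch (not part of the repo): print allocated/reserved CUDA memory
# every `every` steps to see whether usage keeps growing over time.
def print_cuda_memory(step, every=100):
    if step % every == 0:
        allocated = torch.cuda.memory_allocated() / 1024 ** 2  # MiB
        reserved = torch.cuda.memory_reserved() / 1024 ** 2  # MiB
        print(f'step {step}: allocated {allocated:.0f} MiB, reserved {reserved:.0f} MiB')
```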

So I have commented out these three lines, and now it even works with batch_size=32 and multiple workers :)
```python
self.log(f'{log_prefix}/loss', loss)
self.log(f'{log_prefix}/mask_loss', mask_loss)
self.log(f'{log_prefix}/nce_loss', nce_loss)
```

I hope this disabled logging won't break things.

Thanks again for the great work!

rasmushaugaard commented 2 years ago

That's weird. Disabling logging won't break anything, but if you want logging, try updating PyTorch Lightning, in case it's a bug in a specific version, or call .item() on the losses before logging them.
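
Roughly like this (a sketch assuming the same loss variables as in surface_embedding.py), so the logger only receives plain Python floats rather than tensors attached to the computation graph:

```python
# Sketch of the .item() suggestion (assuming the loss variables from
# surface_embedding.py): logging plain floats avoids keeping references to
# tensors that are still attached to the computation graph.
self.log(f'{log_prefix}/loss', loss.item())
self.log(f'{log_prefix}/mask_loss', mask_loss.item())
self.log(f'{log_prefix}/nce_loss', nce_loss.item())
```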