lhnguyen102 / cuTAGI

CUDA implementation of Tractable Approximate Gaussian Inference
MIT License

Possible memory leakage ResNet CUDA #71

Closed miquelflorensa closed 2 months ago

miquelflorensa commented 3 months ago

@lhnguyen102 Hi Ha,

I am not sure if this is the intended behavior, but I believe there is a memory leak in the ResNet architecture when running on CUDA.

When I run the resnet18_cifar10 example on the server, it runs out of memory during epoch 4:

Epoch 4/10 | training error: 43.58% | test error: 48.08%: 40%|██████████████▊ | 4/10 [25:27<38:13, 382.20s/it]
CUDA Runtime Error at: /home/mf/Documents/RESNET/cuTAGI/src/data_struct_cuda.cu:83
out of memory
cudaMalloc(&this->d_jcb, size * sizeof(float))

I tracked the memory usage during the epochs and this is what I got:

lhnguyen102 commented 3 months ago

Miquel, thanks for pointing out the issue. I will work on it this week.

lhnguyen102 commented 3 months ago

@miquelflorensa and @jamesgoulet, I confirmed that the issue comes from the Python binding code. I tested the C++ version and ran a memory check on it: there is no memory leak in C++/CUDA. I am working on a fix for the Python binding code.

lhnguyen102 commented 2 months ago

@miquelflorensa I finally fixed the memory leak. See PR #72. The issue came from here. When the batch size changes over the course of training, this condition is triggered, but the old memory block was not deallocated before the new one was allocated. I appreciate that you tested resnet18 and pointed this out to me. Please test it again to make sure the memory leak is gone.
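
For readers following along, the pattern described above can be sketched in plain C++. This is only an illustration and not the code from PR #72: `JacobianBuffer`, `resize_leaky`, `resize_fixed`, `release`, `simulate`, and `live_bytes` are hypothetical names, the batch sizes are made up, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree` so that the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstdlib>

// Bytes currently held by the sketch (hypothetical bookkeeping, not cuTAGI code).
static std::size_t live_bytes = 0;

// Hypothetical stand-in for a device buffer such as the one behind d_jcb.
struct JacobianBuffer {
    float *data = nullptr;
    std::size_t size = 0;  // number of floats currently allocated

    // Leaky pattern: a size change triggers a new allocation,
    // but the previous block is never released.
    void resize_leaky(std::size_t new_size) {
        if (new_size != size) {
            data = static_cast<float *>(std::malloc(new_size * sizeof(float)));
            live_bytes += new_size * sizeof(float);
            size = new_size;
        }
    }

    // Fixed pattern: release the old block before allocating the new one.
    void resize_fixed(std::size_t new_size) {
        if (new_size != size) {
            release();
            data = static_cast<float *>(std::malloc(new_size * sizeof(float)));
            live_bytes += new_size * sizeof(float);
            size = new_size;
        }
    }

    // Frees the block the buffer currently points to, if any.
    void release() {
        if (data != nullptr) {
            std::free(data);
            live_bytes -= size * sizeof(float);
            data = nullptr;
            size = 0;
        }
    }
};

// Runs `epochs` epochs whose last batch is smaller than the others and
// returns the number of bytes still held at the end of training.
std::size_t simulate(bool use_fix, int epochs) {
    const std::size_t batch_sizes[] = {128, 128, 32};  // assumed sizes
    live_bytes = 0;
    JacobianBuffer buf;
    for (int e = 0; e < epochs; ++e) {
        for (std::size_t batch : batch_sizes) {
            if (use_fix) {
                buf.resize_fixed(batch);
            } else {
                buf.resize_leaky(batch);
            }
        }
    }
    const std::size_t held = live_bytes;
    buf.release();  // frees only the block the buffer still points to
    return held;
}
```

Under these assumptions, `simulate(true, n)` holds a single block no matter how many epochs run, while `simulate(false, n)` grows with every epoch because each change of batch size abandons the previous block. That matches the steady growth in memory usage reported at the top of this thread.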