Closed — miquelflorensa closed this issue 2 months ago
Miquel, thanks for pointing out the issue. I will work on it this week.
@miquelflorensa and @jamesgoulet, I confirmed that the issue comes from the Python binding code. I've tested the C++ version and run a memory check: there is no memory leak in C++/CUDA. I am working on a fix for the Python binding code.
@miquelflorensa I finally fixed the memory leak. See PR #72. The issue came from here: when the batch size changes over the course of training, this condition is triggered, but no memory is deallocated before the new block is allocated. I appreciate that you tested resnet18 and pointed this out to me. Please do test it again to make sure there is no memory leak.
@lhnguyen102 Hi Ha,
I am not sure if it is supposed to behave like this, but I believe there is a memory leak in the ResNet architecture in CUDA.
While running the resnet18_cifar10 example on the server, it runs out of memory at epoch 4:
```
Epoch 4/10 | training error: 43.58% | test error: 48.08%: 40%|██████████████▊ | 4/10 [25:27<38:13, 382.20s/it]
CUDA Runtime Error at: /home/mf/Documents/RESNET/cuTAGI/src/data_struct_cuda.cu:83 out of memory cudaMalloc(&this->d_jcb, size * sizeof(float))
```
I tracked the memory usage across epochs, and this is what I got: