lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥

Memory of gpu & cpu keep increasing during training (pytorch) #172

Open aylive opened 10 months ago

aylive commented 10 months ago

Impressive and very helpful work. Just a little confused: when trying to reproduce the training on COCO with the PyTorch implementation (default configs), I noticed that both CPU and GPU memory keep increasing as the iterations go on. I tried this on two servers:

  1. Intel Core i9 + 1× RTX 4090
  2. Intel Xeon + 1× RTX 3080

both with a single GPU (sorry for not giving more details about the servers; I'll add more if needed).

So far the training process has not been killed due to insufficient memory, but once the CPU memory is fully taken up, training slows down a lot.

I'm really struggling with this. Many thanks for your help.
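For reference, this is roughly how I'm watching the memory growth during training; it's a minimal sketch using psutil and torch, not part of the RT-DETR code itself:

```python
import psutil
import torch

def log_memory(step):
    # resident CPU memory of this process, in MiB
    cpu_mb = psutil.Process().memory_info().rss / 1024 ** 2
    # GPU memory allocated / reserved by PyTorch, in MiB
    gpu_alloc_mb = torch.cuda.memory_allocated() / 1024 ** 2
    gpu_reserved_mb = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"step {step}: cpu={cpu_mb:.0f} MiB, "
          f"gpu_alloc={gpu_alloc_mb:.0f} MiB, gpu_reserved={gpu_reserved_mb:.0f} MiB")

# called every N iterations inside the training loop, e.g.
# if step % 100 == 0:
#     log_memory(step)
```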

lyuwenyu commented 10 months ago

https://github.com/lyuwenyu/RT-DETR/issues/93

I don't know where the problem is either. But I will release a new version of the codebase in the future; you can star the repo and follow the updates.

tommyjiang commented 10 months ago

> Impressive and very helpful work. Just a little confused: when trying to reproduce the training on COCO with the PyTorch implementation (default configs), I noticed that both CPU and GPU memory keep increasing as the iterations go on. I tried this on two servers:
>
>   1. Intel Core i9 + 1× RTX 4090
>   2. Intel Xeon + 1× RTX 3080
>
> both with a single GPU (sorry for not giving more details about the servers; I'll add more if needed).
>
> So far the training process has not been killed due to insufficient memory, but once the CPU memory is fully taken up, training slows down a lot.
>
> I'm really struggling with this. Many thanks for your help.

Do you run evaluation after each training epoch? I tried turning off evaluation and the speed is much faster. I'm also interested in the PyTorch implementation for COCO; please share more info, thanks a lot!
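What I mean is roughly the following; it's only a sketch of a generic DETR-style training loop, and the names (train_one_epoch, evaluate, run_eval_every_epoch) are illustrative, not the actual RT-DETR API:

```python
import gc
import torch

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)    # illustrative helper

    if run_eval_every_epoch:                           # set to False to skip per-epoch eval
        coco_evaluator = evaluate(model, val_loader)   # accumulates predictions on the CPU
        # release the evaluator's accumulated results and cached GPU blocks afterwards
        del coco_evaluator
        gc.collect()
        torch.cuda.empty_cache()
```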

lyuwenyu commented 10 months ago

Yes, I do run evaluation after each epoch.

tommyjiang commented 10 months ago

> Yes, I do run evaluation after each epoch.

Thanks. I hit the same memory issue as reported in #93 and here. After I turned off evaluation after each training epoch, things seem to be OK.

lyuwenyu commented 10 months ago

> I tried turning off evaluation and the speed is much faster. I'm also interested in the PyTorch implementation for COCO; please share more info, thanks a lot!

Very useful information; perhaps you are right.

tommyjiang commented 10 months ago

> Very useful information; perhaps you are right.

Thanks, and thank you for the great work! I will do more tests locally to track down this issue when I have some time.

aylive commented 10 months ago

> Thanks. I hit the same memory issue as reported in #93 and here. After I turned off evaluation after each training epoch, things seem to be OK.

Thanks for the info, I'll try this. The problem is that I have to finetune on my own dataset, and without evaluation I can't tell when to stop before overfitting. How do you handle this?

tommyjiang commented 10 months ago

> Thanks for the info, I'll try this. The problem is that I have to finetune on my own dataset, and without evaluation I can't tell when to stop before overfitting. How do you handle this?

Just manually evaluate each epoch's checkpoint. For finetuning, maybe 3-5 epochs are enough.
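Something along these lines; a minimal sketch, assuming the trainer writes one checkpoint per epoch into an output directory and that you have some validate(model, val_loader) helper (both the path pattern and the helper name are illustrative, not the actual RT-DETR API):

```python
import glob
import torch

# evaluate the saved per-epoch checkpoints offline instead of during training
for ckpt_path in sorted(glob.glob("output/checkpoint*.pth")):
    state = torch.load(ckpt_path, map_location="cpu")
    # the key holding the weights depends on how the trainer saves checkpoints
    model.load_state_dict(state.get("model", state))
    model.eval()
    with torch.no_grad():
        stats = validate(model, val_loader)   # illustrative eval helper
    print(ckpt_path, stats)
```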