lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥
Apache License 2.0

Memory of gpu & cpu keep increasing during training (pytorch) #172

Open aylive opened 8 months ago

aylive commented 8 months ago

Impressive and very helpful work. Just a little confused: when trying to reproduce the training on COCO with the PyTorch implementation (default configs), I noticed that both CPU and GPU memory keep increasing as the iterations go on. I tried this on two servers, each with a single GPU:

  1. Intel Core i9 + RTX 4090 x1
  2. Intel Xeon + RTX 3080 x1

(:< sorry for not giving more details about the servers, I'll add more if needed)

So far the training process has not been killed due to insufficient memory, but once CPU memory is completely used up, training slows down a lot.

I'm really struggling with this. Many thanks for your help.
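
For reference, a minimal sketch of how the growth can be tracked, assuming `psutil` is installed (it is not part of this repo), called once per epoch or every N iterations inside the training loop:

```python
import os

import psutil  # assumption: installed separately (pip install psutil)
import torch


def log_memory(tag: str) -> None:
    """Print host RSS and allocated CUDA memory so growth across epochs is visible."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    cuda_gb = torch.cuda.memory_allocated() / 1024 ** 3 if torch.cuda.is_available() else 0.0
    print(f"[{tag}] cpu_rss={rss_gb:.2f} GB  cuda_allocated={cuda_gb:.2f} GB")


# example: call log_memory(f"epoch {epoch}") at the end of each epoch
```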

lyuwenyu commented 8 months ago

https://github.com/lyuwenyu/RT-DETR/issues/93

I don't know where the problem is either. But I will release a new version of the codebase in the future; you can star the repo and keep following for updates.

tommyjiang commented 8 months ago

> I noticed that both CPU and GPU memory keep increasing as the iterations go on. [...] once CPU memory is completely used up, training slows down a lot.

Do you run evaluation after each training epoch? I tried turning off evaluation, and the speed is much faster. I'd also like to know more about the PyTorch training on COCO; please share more info, thanks a lot!

lyuwenyu commented 8 months ago

Yes, I do run evaluation after each epoch.

tommyjiang commented 8 months ago

> Yes, I do run evaluation after each epoch.

Thanks. I'm hitting the same memory issue as reported in #93 and here. After turning off the evaluation after each training epoch, the performance seems to be OK.
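
If the growth really is tied to the per-epoch COCO evaluation, one thing worth trying is rebuilding the evaluator each epoch and releasing host/GPU caches after it runs. A rough sketch; the loop and function names are placeholders, not this repo's actual API:

```python
import gc

import torch


def release_eval_memory() -> None:
    """Call after per-epoch evaluation, once the evaluator object has gone out
    of scope, to hand back memory the eval step touched."""
    gc.collect()                   # free host-side objects (evaluators can hold
                                   # per-image results in RAM until collected)
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # return cached CUDA blocks to the driver


# rough usage (placeholder names, not this repo's API):
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     evaluate(...)              # build the COCO evaluator inside evaluate()
#     release_eval_memory()      # so it is dropped when the call returns
```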

lyuwenyu commented 8 months ago

> I'd also like to know more about the PyTorch training on COCO; please share more info, thanks a lot!

Very useful information; perhaps you are right.

tommyjiang commented 8 months ago

> Very useful information; perhaps you are right.

Thanks, and thank you for the great work! I will do more tests locally to track down this issue when I have some time.

aylive commented 8 months ago

> After turning off the evaluation after each training epoch, the performance seems to be OK.

Thanks for the info, I'll try this. But I have to finetune on my own dataset, and without evaluation I can't tell when to stop before overfitting. How do you solve this problem?

tommyjiang commented 8 months ago

> I have to finetune on my own dataset, and without evaluation I can't tell when to stop before overfitting. How do you solve this problem?

Just manually evaluate each epoch's checkpoint. For finetuning, maybe 3-5 epochs are enough.
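
A sketch of that workflow: evaluate each saved checkpoint in its own process, so whatever memory the evaluation accumulates is freed when that process exits. The config path, checkpoint pattern, and `--test-only` flag below are assumptions about the entry points rather than confirmed ones, so adjust them to your local setup:

```python
import glob
import subprocess

# Evaluate every saved checkpoint in a fresh process (paths/flags are assumptions).
for ckpt in sorted(glob.glob("output/rtdetr_r50vd_6x_coco/checkpoint*.pth")):
    subprocess.run(
        [
            "python", "tools/train.py",
            "-c", "configs/rtdetr/rtdetr_r50vd_6x_coco.yml",
            "-r", ckpt,
            "--test-only",
        ],
        check=True,
    )
```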