open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Too much GPU memory consumption and seems wrong #3167

Open lxj1999 opened 1 year ago

lxj1999 commented 1 year ago

When I use ResNet-101 with DeepLabV3+ from https://github.com/VainF/DeepLabV3Plus-Pytorch, I can train with a total batch size of 40 on 2 GPUs (20 per GPU), and it consumes about 36 GB. However, with your library, a batch size of 2 on a single GPU already consumes about 15 GB.
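
To compare the two libraries on equal terms, it helps to measure peak GPU memory for one training step the same way in both. A minimal PyTorch sketch (the tiny model below is only a stand-in, not the actual DeepLabV3+):

```python
import torch
import torch.nn as nn

# Stand-in model; replace with the real network you want to profile.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 19, 1)).cuda()
images = torch.randn(2, 3, 480, 840, device='cuda')        # batch size 2, 840x480 inputs
targets = torch.randint(0, 19, (2, 480, 840), device='cuda')

torch.cuda.reset_peak_memory_stats()                        # clear the previous peak
loss = nn.functional.cross_entropy(model(images), targets)
loss.backward()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```

Wrapping one iteration of each framework's training loop in the same measurement makes the 15 GB vs 36 GB numbers directly comparable.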

lxj1999 commented 1 year ago

A batch size of 2 is too big for my two 24 GB GPUs: it crashes around iteration 2400 and I have to restart the computer.

wujiang0156 commented 1 year ago

A batch size of 4 is too big for my two 24 GB GPUs, and my input images are 512x512. With batch size 8 my server runs out of memory. What resolution are your images? @lxj1999

lxj1999 commented 1 year ago

My image is 840 by 480

wujiang0156 commented 1 year ago

> My image is 840 by 480

So your batch size is 2 at roughly the same resolution as my 512x512 inputs, and it still runs out of memory. Why does the batch size have to be so small? @everyone

wujiang0156 commented 1 year ago

Why does training speed decrease when I increase the batch size?

lxj1999 commented 1 year ago

> So your batch size is 2 at roughly the same resolution as my 512x512 inputs, and it still runs out of memory. Why does the batch size have to be so small? @everyone

The library's memory consumption is excessive; there may be an optimization problem or a memory leak.

lxj1999 commented 1 year ago

> Why does training speed decrease when I increase the batch size?

Same issue here: with batch size 2 it only uses about 15 GB of my 24 GB of memory, yet the GPU runs at 100% utilization the whole time, which is unusual compared with other libraries.

wujiang0156 commented 1 year ago

(screenshot of the config names) `8xb2` means a batch size of 2 per GPU on 8 GPUs, so the reference setting assumes 8-GPU multi-GPU training. Is that related to the memory consumption?
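
For reference, the `NxbM` part of an mmsegmentation config name encodes the training setup the reported numbers assume: N GPUs with a batch size of M per GPU (so `8xb2` is 8 GPUs x 2 images each). The per-GPU batch size itself is set in the dataloader. A minimal sketch in mmsegmentation 1.x config style, where the base config path is a placeholder for whichever DeepLabV3+ config you actually use:

```python
# Placeholder base config; point this at your real DeepLabV3+ config file.
_base_ = ['./my_deeplabv3plus_base.py']

# "8xb2" in the config name = 8 GPUs x batch size 2 per GPU.
# The value below is the *per-GPU* batch size used by each process.
train_dataloader = dict(batch_size=2, num_workers=2)
```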

wujiang0156 commented 1 year ago

When I use 2 GPUs, training is slower than with 1 GPU. Have you ever had this kind of problem?

wujiang0156 commented 1 year ago

@lxj1999 (see the attached screenshot) Training is performed by MMEngine; maybe that is where the memory is consumed. Are you running it through MMEngine?
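
Since 1.x training is driven by MMEngine, one setting that often reduces memory noticeably is mixed-precision training via MMEngine's `AmpOptimWrapper`. A hedged sketch of the config override (the optimizer values shown are only an example; recent versions of `tools/train.py` also accept an `--amp` flag that applies an equivalent change):

```python
# Sketch: run the optimizer step under automatic mixed precision.
# Activations are kept in fp16 where safe, which usually lowers memory use.
optim_wrapper = dict(
    type='AmpOptimWrapper',
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005),
    clip_grad=None)
```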

davidhuangal commented 11 months ago

I am having this issue as well, specifically with DeepLabV3+ models.

tlian96 commented 11 months ago

Hi guys, has anyone solved this kind of problem? When I use Detectron2 to train DeepLabV3+ with a ResNet-101 backbone, I can set batch size 8 with a crop size of (512, 1024) on my 24 GB GPU. But with mmsegmentation, it runs out of GPU memory at batch size 8. It seems mmsegmentation consumes much more memory than other libraries.
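
To make the comparison with Detectron2 apples-to-apples, the crop size and per-GPU batch size can be pinned explicitly. A sketch in mmsegmentation 1.x config style (the base config path is a placeholder; field names follow the 1.x `data_preprocessor` convention):

```python
# Placeholder base config; point this at your real DeepLabV3+ config file.
_base_ = ['./my_deeplabv3plus_base.py']

crop_size = (512, 1024)                             # (height, width) training crop
data_preprocessor = dict(size=crop_size)            # pad/crop inputs to this size
model = dict(data_preprocessor=data_preprocessor)
train_dataloader = dict(batch_size=8, num_workers=4)
```

The `RandomCrop` size in the dataset's train pipeline should usually match `crop_size` as well.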

tlian96 commented 11 months ago

Hi, did you solve this problem? I think I'm stuck in the same situation as you.

wujiang0156 commented 11 months ago

So try multi-GPU training on Linux to work around it.

davidhuangal commented 11 months ago

That might solve the out of memory error, but there's still the issue that their implementation of DeepLabV3+ uses too much memory in the first place.

tlian96 commented 11 months ago

> That might solve the out of memory error, but there's still the issue that their implementation of DeepLabV3+ uses too much memory in the first place.

Yes, I think the problem is specific to DeepLabV3+. I tested some of the other models and they work fine without any memory errors. But it seems nobody is working on this issue right now.
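
If DeepLabV3+ with a ResNet backbone is the heavy case, one standard memory/compute trade-off is gradient checkpointing, which the ResNet implementation used here exposes through a `with_cp` flag. A minimal sketch to append to a config that inherits from a DeepLabV3+ ResNet base (iterations get slower, but activation memory drops):

```python
# Sketch: enable gradient checkpointing in the backbone. Intermediate
# activations are recomputed during the backward pass instead of being stored.
model = dict(backbone=dict(with_cp=True))
```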

lxj1999 commented 11 months ago

> That might solve the out of memory error, but there's still the issue that their implementation of DeepLabV3+ uses too much memory in the first place.
>
> Yes, I think the problem is specific to DeepLabV3+. I tested some of the other models and they work fine without any memory errors. But it seems nobody is working on this issue right now.

Actually it is bad for other models too. I have tried UNet and Swin Transformer, and both break down at batch size 2 on 48 GB GPUs.

lxj1999 commented 11 months ago

I think this problem comes from their basic structure. Switching to another available open-source library may be a better choice rather than staying stuck on this memory issue.