RuntimeError:CUDA out of memory.

yuki-0321 commented 4 years ago

pytorch = 1.1.0. I can print the net and visualize it, but when I run train.py, the program is killed at "seg_out, edge_out = net(input)". I then tried "from thop import profile" to count the model's parameters and FLOPs, but that also failed: "RuntimeError: module must have its parameters and buffers on device cuda:0 but found one of them on device: cpu". I specified the device as 'cuda:0', but the error persists.
How can I solve these errors? Can anyone tell me the params and FLOPs of the GSCNN net? In other words, how much memory is needed to run it?
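
A minimal sketch of the device handling behind the second error, assuming the message comes from the model's weights still being on the CPU while the code expects them on cuda:0. The small `net` below is a hypothetical stand-in, not the GSCNN network; the point is only that the model and its input are moved to the same device before the forward pass, and that parameters can be counted directly.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real network; GSCNN itself is not built here.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Fall back to the CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move the parameters and buffers first, then create the input on the same device.
net = net.to(device).eval()
x = torch.randn(1, 3, 64, 64, device=device)

# Parameter count, independent of any profiling library.
n_params = sum(p.numel() for p in net.parameters())

# Check that every parameter and buffer really lives on the chosen device.
tensors = list(net.parameters()) + list(net.buffers())
on_device = all(t.device == device for t in tensors)

with torch.no_grad():
    out = net(x)

print(n_params, on_device, tuple(out.shape))
```

Once the model and the input sit on the same device, a profiler such as thop can be given the same `net` and `x`. Note that a parameter count alone does not predict training memory, since activations usually dominate for segmentation networks.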

cfanfan commented 4 years ago

Have you solved your problem? I hit the same error: RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.93 GiB total capacity; 6.91 GiB already allocated; 51.88 MiB free; 122.42 MiB cached). If you solve it, could you please tell me the solution? Thanks!
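
The numbers in that message can be read off mechanically. Below is a small sketch that parses the quoted error text and compares the requested allocation with what is free; the helper name `parse_oom` is made up for illustration, and the regular expression only covers the message format shown above.

```python
import re

MESSAGE = (
    "CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.93 GiB total capacity; "
    "6.91 GiB already allocated; 51.88 MiB free; 122.42 MiB cached)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in the message, converted to MiB."""
    labelled = re.compile(
        r"(\d+(?:\.\d+)?) (KiB|MiB|GiB) (total capacity|already allocated|free|cached)"
    )
    sizes = {
        label: float(value) * UNIT_TO_MIB[unit]
        for value, unit, label in labelled.findall(message)
    }
    requested = re.search(r"Tried to allocate (\d+(?:\.\d+)?) (KiB|MiB|GiB)", message)
    sizes["requested"] = float(requested.group(1)) * UNIT_TO_MIB[requested.group(2)]
    return sizes


sizes = parse_oom(MESSAGE)
shortfall = sizes["requested"] - sizes["free"]
print(f"requested {sizes['requested']:.2f} MiB, free {sizes['free']:.2f} MiB")
print(f"short by {shortfall:.2f} MiB")
```

Here the request is larger than the free memory and almost the whole card is already allocated, so the fixes discussed in this thread (smaller inputs, smaller crops, smaller batches) all aim at reducing what is allocated per step.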

cfanfan commented 4 years ago

@yuki-0321

yuki-0321 commented 4 years ago

Hi, I resized the input img to (64, 64) in cityscapes.py, and then train.py could run. You can try that.

cfanfan commented 4 years ago

Thank you so much! I tried to modify it the way you described, but it did not work; maybe my change is wrong. Could you please tell me where and how you made the modification? Thanks again!

yuki-0321 commented 4 years ago

Hi, in cityscapes.py, find the input 'img' and 'mask'; you can resize them with the resize method in the PIL library.
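
A minimal sketch of that resize step, using blank stand-in images rather than the real Cityscapes loader. The helper name `resize_pair` and the choice of filters are assumptions, not the poster's code: nearest-neighbour is used for the mask so that class ids are never blended into values that do not correspond to any class.

```python
from PIL import Image

TARGET_SIZE = (64, 64)  # (width, height), the size mentioned in this thread


def resize_pair(img, mask, size=TARGET_SIZE):
    """Resize an image and its label mask to the same size."""
    img = img.resize(size, Image.BILINEAR)   # smooth interpolation for the photo
    mask = mask.resize(size, Image.NEAREST)  # keep label ids intact
    return img, mask


# Stand-in data with the Cityscapes frame size; a real loader would open files here.
img = Image.new("RGB", (2048, 1024))
mask = Image.new("L", (2048, 1024), color=7)

small_img, small_mask = resize_pair(img, mask)
print(small_img.size, small_mask.size)
```

Shrinking to 64 x 64 discards most of the image detail, so this is better suited to checking that the pipeline runs than to training for accuracy.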

cfanfan commented 4 years ago

Thank you! Thank you! Thank you!

cfanfan commented 4 years ago

Hi, how big is your GPU memory? Also, you said to resize the input 'img' and 'mask' with the resize method in the PIL library. At which line should these two statements be added?

shifangtian commented 4 years ago

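Hi, how big is your GPU memory? Also, for the input 'img' and 'mask', you can use the resize method in the PIL library. Which line are these two statements added to?

Did you solve this problem? I have the same problem.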

yuki-0321 commented 4 years ago

My GPU memory is almost 80G. I added those two statements in def __getitem__, right after 'img' and 'mask'.

HAOCHENYE commented 4 years ago

You can just change '--crop_size' in train.py. I can train the net with '--bs_mult=3', '--crop_size=336', '--bs_mult_val=1' (GPU memory 8G).
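
As a rough guide to why a smaller crop helps, the sketch below compares the number of input pixels per batch for two settings, on the assumption that activation memory grows roughly in proportion to it. The function name and the baseline crop of 720 are illustrative assumptions; check train.py for the actual default.

```python
def relative_pixels(crop_size, batch_size, ref_crop_size, ref_batch_size):
    """Ratio of input pixels per batch between a setting and a reference setting."""
    pixels = crop_size ** 2 * batch_size
    ref_pixels = ref_crop_size ** 2 * ref_batch_size
    return pixels / ref_pixels


# Assumed baseline crop of 720 (hypothetical), same batch size on both sides.
ratio = relative_pixels(crop_size=336, batch_size=3, ref_crop_size=720, ref_batch_size=3)
print(f"crop 336 uses about {ratio:.0%} of the pixels of crop 720")
```

This ignores the memory taken by the weights and optimizer state, which does not shrink with the crop, so the real saving is smaller than the ratio suggests.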

paul-adlink commented 4 years ago

I also hit the CUDA out-of-memory problem, so I changed the architecture with '--trunk resnet50' and dropped '--snapshot checkpoints/best_cityscapes_checkpoint.pth'. I also used the parameters suggested by @HAOCHENYE. It seems to start training, but very slowly (I changed args.num_workers from 4 to 1 in __init__.py, because I only have one GPU card, a Titan Xp with 12G memory). Training seems to consume around 7G. I'm still observing the training process and am not sure whether it is training from scratch. The validation console output looks like this:
12-10 08:06:15.581 validating: 1 / 500
12-10 08:08:31.224 validating: 21 / 500
12-10 08:10:43.240 validating: 41 / 500
12-10 08:12:57.430 validating: 61 / 500
12-10 08:15:09.786 validating: 81 / 500
12-10 08:17:23.038 validating: 101 / 500
12-10 08:19:35.823 validating: 121 / 500
12-10 08:21:49.924 validating: 141 / 500
12-10 08:24:01.765 validating: 161 / 500
12-10 08:26:15.050 validating: 181 / 500
12-10 08:28:28.284 validating: 201 / 500
12-10 08:30:42.002 validating: 221 / 500
12-10 08:32:53.939 validating: 241 / 500
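
To put a number on "very slow", the timestamps in that log can be turned into a per-step rate. This is a small sketch using only the first and last log entries quoted above; it assumes the counter advances by one per validation image, which the log itself does not confirm.

```python
from datetime import datetime

# First and last entries copied from the validation log above: (timestamp, counter).
FIRST = ("12-10 08:06:15.581", 1)
LAST = ("12-10 08:32:53.939", 241)


def seconds_per_step(first, last):
    """Average wall-clock seconds per counter step between two log entries."""
    (stamp0, step0), (stamp1, step1) = first, last
    # Only the time of day is parsed; both entries are from the same date.
    t0 = datetime.strptime(stamp0.split()[1], "%H:%M:%S.%f")
    t1 = datetime.strptime(stamp1.split()[1], "%H:%M:%S.%f")
    return (t1 - t0).total_seconds() / (step1 - step0)


rate = seconds_per_step(FIRST, LAST)
print(f"{rate:.2f} s per validation step")
print(f"about {rate * 500 / 60:.0f} min for all 500 steps")
```

At several seconds per step, a full validation pass takes the better part of an hour on this setup, which matches the poster's impression.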

HAOCHENYE commented 4 years ago

Actually, this project only supports wideresnet, because there is no code for resnet50 or resnet101 in gscnn.py (although those nets are defined in network).

paul-adlink commented 4 years ago

@HAOCHENYE Thanks for pointing that out; indeed, it cannot train with resnet50 or resnet101 in the current version. Have you successfully trained with wideresnet on a single GPU? (The author seems to have mentioned that it cannot be trained on a single GPU card.)

HAOCHENYE commented 4 years ago

Set --syncbn=False in train.py. I also deleted some data augmentation and changed the image transformation, because with the released code the loss is hard to converge on a single GPU card. I'm still training the net, and the loss seems to be converging.

shifangtian commented 4 years ago

Thank you very much for your reply. Did you use Resnet50 to train from scratch instead of using the original pretrained checkpoint? If so, how do I modify the code to start training from scratch? Can you share this part of your code with me? I am a beginner; can you help me? By the way, can you share your environment configuration other than GPU memory?

shifangtian commented 4 years ago

It's a pity that my GPU memory is only 11G.

HAOCHENYE commented 4 years ago

@shifangtian You can follow my repo https://github.com/HAOCHENYE/GSCNN-for-Single-GPU. This net is very hard to train on a single GPU; my mean IoU only reached 0.4084.

paul-adlink commented 4 years ago

@HAOCHENYE Thanks!!! I will try it. I think it can train (though I might hit the same convergence problem you are seeing now), and of course reducing the augmentation should limit the memory usage!

paul-adlink commented 4 years ago

@HAOCHENYE mentioned that ResNet50/101 is not implemented in this project as of now (2019/12/16), and I found the same. For the settings, you can refer to the earlier comments in this discussion by @HAOCHENYE and me. Good luck!

arieling commented 4 years ago

As stated in the README, to reproduce the numbers in the paper you need at least 8 GPUs. I would recommend at least 8 * 16GB (I reproduced the numbers with 8 * 32GB).

arieling commented 4 years ago

WiderResNet38 can't be trained on a single GPU. Please use at least 8 * 16G.

nv-tlabs / GSCNN

RuntimeError:CUDA out of memory. #34