ucbdrive / few-shot-object-detection

Implementations of few-shot object detection benchmarks
Apache License 2.0

CUDA out of memory #127

Closed AISoltani closed 2 years ago

AISoltani commented 2 years ago

Hi, I have a problem training the first stage with this command and config:

python3 -m tools.train_net --num-gpus 1 --config-file configs/PascalVOC-detection/split1/faster_rcnn_R_101_FPN_base1.yaml

These are the last lines of the error:

    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/layers/batch_norm.py", line 54, in forward
    return x * scale.to(out_dtype) + bias.to(out_dtype)
RuntimeError: CUDA out of memory. Tried to allocate 464.00 MiB (GPU 0; 5.94 GiB total capacity; 4.47 GiB already allocated; 94.38 MiB free; 4.93 GiB reserved in total by PyTorch)

I think the issue is related to the batch size, but how can I change it for the first training stage?

Thanks to anyone who can help solve my problem.
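
As a rough reading of the error message above, the figures it reports can be pulled out and compared. The sketch below is only an illustration: `parse_oom` is a hypothetical helper, not part of PyTorch or detectron2, and it simply assumes the message layout shown in this issue. Comparing "requested" against "free" is a simplification, since PyTorch may also reuse memory it has already reserved.

```python
import re

# Error text copied from the report above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 464.00 MiB (GPU 0; 5.94 GiB total capacity; "
    "4.47 GiB already allocated; 94.38 MiB free; 4.93 GiB reserved in total by PyTorch)"
)

_UNITS_IN_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Hypothetical helper: return the memory figures of the message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError("could not find %r in message" % name)
        value, unit = match.groups()
        figures[name] = float(value) * _UNITS_IN_MIB[unit]
    return figures


figures = parse_oom(OOM_MESSAGE)
# Simplified comparison: how much more was requested than the GPU reports as free.
shortfall = figures["requested"] - figures["free"]
print("requested %.2f MiB, free %.2f MiB, short by %.2f MiB"
      % (figures["requested"], figures["free"], shortfall))
```

On a card with about 6 GiB in total, a single additional allocation of this size does not fit once the model and earlier activations are resident, which is consistent with the batch-size suspicion above.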

alphacyp commented 2 years ago

Is it only tools.train_net that fails to run? First, check your learning rate and IMS_PER_BATCH. Since you appear to be using one GPU, the learning rate and IMS_PER_BATCH should both be reduced to one eighth of their original values. Also check whether your PyTorch and detectron2 versions match the author's. If they do not, the code may not run.
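
The one-eighth suggestion can be sketched as simple arithmetic. This is a minimal illustration, not code from the repository: `scale_solver_for_gpus` is a hypothetical helper, the reference batch size of 16 is an assumption inferred from the value 2 used later in this thread, and the reference learning rate of 0.02 is the default mentioned later in the thread.

```python
def scale_solver_for_gpus(num_gpus, ref_gpus=8, ref_ims_per_batch=16, ref_base_lr=0.02):
    """Scale batch size and learning rate together with the number of GPUs.

    The reference values are assumptions: 8 GPUs, 16 images per batch, LR 0.02.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Batch size shrinks in proportion to the GPU count (never below 1 image).
    ims_per_batch = max(1, round(ref_ims_per_batch * num_gpus / ref_gpus))
    # Learning rate follows the actual batch size, so the two stay in proportion.
    base_lr = ref_base_lr * ims_per_batch / ref_ims_per_batch
    return ims_per_batch, base_lr


ims_per_batch, base_lr = scale_solver_for_gpus(num_gpus=1)
print(ims_per_batch, base_lr)
```

Under these assumptions, one GPU gives a batch of 2 images and a learning rate of 0.0025. Only the batch size changes how many images are held in GPU memory per step; the learning rate is scaled alongside it to keep the two in proportion.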

AISoltani commented 2 years ago

First, thank you for your answer. I can run the demo perfectly fine, so yes, it is only train_net. It actually starts and even enters the training loop, but exits immediately. I checked the YAML file (faster_rcnn_R_101_FPN_base1.yaml) but didn't see any variable like IMS_PER_BATCH. You are absolutely right about those parameters, but in which file exactly can I change them? This is my complete log:

Traceback (most recent call last):
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 232, in run_step
    loss_dict = self.model(data)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/Desktop/fsod/fsdet/modeling/meta_arch/rcnn.py", line 111, in forward
    features = self.backbone(images.tensor)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/modeling/backbone/fpn.py", line 126, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/modeling/backbone/resnet.py", line 448, in forward
    x = stage(x)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/modeling/backbone/resnet.py", line 201, in forward
    out = self.conv3(out)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/layers/wrappers.py", line 88, in forward
    x = self.norm(x)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/layers/batch_norm.py", line 54, in forward
    return x * scale.to(out_dtype) + bias.to(out_dtype)
RuntimeError: CUDA out of memory. Tried to allocate 526.00 MiB (GPU 0; 5.94 GiB total capacity; 4.52 GiB already allocated; 190.56 MiB free; 5.05 GiB reserved in total by PyTorch)
[08/13 18:20:41 d2.engine.hooks]: Total training time: 0:00:02 (0:00:00 on hooks)
[08/13 18:20:41 d2.utils.events]: iter: 0 lr: N/A max_mem: 4893M
Traceback (most recent call last):
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cudaa/Desktop/fsod/tools/train_net.py", line 113, in <module>
    args=(args,),
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "/home/cudaa/Desktop/fsod/tools/train_net.py", line 101, in main
    return trainer.train()
  File "/home/cudaa/Desktop/fsod/fsdet/engine/defaults.py", line 445, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 232, in run_step
    loss_dict = self.model(data)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/Desktop/fsod/fsdet/modeling/meta_arch/rcnn.py", line 111, in forward
    features = self.backbone(images.tensor)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/modeling/backbone/fpn.py", line 126, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/modeling/backbone/resnet.py", line 448, in forward
    x = stage(x)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/modeling/backbone/resnet.py", line 201, in forward
    out = self.conv3(out)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/layers/wrappers.py", line 88, in forward
    x = self.norm(x)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cudaa/anaconda3/envs/fsod/lib/python3.6/site-packages/detectron2/layers/batch_norm.py", line 54, in forward
    return x * scale.to(out_dtype) + bias.to(out_dtype)
RuntimeError: CUDA out of memory. Tried to allocate 526.00 MiB (GPU 0; 5.94 GiB total capacity; 4.52 GiB already allocated; 190.56 MiB free; 5.05 GiB reserved in total by PyTorch)

alphacyp commented 2 years ago

This parameter is in Base-RCNN-FPN.yaml
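
A likely reason the setting does not appear in faster_rcnn_R_101_FPN_base1.yaml is that detectron2-style configs inherit from a base file through a `_BASE_` key, so the split-specific file only lists what differs. The sketch below imitates that layering with plain dictionaries. It is not detectron2's actual loader: `merge_config` is a hypothetical helper, and the dictionary contents are illustrative assumptions rather than a copy of the real files.

```python
import copy


def merge_config(base, override):
    """Hypothetical helper: recursively merge `override` into a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Leaf value (or new section): the override wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative stand-in for Base-RCNN-FPN.yaml (values are assumptions).
base_cfg = {"SOLVER": {"IMS_PER_BATCH": 16, "BASE_LR": 0.02}}
# Illustrative stand-in for the split1 file: it does not mention IMS_PER_BATCH.
child_cfg = {"MODEL": {"RESNETS": {"DEPTH": 101}}}
# Single-GPU values discussed in this thread.
single_gpu = {"SOLVER": {"IMS_PER_BATCH": 2, "BASE_LR": 0.0025}}

effective = merge_config(merge_config(base_cfg, child_cfg), single_gpu)
print(effective["SOLVER"])
```

In the real project the same effect comes from editing the `SOLVER` section of Base-RCNN-FPN.yaml, as suggested above; the dictionaries here only show why a value set in the base file still applies to the split-specific config.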

AISoltani commented 2 years ago

Thanks, that works! But I changed only IMS_PER_BATCH to 2 and left LR at its default of 0.02. The code sometimes runs without any memory problem and sometimes stops at a random point, for example at iteration 299. Why is it random? And what effect does LR have on GPU memory? I would be grateful if you could explain this.

alphacyp commented 2 years ago

Maybe change the LR to one eighth of the original as well and try again?

AISoltani commented 2 years ago

Yes, it works, thanks. I'm now working on registering a new dataset, which I think is quite hard ...

alphacyp commented 2 years ago

That's OK. If you have the test results on the new dataset, we can continue to communicate. Thanks!