Hi, @Benybrahim
You can try with_cp=True; please check here for more details.
By the way, SETR needs a lot of computational capacity. I suggest switching to a non-transformer model that needs less GPU memory, because even with batch_size=1 a GTX 1080 Ti cannot ensure normal model training.
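For reference, a minimal sketch of what that override might look like in a config file (assuming the backbone you use actually exposes a with_cp argument, as the ResNet-based backbones in MMSegmentation do; the base config path is a placeholder):

```python
_base_ = './your_base_config.py'  # placeholder: the config you are training from

model = dict(
    backbone=dict(
        # Trade compute for memory: re-run the forward pass of each block
        # during backward instead of caching all intermediate activations.
        with_cp=True))
```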
Best,
Thank you @MengzhangLI.
A better GPU will instantly solve the problem.
Using with_cp=True will not work in my case, since the VIT_MLA model that I'm using is customized and doesn't have a with_cp parameter.
I posted the question here too: https://github.com/LARC-CMU-SMU/FoodSeg103-Benchmark-v1
Thank you again.
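(Side note for anyone in the same situation: even when a custom backbone has no with_cp flag, the same memory saving can usually be added by hand with torch.utils.checkpoint. The sketch below is a toy example, not the VIT_MLA code; it assumes a backbone whose forward simply iterates over a list of blocks.)

```python
import torch
import torch.utils.checkpoint as cp


class CheckpointedBackbone(torch.nn.Module):
    """Toy backbone that recomputes each block during backward to save memory."""

    def __init__(self, layers, use_checkpoint=True):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for layer in self.layers:
            if self.use_checkpoint and x.requires_grad:
                # Activations of `layer` are not stored; they are recomputed
                # in the backward pass, trading speed for GPU memory.
                x = cp.checkpoint(layer, x)
            else:
                x = layer(x)
        return x
```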
OK, very happy to hear you won't be bothered by the GPU memory error anymore.
Do you think it would be meaningful to integrate this FoodSeg103 dataset into MMSegmentation, and what problems might we meet if we plan to integrate it? We are interested in supporting more datasets.
Best,
I think you can ask @XiongweiWu, since he is the owner of the repo, but I guess it should be meaningful, since it is the only dataset with food ingredient segmentation.
@Benybrahim I am very glad to see you have addressed the problem, and thanks for letting me know.
@MengzhangLI Hi, first of all, thanks for your suggestion! I think the dataset is meaningful since it's the only dataset for fine-grained food ingredient segmentation, and we are also glad to have more researchers involved in this task. I need to discuss it carefully with our project leader to avoid any license issues, and I will update you when we finish.
Hi, @XiongweiWu
Thanks for your nice reply. We do hope we can support this great dataset for the community, and it would absolutely get more researchers involved in it.
Feel free to contact us anytime.
Best,
@MengzhangLI Hi Mengzhang, I am sorry for replying late; I was involved in a COVID-19 positive case recently (luckily I tested negative in the end). I have just seen your email, and our project leader also agreed to merge the dataset into the official mmsegmentation repo, but we hope other researchers can still download the dataset via the application form (so that we can trace the download records) and cite our paper if they use the dataset.
Wow, that is great news! Thanks for your kind and generous support!
Also very happy to hear you are negative and healthy.
You can see from our previous dataset preparations that we strictly follow the rules of data usage: users of MMSegmentation must go to the original website of each dataset to accept the license and finish registration. So we can absolutely meet your leader's requirements. ;)
Let us keep in touch, and we hope to support benchmark models as soon as possible!
@MengzhangLI Hi, sorry for the basic question.
I get the same error regarding GPU memory when trying to train PSPNet on the Cityscapes dataset with a single GPU.
I have already changed SyncBN to BN in the config file, set batch_size=1, and used with_cp=True, but I still get a GPU memory error.
The GPU is an NVIDIA GTX 1060.
Thanks in advance
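(For concreteness, the kinds of overrides described above would typically look something like this in an MMSegmentation 0.x config; the base config path is a placeholder.)

```python
_base_ = './pspnet_r50-d8_512x1024_40k_cityscapes.py'  # placeholder base config

# Plain BN instead of SyncBN, since training runs on a single GPU.
norm_cfg = dict(type='BN', requires_grad=True)

model = dict(
    backbone=dict(norm_cfg=norm_cfg, with_cp=True),
    decode_head=dict(norm_cfg=norm_cfg),
    auxiliary_head=dict(norm_cfg=norm_cfg))

# batch_size = samples_per_gpu * number of GPUs; keep it at 1 per GPU here.
data = dict(samples_per_gpu=1, workers_per_gpu=1)
```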
Here is the log:
and this is what appears after the log in the terminal:
2021-09-07 18:20:27,626 - mmseg - INFO - workflow: [('train', 1)], max: 40000 iters
Traceback (most recent call last):
  File "tools/train.py", line 167, in <module>
    main()
  File "tools/train.py", line 156, in main
    train_segmentor(
  File "/home/babak/virtualenvs/env3/mmsegmentation/mmseg/apis/train.py", line 120, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 936.00 MiB (GPU 0; 5.93 GiB total capacity; 2.59 GiB already allocated; 1022.25 MiB free; 2.78 GiB reserved in total by PyTorch)
Exception raised from malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
[C++ backtrace, frames #0-#28: c10::cuda::CUDACachingAllocator, at::native::cudnn_convolution_backward_weight, at::cudnn_convolution_backward, torch::autograd::generated::CudnnConvolutionBackward::apply, torch::autograd::Engine in libc10, libc10_cuda, libtorch_cuda, libtorch_cpu, libtorch_python, libstdc++, libpthread, libc]
I think it is caused by small GPU memory. Maybe you can try some tiny models with a ResNet-18 backbone.
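As a rough illustration (not an official recipe), switching a PSPNet config to a ResNet-18 backbone usually means overriding the backbone depth and the head input channels, roughly like this; the exact channel numbers should be taken from the r18 configs shipped with MMSegmentation:

```python
_base_ = './pspnet_r50-d8_512x1024_40k_cityscapes.py'  # assumed base config

model = dict(
    # ImageNet-pretrained ResNet-18 weights from the open-mmlab model zoo.
    pretrained='open-mmlab://resnet18_v1c',
    backbone=dict(depth=18),
    # ResNet-18 ends with 512 channels (vs. 2048 for ResNet-50),
    # so the decode/auxiliary heads need matching input widths.
    decode_head=dict(in_channels=512, channels=128),
    auxiliary_head=dict(in_channels=256, channels=64))
```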
Looking forward to your feedback.
@MengzhangLI Thank you so much for the quick response and help. I tried PSPNet, DeepLabv3, and FCN, all of them with a ResNet backbone of depth 18. According to the config tables, all three of them are supposed to use less than 2 GB of GPU memory. Strangely, I got the same GPU memory error with PSPNet and DeepLabv3, but FCN is working now. Although they are different models, with the same backbone and depth the allocated memory should be on the same level, right? Unless the listed memory usage in the config tables is per GPU (you used 4 in your trainings, if I'm correct), but that should have been compensated by the batch_size. Is there a way that I can reduce memory usage and trade it for speed? The card that I have now has 6 GB of memory and is still problematic. I would appreciate it very much if you could give me your insights on this.
Maybe you could try FP16; see this config for an example: https://github.com/open-mmlab/mmsegmentation/blob/master/configs/bisenetv2/bisenetv2_fcn_fp16_4x4_1024x1024_160k_cityscapes.py
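If it helps, the FP16 configs in MMSegmentation 0.x essentially just swap in mmcv's Fp16OptimizerHook on top of an existing config. A minimal sketch (the base config path and loss_scale value here are illustrative):

```python
_base_ = './pspnet_r18-d8_512x1024_80k_cityscapes.py'  # assumed base config

# Run forward/backward in half precision; the loss is scaled up before
# backward and gradients are unscaled afterwards to avoid FP16 underflow.
optimizer_config = dict(type='Fp16OptimizerHook', loss_scale=512.)
fp16 = dict()  # placeholder that marks the model for FP16 wrapping
```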
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
I'm trying to finetune a food segmentation model, found here, on a new dataset.
When trying to train the model, I got this error. The batch_size is set to 1. Thank you in advance for any insights you can give.
Reproduction
Command
Configuration file
I used this Japanese dataset for food segmentation: https://mm.cs.uec.ac.jp/uecfoodpix/. I got the model from the FoodSeg repo and tried to finetune it on the Japanese data.
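(A hedged sketch of what that kind of finetuning setup usually looks like in an MMSegmentation config; the base config path, checkpoint path, and class count below are placeholders, not values from the FoodSeg repo.)

```python
_base_ = './your_foodseg_base_config.py'  # placeholder: config the checkpoint was trained with

num_classes = 102  # placeholder: number of classes in the new dataset

model = dict(
    decode_head=dict(num_classes=num_classes),
    # Only needed if the base model actually defines an auxiliary head.
    auxiliary_head=dict(num_classes=num_classes))

# Start from the pretrained FoodSeg checkpoint instead of random initialization.
load_from = 'checkpoints/foodseg_pretrained.pth'  # placeholder path
```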
Environment
Error traceback