hsiangyuzhao / RCPS

official implementation of rectified contrastive pseudo supervision
MIT License

CUDA out of memory #2

Closed thundercult closed 1 year ago

thundercult commented 1 year ago

Hello, thank you very much for sharing your work. When I try to run train.py, the error always happens no matter how many GPUs I use. It is strange that different GPUs require different amounts of memory. The best setup I have used is 4 NVIDIA A100s. Training runs fine, but when the evaluation loop starts at the fifth iteration, the error is always raised. Do you know how to fix it?

hsiangyuzhao commented 1 year ago

Hi @thundercult, could you provide the full traceback for the error so I can look into it?

thundercult commented 1 year ago

Of course. Do you mind if I add you on WeChat, or is there some other way to contact you? That would be more convenient.

hsiangyuzhao commented 1 year ago

I think it would be better if we discuss it here, so that other people may benefit from the discussion. From the traceback, the error happens in the sliding_window_inference function of the MONAI package. This function takes an argument sw_batch_size, which is the batch size of the patches during inference. Setting this value too high can also raise a CUDA OOM error; maybe you can look into it.
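For illustration, a minimal sketch of how sw_batch_size is passed to MONAI's sliding_window_inference; the input shape, roi_size and the stand-in Conv3d model below are placeholders, not the actual configuration of this repo:

import torch
from monai.inferers import sliding_window_inference

# Placeholder 3D volume and a stand-in model; shapes and values are only examples.
device = "cuda" if torch.cuda.is_available() else "cpu"
image = torch.randn(1, 1, 128, 128, 128, device=device)
model = torch.nn.Conv3d(1, 2, kernel_size=3, padding=1).to(device)

with torch.no_grad():
    pred = sliding_window_inference(
        inputs=image,
        roi_size=(96, 96, 96),  # patch size fed to the network; example value
        sw_batch_size=1,        # number of patches per forward pass; lower this on CUDA OOM
        predictor=model,
        overlap=0.25,
    )

Lowering sw_batch_size trades inference speed for peak memory, since fewer patches are forwarded through the network at once.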

thundercult commented 1 year ago

Thank you very much. I accidentally deleted the traceback; I'll paste it below.

Traceback (most recent call last):
  File "/home3/@/RCPS-main/train.py", line 186, in <module>
    main()
  File "/home3/@/RCPS-main/train.py", line 161, in main
    model.evaluate_one_step(True if (epoch + 1) % args.save_interval == 0 else False,
  File "/home3/@/RCPS-main/models/segmentation_models.py", line 305, in evaluate_one_step
    self.pred_l = sliding_window_inference(self.image_l, roi_size=self.cfg['TEST']['PATCH_SIZE'],
  File "/home3/@/RCPS/lib/python3.10/site-packages/monai/inferers/utils.py", line 185, in sliding_window_inference
    seg_prob_out = predictor(window_data, *args, **kwargs)  # batched patch segmentation
  File "/home3/@/RCPS-main/models/segmentation_models.py", line 136, in predictor
    output = self.network(inputs)['out']
  File "/home3/@/RCPS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home3/@/RCPS/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home3/@/RCPS/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home3/@/RCPS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home3/@/RCPS-main/models/networks.py", line 136, in forward
    x0_4 = self.conv0_4(torch.cat([x0, self.upsample(x1_3, x0)], dim=1))
  File "/home3/@/envs/RCPS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home3/@/RCPS-main/base/base_modules.py", line 412, in forward
    identity = self.id(identity)
  File "/home3/@/envs/RCPS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home3/@/RCPS-main/base/base_modules.py", line 362, in forward
    return self.basic_forward(x)
  File "/home3/@/RCPS-main/base/base_modules.py", line 348, in basic_forward
    x = self.norm(x)
  File "/home3/@/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home3/@/envs/RCPS/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py", line 740, in forward
    return F.batch_norm(
  File "/home3/@/lib/python3.10/site-packages/torch/nn/functional.py", line 2450, in batch_norm
    return torch.batch_norm(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 246.00 MiB (GPU 0; 39.42 GiB total capacity; 26.91 GiB already allocated; 243.06 MiB free; 27.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1162259) of binary: /@/RCPS/bin/python
Traceback (most recent call last):
  File "/home3/@/RCPS/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home3/@/envs/RCPS/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home3/@/envs/RCPS/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home3/@/envs/RCPS/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home3/@/envs/RCPS/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home3/@/envs/RCPS/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

thundercult commented 1 year ago

Sorry to bother you again. I have changed sw_batch_size from 4 to 1, but the error still arises (CUDA out of memory). Could this error be related to PyTorch DDP?

hsiangyuzhao commented 1 year ago

Perhaps? My experiments were carried out on 2 RTX 3090s and it works fine in my case. Maybe you can test running with fewer GPUs and see if the problem still exists.
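Also, the allocator message in your traceback suggests setting max_split_size_mb to reduce fragmentation; a minimal sketch of setting it before any CUDA allocation (the 128 MiB value is only an example, not something tested here):

import os

# Must be set before the first CUDA allocation, e.g. at the very top of train.py,
# or exported in the shell before launching torchrun. 128 MiB is an example value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the environment variable is set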

ChenZhuYuam commented 1 year ago

Hello, thank you very much for sharing your work. Where is the sharpening operation in the code? I've been searching for it for a long time but haven't found it.

hsiangyuzhao commented 1 year ago

> Where is the sharpening operation in the code? I've been searching for it for a long time but haven't found it.

pseudo_label = F.softmax(targets / self.cfg['TRAIN']['TEMP'], dim=1).detach()
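For context, a quick illustration of why dividing the logits by a temperature below 1 sharpens the softmax distribution; the temperature 0.5 below is only an example value, not necessarily the TRAIN.TEMP setting used in the config:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5]])      # toy class logits for a single voxel
plain = F.softmax(logits, dim=1)              # roughly [0.63, 0.23, 0.14]
sharpened = F.softmax(logits / 0.5, dim=1)    # roughly [0.84, 0.11, 0.04], more peaked
pseudo_label = sharpened.detach()             # detached, as in the line above

The smaller the temperature, the closer the pseudo label gets to a one-hot distribution.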