himashi92 / VT-UNet

[MICCAI 2022] This is an official PyTorch implementation of "A Robust Volumetric Transformer for Accurate 3D Tumor Segmentation"
MIT License

Still overflowing GPU VRAM with reduced batch size #27

Closed Chadkowski closed 1 year ago

Chadkowski commented 2 years ago

Hi! I'm still having issues with overflowing my VRAM (RTX 3090, 24 GB) whenever I attempt to train, even after I've reduced my batch size to 1. Any ideas on what I can do?

Chadkowski commented 2 years ago

It still overflows even when running the base model alone or the small model alone. I changed the channels from 4 to 1 in Swin3D, but to no avail.

himashi92 commented 2 years ago

Hi, you cannot load both models at once for training. Can you check whether any PIDs are listed under your GPU? Try this command: sudo fuser -v /dev/nvidia0
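(For reference, a minimal sketch of checking free VRAM from inside Python, assuming a recent PyTorch that provides torch.cuda.mem_get_info; the fuser command above reports the offending processes at the driver level.)

```python
import torch

# Minimal sketch: report free vs. total VRAM on the device before training
# starts. If a large chunk is already missing, another process (or a stale
# PID from a crashed run) is still holding memory, and killing it should
# free the VRAM.
device = torch.device("cuda:0")
free_bytes, total_bytes = torch.cuda.mem_get_info(device)
print(f"free:  {free_bytes / 1024 ** 3:.2f} GiB")
print(f"total: {total_bytes / 1024 ** 3:.2f} GiB")
```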

himashi92 commented 2 years ago

What is the dataset you are trying to train on here? If it's BraTS, then the input channel size is 4.
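(For context, BraTS cases consist of four MRI modalities (T1, T1ce, T2, FLAIR) stacked along the channel axis; a minimal sketch of the tensor shape the network then expects, assuming a batch size of 1 and a 128x128x128 patch:)

```python
import torch

# Minimal sketch of a BraTS-style input: the four MRI modalities are stacked
# along the channel axis, giving (batch, channels, depth, height, width).
x = torch.randn(1, 4, 128, 128, 128)  # batch=1, 4 modalities, 128x128x128 patch
print(x.shape)  # torch.Size([1, 4, 128, 128, 128])
```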

Chadkowski commented 2 years ago

Managed to sort it by changing the patch size to 64/128/128 for my dataset, but now I have the following issue:

```
2022-07-22 15:28:50.920116: epoch: 0
Traceback (most recent call last):
  File "/home/dawidr/.local/bin/vtunet_train", line 33, in <module>
    sys.exit(load_entry_point('vtunet', 'console_scripts', 'vtunet_train')())
  File "/home/VT-UNet/VTUNet/vtunet/run/run_training.py", line 150, in main
    trainer.run_training()
  File "/home/VT-UNet/VTUNet/vtunet/training/network_training/vtunetTrainerV2_vtunet_tumor_base.py", line 430, in run_training
    ret = super().run_training()
  File "/home/VT-UNet/VTUNet/vtunet/training/network_training/vtunetTrainer.py", line 319, in run_training
    super(vtunetTrainer, self).run_training()
  File "/home/VT-UNet/VTUNet/vtunet/training/network_training/network_trainer.py", line 459, in run_training
    l = self.run_iteration(self.tr_gen, True)
  File "/home/VT-UNet/VTUNet/vtunet/training/network_training/vtunetTrainerV2_vtunet_tumor_base.py", line 232, in run_iteration
    l = self.loss(output, target)
  File "/home/dawidr/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/VT-UNet/VTUNet/vtunet/training/loss_functions/dice_loss.py", line 352, in forward
    ce_loss = self.ce(net_output, target[:, 0].long()) if self.weight_ce != 0 else 0
  File "/home/dawidr/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/VT-UNet/VTUNet/vtunet/training/loss_functions/crossentropy.py", line 12, in forward
    return super().forward(input, target.long())
  File "/home/dawidr/.local/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/dawidr/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected target size [1, 128, 128, 128], got [1, 64, 128, 128]
```

himashi92 commented 2 years ago

Hi, the current version of VTUNet can only handle an input size of Cx128x128x128. To process inputs of size Cx64x128x128, the network architecture would need to be modified accordingly.
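(A minimal sketch of what triggers the RuntimeError in the traceback above: F.cross_entropy requires the target's spatial dimensions to match the prediction's, so a 128x128x128 network output paired with a 64x128x128 label crop fails, while matching shapes work. The number of classes below is illustrative only.)

```python
import torch
import torch.nn.functional as F

num_classes = 4  # illustrative only

# Network prediction over a 128x128x128 patch: (batch, classes, D, H, W)
pred = torch.randn(1, num_classes, 128, 128, 128)

# Label crop of 64x128x128: spatial size does not match the prediction
target_bad = torch.randint(0, num_classes, (1, 64, 128, 128))
try:
    F.cross_entropy(pred, target_bad)
except RuntimeError as err:
    print(err)  # Expected target size [1, 128, 128, 128], got [1, 64, 128, 128]

# With a target whose spatial size matches the prediction, the loss computes fine
target_ok = torch.randint(0, num_classes, (1, 128, 128, 128))
print(F.cross_entropy(pred, target_ok).item())
```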