I trained on my own dataset with the mmrotate 1.x branch. During training, some batches produce a NaN loss, so I modified the MMEngine code at '/usr/local/python3/lib/python3.7/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py', adding a 'torch.isnan(loss).any()' check that skips the 'loss.backward()' call whenever NaN appears in the loss tensor. With this change, training proceeds normally even when a batch produces NaN.
However, after training for one or two epochs, 'torch.cuda.OutOfMemoryError: CUDA out of memory.' is raised. How can I fix this?
Prerequisite
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
1.x branch https://github.com/open-mmlab/mmrotate/tree/1.x
Environment
sys.platform: linux
Python: 3.7.16 (default, May 26 2023, 10:49:43) [GCC 11.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA A800 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.7, V11.7.99
GCC: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
PyTorch: 1.13.0+cu117
PyTorch compiling details: PyTorch built with:
TorchVision: 0.14.0+cu117
OpenCV: 4.7.0
MMEngine: 0.7.3
MMRotate: 1.0.0rc1+unknown
Reproduces the problem - code sample
The only code change is in MMEngine's OptimWrapper ('/usr/local/python3/lib/python3.7/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py'): a 'torch.isnan(loss).any()' check that skips 'loss.backward()' (and the subsequent optimizer step) when the loss tensor contains NaN.
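The exact patch is not reproduced here; the sketch below expresses the same NaN-skipping behaviour as a custom optimizer wrapper instead of an edit to the installed file. The class name 'NanSkipOptimWrapper' is only illustrative.

```python
# Sketch only: a registered OptimWrapper subclass that skips the
# backward pass and optimizer step whenever the loss contains NaN.
# The class name NanSkipOptimWrapper is made up for illustration.
import torch

from mmengine.optim import OptimWrapper
from mmengine.registry import OPTIM_WRAPPERS


@OPTIM_WRAPPERS.register_module()
class NanSkipOptimWrapper(OptimWrapper):
    """Skip parameter updates for batches whose loss contains NaN."""

    def update_params(self, loss, *args, **kwargs):
        if torch.isnan(loss).any():
            # Drop this batch: no loss.backward(), no optimizer.step(),
            # no zero_grad(); the graph attached to `loss` is discarded
            # once `loss` goes out of scope.
            return
        super().update_params(loss, *args, **kwargs)
```

If this wrapper were used, the config would select it with something like 'optim_wrapper = dict(type='NanSkipOptimWrapper', optimizer=...)' while keeping the original optimizer settings.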
Reproduces the problem - command or script
python3 tools/train.py configs/oriented_rcnn/oriented-rcnn-le90_swin-tiny_fpn_1x_dota.py --work-dir exp0
Reproduces the problem - error message
CUDA memory usage keeps increasing during training until 'torch.cuda.OutOfMemoryError: CUDA out of memory.' is raised.
Additional information
No response