jbwang1997 / OBBDetection

OBBDetection is an oriented object detection library based on MMDetection.
Apache License 2.0
525 stars, 112 forks

RuntimeError: CUDA error: out of memory #41

Closed wang11wei closed 2 years ago

wang11wei commented 2 years ago

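Resolved. Cause: the logger script in mmcv (0.6.2) creates a tensor with device=torch.device('cuda'), which places it on GPU 0 (the current default device). If GPU 0 is already occupied by another job, this error is raised. Possible fixes:

  1. Train on GPU 0 (the first card).
  2. Modify the mmcv source code.
  3. Try upgrading mmcv and see whether that resolves it.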
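
A variant of option 1 that does not require GPU 0 to be free is to hide the other cards from the process, so that the chosen card becomes the only (and therefore default) CUDA device. This is a minimal sketch rather than the fix used in the thread: `pin_visible_gpu` is a hypothetical helper, the index 2 is only an example, and it assumes the environment variable is set before the first CUDA call (safest is before `import torch`).

```python
import os


def pin_visible_gpu(physical_index: int) -> str:
    """Expose a single physical GPU to this process (hypothetical helper)."""
    value = str(physical_index)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Must run before CUDA is initialised; 2 is an example index, not from the issue.
pin_visible_gpu(2)

try:
    import torch
except ImportError:
    torch = None  # the sketch still runs where PyTorch is not installed

if torch is not None and torch.cuda.is_available():
    # Only one device is visible now, so a bare torch.device('cuda')
    # resolves to the pinned card instead of physical GPU 0.
    print(torch.cuda.device_count(), torch.cuda.current_device())
```

The same effect can usually be had from the shell by prefixing the training command with `CUDA_VISIBLE_DEVICES=2`, which avoids touching any source file.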

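The server has 4 GPUs, but I am using only one of them (32 GB) to train on my own dataset, which I split with tools.img_split.py ("rates": [0.5, 1.0, 1.5]). I set

After running train.py, the following error is reported: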

Traceback (most recent call last):
  File "/disk_sda/wangwei/OBBDetection-master/tools/train.py", line 153, in <module>
    main()
  File "/disk_sda/wangwei/OBBDetection-master/tools/train.py", line 149, in main
    meta=meta)
  File "/disk_sda/wangwei/OBBDetection-master/mmdet/apis/train.py", line 129, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 43, in train
    self.call_hook('after_train_iter')
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/hooks/logger/base.py", line 53, in after_train_iter
    self.log(runner)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/hooks/logger/text.py", line 169, in log
    log_dict['memory'] = self._get_max_memory(runner)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/hooks/logger/text.py", line 56, in _get_max_memory
    device=torch.device('cuda'))
RuntimeError: CUDA error: out of memory

Process finished with exit code 1

jbwang1997 commented 2 years ago

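Which model are you training? With multi-scale training, a single image can contain so many objects that the IoU computation needs a large amount of GPU memory. You can set gpu_assign_thr in the config to move the IoU computation to the CPU.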
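
To make the suggestion concrete: in MMDetection, `gpu_assign_thr` is an argument of `MaxIoUAssigner`, and as far as I understand it, assignment falls back to the CPU when an image has more ground-truth boxes than this threshold. The sketch below patches a config-like dict in place. `set_gpu_assign_thr` is a hypothetical helper, the toy `train_cfg` and the value 200 are illustrative, and it assumes the model's assigners are of type `MaxIoUAssigner`.

```python
def set_gpu_assign_thr(node, thr):
    """Set gpu_assign_thr on every MaxIoUAssigner dict; return how many were patched."""
    changed = 0
    if isinstance(node, dict):
        if node.get('type') == 'MaxIoUAssigner':
            node['gpu_assign_thr'] = thr
            changed += 1
        for value in node.values():
            changed += set_gpu_assign_thr(value, thr)
    elif isinstance(node, (list, tuple)):
        for item in node:
            changed += set_gpu_assign_thr(item, thr)
    return changed


# Toy stand-in for a detector's train_cfg; the field values are illustrative only.
train_cfg = dict(
    rpn=dict(assigner=dict(type='MaxIoUAssigner', pos_iou_thr=0.7, neg_iou_thr=0.3)),
    rcnn=dict(assigner=dict(type='MaxIoUAssigner', pos_iou_thr=0.5, neg_iou_thr=0.5)),
)

patched = set_gpu_assign_thr(train_cfg, 200)  # 200 is an example threshold
```

Editing the assigner entries by hand in the config file achieves the same thing; the helper only saves repeating the key for each assigner.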

wang11wei commented 2 years ago

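> Which model are you training? With multi-scale training, a single image can contain so many objects that the IoU computation needs a large amount of GPU memory. You can set gpu_assign_thr in the config to move the IoU computation to the CPU.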

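  1. The model being trained is faster_rcnn_obb_r50_fpn_1x_dota10.py.
  2. The images contain very few objects (no more than 20 per image). I tried setting gpu_assign_thr to 10, but the problem persists.

Additional note: watching the memory usage on that GPU, it climbs gradually to roughly 4 GB, and then the exception is raised.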

jbwang1997 commented 2 years ago

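We have not run into this situation on our side. It may be a GPU configuration issue; a setup like this normally does not run out of GPU memory.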

wang11wei commented 2 years ago

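> We have not run into this situation on our side. It may be a GPU configuration issue; a setup like this normally does not run out of GPU memory.

OK, thanks for the reply.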