Closed wang11wei closed 2 years ago
Which model are you training? With multi-scale training, a single image can contain so many objects that the IoU computation needs a large amount of GPU memory. You can set gpu_assign_thr in the config to move the IoU computation to the CPU.
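As a rough sketch of that suggestion: in MMDetection-style configs, `gpu_assign_thr` is an argument of `MaxIoUAssigner`, and when an image has more ground-truth boxes than this threshold the assignment runs on the CPU. The fragment below is an illustration only. The surrounding keys, the IoU thresholds, and the value `200` are assumptions, and the assigner type used by your OBBDetection model may differ, so adapt it to the config you are actually training with.

```python
# Hypothetical fragment of an MMDetection-style training config.
# Only gpu_assign_thr is the point here; every other value is illustrative.
train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1,
            # If an image has more than this many ground-truth boxes,
            # compute the IoU-based assignment on the CPU instead of the GPU.
            # A negative value (the usual default) never falls back to the CPU.
            gpu_assign_thr=200,
        ),
    ),
)
```

A lower threshold saves more GPU memory on crowded images at the cost of slower assignment for those images.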
Additional note: watching the memory usage of that GPU, it climbs gradually to about 4 GB, and then this exception is raised.
We have not run into this situation on our side. It may be a problem with the GPU setup; a configuration like this normally does not run out of GPU memory.
OK, thanks for the reply.
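Problem solved. Cause: the logger script in mmcv (0.6.2) creates a tensor with device=torch.device('cuda'), which places the tensor on GPU 0 (the default current device). If that GPU is already occupied, this error is raised. Solution: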
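A common way to avoid this class of error, offered here as an assumption rather than as the fix applied in this thread, is to hide the busy GPUs from the process before CUDA is initialised. A bare `torch.device('cuda')` then resolves to the one visible card. The helper name `pin_visible_gpu` and the index `2` are hypothetical.

```python
import os


def pin_visible_gpu(index: int) -> str:
    """Expose only one physical GPU to this process.

    Must run before torch initialises CUDA (safest: before importing torch),
    otherwise the setting has no effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Hypothetical choice: physical GPU 2 is the free 32 GB card.
visible = pin_visible_gpu(2)

# After this point, `import torch` sees a single device, so
# torch.device('cuda') and 'cuda:0' both refer to physical GPU 2,
# and GPU ids in the training config should be given as 0.
```

The same effect can be had from the shell by setting `CUDA_VISIBLE_DEVICES` when launching `tools/train.py`.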
The server has 4 GPUs, but I use only one of them (32 GB) to train on a dataset that I split myself with tools.img_split.py ("rates": [0.5, 1.0, 1.5]). I set
After running train.py, the following error is reported:
Traceback (most recent call last):
  File "/disk_sda/wangwei/OBBDetection-master/tools/train.py", line 153, in <module>
    main()
  File "/disk_sda/wangwei/OBBDetection-master/tools/train.py", line 149, in main
    meta=meta)
  File "/disk_sda/wangwei/OBBDetection-master/mmdet/apis/train.py", line 129, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 43, in train
    self.call_hook('after_train_iter')
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 282, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/hooks/logger/base.py", line 53, in after_train_iter
    self.log(runner)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/hooks/logger/text.py", line 169, in log
    log_dict['memory'] = self._get_max_memory(runner)
  File "/home/kqgis/anaconda3/envs/obbdetection/lib/python3.7/site-packages/mmcv/runner/hooks/logger/text.py", line 56, in _get_max_memory
    device=torch.device('cuda'))
RuntimeError: CUDA error: out of memory
Process finished with exit code 1