open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Not on the same device error in PointAssigner #10436

Open: hujh1994 opened this issue 1 year ago

hujh1994 commented 1 year ago

Training a RepPoints network with tools/train.py fails with RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). The last lines of the traceback are:
File "xxxxxxxx\mmdetection\mmdet\models\task_modules\assigners\point_assigner.py", line 114, in assign
    points_index = points_range[lvl_idx]
According to GitHub Desktop and to my memory, I have not modified any mmdetection code. I am using the CityPersons dataset, but I believe the problem is not related to the dataset.
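The failing line can probably be reduced to a few lines of plain PyTorch. The sketch below is an assumption-based reduction, not mmdetection code: the helper name index_cpu_tensor_with_mask and the toy level values are hypothetical. Judging from the proposed fix further down, points_range is created with torch.arange and no device argument (so it is a CPU tensor), while the boolean mask lvl_idx is derived from tensors on the training device.

```python
import torch


def index_cpu_tensor_with_mask(mask_device: str) -> torch.Tensor:
    """Hypothetical reduction of the failing line in PointAssigner.assign.

    points_range is built with torch.arange and no device argument, so it is
    always a CPU tensor. The boolean mask lvl_idx is built from points_lvl,
    which lives on mask_device (the GPU during training).
    """
    # Toy FPN levels for five priors; the values are illustrative only.
    points_lvl = torch.tensor([3, 3, 4, 4, 5], device=mask_device)
    points_range = torch.arange(points_lvl.shape[0])  # no device: CPU tensor
    lvl_idx = points_lvl == 4  # boolean mask on mask_device
    return points_range[lvl_idx]


# Both tensors on the CPU: indexing succeeds.
print(index_cpu_tensor_with_mask("cpu"))  # tensor([2, 3])

if torch.cuda.is_available():
    try:
        index_cpu_tensor_with_mask("cuda")
    except RuntimeError as err:
        # Expected to mirror the error reported above (PyTorch 1.13.1);
        # older PyTorch versions may not raise here.
        print(f"RuntimeError: {err}")
    else:
        print("no error with this PyTorch version")
```

On a CPU-only machine only the first call runs, so the mismatch needs a CUDA device to show up, which is consistent with the report coming from a GPU training run.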

Environment
sys.platform: win32
Python: 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
NVCC: Cuda compilation tools, release 11.6, V11.6.124
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.35.32217.1 for x64
GCC: n/a
PyTorch: 1.13.1
PyTorch compiling details: PyTorch built with:

TorchVision: 0.14.1
OpenCV: 4.7.0
MMEngine: 0.7.3
MMDetection: 3.0.0+ecac3a7

Error traceback
Traceback (most recent call last):
  File "path_to_my_project\train.py", line 133, in <module>
    main()
  File "path_to_my_project\train.py", line 129, in main
    runner.train()
  File "C:\Users\HuShi\anaconda3\lib\site-packages\mmengine\runner\runner.py", line 1721, in train
    model = self.train_loop.run()  # type: ignore
  File "C:\Users\HuShi\anaconda3\lib\site-packages\mmengine\runner\loops.py", line 278, in run
    self.run_iter(data_batch)
  File "C:\Users\HuShi\anaconda3\lib\site-packages\mmengine\runner\loops.py", line 301, in run_iter
    outputs = self.runner.model.train_step(
  File "C:\Users\HuShi\anaconda3\lib\site-packages\mmengine\model\base_model\base_model.py", line 114, in train_step
    losses = self._run_forward(data, mode='loss')  # type: ignore
  File "C:\Users\HuShi\anaconda3\lib\site-packages\mmengine\model\base_model\base_model.py", line 340, in _run_forward
    results = self(**data, mode=mode)
  File "C:\Users\HuShi\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "xxxxx\mmdetection\mmdet\models\detectors\base.py", line 92, in forward
    return self.loss(inputs, data_samples)
  File "xxxxx\mmdetection\mmdet\models\detectors\single_stage.py", line 78, in loss
    losses = self.bbox_head.loss(x, batch_data_samples)
  File "xxxxx\mmdetection\mmdet\models\dense_heads\base_dense_head.py", line 123, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "xxxxx\mmdetection\mmdet\models\dense_heads\reppoints_head.py", line 705, in loss_by_feat
    cls_reg_targets_init = self.get_targets(
  File "xxxxx\mmdetection\mmdet\models\dense_heads\reppoints_head.py", line 561, in get_targets
    sampling_results_list) = multi_apply(
  File "xxxxx\mmdetection\mmdet\models\utils\misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "xxxxx\warpnet\mmdetection\mmdet\models\dense_heads\reppoints_head.py", line 444, in _get_targets_single
    assign_result = assigner.assign(pred_instances, gt_instances,
  File "xxxxx\warpnet\mmdetection\mmdet\models\task_modules\assigners\point_assigner.py", line 114, in assign
    points_index = points_range[lvl_idx]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Bug fix
Creating points_range on the same device as points_lvl solves the problem. I will open a PR to fix it.
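A minimal sketch of the proposed fix, assuming the failing code selects the priors of one FPN level with a boolean mask, as the traceback line points_index = points_range[lvl_idx] suggests. The helper name points_on_level and the toy priors are hypothetical, and the actual patch in the PR may differ; the essential change is passing device= to torch.arange.

```python
import torch


def points_on_level(points: torch.Tensor, gt_lvl: int) -> torch.Tensor:
    """Return the indices of the priors whose FPN level equals gt_lvl.

    Hypothetical helper modelled on the per-gt loop in PointAssigner.assign.
    points has shape (n, 3); the columns are (x, y, stride).
    """
    points_lvl = torch.log2(points[:, 2]).int()  # stride 8 -> level 3, 16 -> 4, ...
    # The fix: create points_range on the same device as points_lvl,
    # so the boolean mask below can index it on CPU and GPU alike.
    points_range = torch.arange(points.shape[0], device=points_lvl.device)
    lvl_idx = points_lvl == gt_lvl
    return points_range[lvl_idx]


device = "cuda" if torch.cuda.is_available() else "cpu"
priors = torch.tensor(
    [[0.0, 0.0, 8.0], [8.0, 0.0, 8.0], [0.0, 0.0, 16.0], [0.0, 0.0, 32.0]],
    device=device,
)
print(points_on_level(priors, 4))  # index 2, on the same device as priors
```

An equivalent variant drops the separate index tensor altogether: torch.nonzero(lvl_idx).squeeze(1) returns the same indices, already on the mask's device. Either way, the point is that the indexed tensor and the mask must share a device.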

shankar-vision-eng commented 11 months ago

I have run into the same issue. It looks like a PyTorch upgrade caused it, because the issue was not present in torch 1.10. This is the check in torch that throws the error. Is there any way this can be fixed in the next release?