shaunyuan22 / SODA-mmrotate

SODA-A Small Object Detection Toolbox and Benchmark
https://shaunyuan22.github.io/SODA/
Apache License 2.0

[Bug] AssertionError: loss log variables are different across GPUs! #2

Closed · bozihu closed this issue 1 year ago

bozihu commented 1 year ago

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmrotate

Environment

```
sys.platform: linux
Python: 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.58
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.2+cu111
OpenCV: 4.7.0
MMCV: 1.6.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMRotate: 0.3.3+04da23d
```
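
The report above is the kind of dump produced by mmcv's `collect_env` utility; a quick way to regenerate it under mmcv 1.x (a small sketch, not part of the original report):

```python
# Print the same kind of environment report using the mmcv 1.x API.
from mmcv.utils import collect_env

for name, value in collect_env().items():
    print(f'{name}: {value}')
```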

Reproduces the problem - command or script

When I run inference on the test data, the error occurs during the merge process.

Reproduces the problem - error message

```
Traceback (most recent call last):
  File "./tools/train.py", line 192, in <module>
    main()
  File "./tools/train.py", line 181, in main
    train_detector(
  File "/data1/code/SODA-mmrotate/mmrotate/apis/train.py", line 141, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/anaconda3/envs/SODA_rotate/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/anaconda3/envs/SODA_rotate/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/anaconda3/envs/SODA_rotate/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/anaconda3/envs/SODA_rotate/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/anaconda3/envs/SODA_rotate/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 249, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home/anaconda3/envs/SODA_rotate/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 208, in _parse_losses
    assert log_var_length == len(log_vars) * dist.get_world_size(), \
AssertionError: loss log variables are different across GPUs!
rank 7 len(log_vars): 2 keys: loss_cls,loss_bbox
```
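
For context: the assertion is raised inside mmdet's `BaseDetector._parse_losses`, which checks that every GPU produced the same set of loss keys for the current iteration. A minimal sketch of that kind of cross-rank check (a paraphrase, not mmdet's exact code; the function name and signature below are made up for illustration):

```python
import torch
import torch.distributed as dist


def check_log_vars_consistent(log_vars: dict, device: torch.device) -> None:
    """Assert that every rank logged the same number of loss keys.

    Each rank contributes len(log_vars) to an all-reduced sum; if all ranks
    log the same keys, that sum equals len(log_vars) * world_size. A rank
    that skipped a loss term (e.g. a batch with no valid ground truth)
    breaks the equality and raises the AssertionError shown above.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return
    log_var_length = torch.tensor(len(log_vars), device=device)
    dist.all_reduce(log_var_length)  # sum of key counts over all ranks
    message = (f'rank {dist.get_rank()} len(log_vars): {len(log_vars)} '
               f'keys: {",".join(log_vars.keys())}')
    assert log_var_length == len(log_vars) * dist.get_world_size(), \
        'loss log variables are different across GPUs!\n' + message
```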

Additional information

No response

shaunyuan22 commented 1 year ago

There may be empty images, i.e. images with no available annotations. You can try other models and see whether this error still occurs. By the way, more information, such as the model settings, would be helpful here.
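
If empty images are indeed the cause, one common workaround in mmdet-style configs is to drop images without valid ground truth from the training set. A hedged sketch, assuming the SODA-A dataset class follows mmdet's `CustomDataset` interface and accepts the standard `filter_empty_gt` flag; the dataset type and paths below are placeholders, not values taken from this issue:

```python
# Sketch of a train-dataset tweak for an mmdet/mmrotate-style config.
# Assumption: the dataset class accepts mmdet's standard `filter_empty_gt`
# argument. The type name and paths are placeholders.
data = dict(
    train=dict(
        type='SODAADataset',                       # assumed dataset class
        ann_file='data/sodaa/train/Annotations/',  # placeholder path
        img_prefix='data/sodaa/train/Images/',     # placeholder path
        filter_empty_gt=True,   # skip images with no usable annotations
        pipeline=[...],         # keep the original training pipeline here
    ),
)
```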