open-mmlab / mmcv

OpenMMLab Computer Vision Foundation
https://mmcv.readthedocs.io/en/latest/
Apache License 2.0

TypeError: 'DataContainer' object is not iterable #2524

Open PatrickkZ opened 1 year ago

PatrickkZ commented 1 year ago

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.

description

Hi, I'm trying to train the SoloV2 model provided by mmdet on CPU in distributed mode. I know training on CPU is not recommended, but is it possible? I changed some code in tools/train.py to train the model on CPU, and then I got this error:

File "/home/zehuan/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 69, in train_step
    output = self.module.train_step(*inputs, **kwargs)
  File "/home/zehuan/projects/ctc/mmdetection/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/home/zehuan/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zehuan/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/home/zehuan/projects/ctc/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/zehuan/projects/ctc/mmdetection/mmdet/models/detectors/single_stage_instance_seg.py", line 103, in forward_train
    for gt_mask in gt_masks
TypeError: 'DataContainer' object is not iterable

After debugging, I wonder whether MMDDP behaves inconsistently between CPU and GPU. Because I use the CPU, device_ids in MMDDP is None, so the train_step call on line 69 is executed without the inputs being scattered. On GPU, the inputs are scattered first:

if self.device_ids:
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

So I changed line 69 to:

inputs, kwargs = self.scatter(inputs, kwargs, [-1])
output = self.module.train_step(*inputs[0], **kwargs[0])

With this change, training SoloV2 works on CPU. Is this the right way to do it? If not, how can I train with MMDDP on a CPU cluster?
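To illustrate the failure and the effect of the change, here is a minimal sketch that does not depend on mmcv. `FakeDataContainer`, `unwrap`, and `train_step` are hypothetical stand-ins written only for this example; they are not mmcv's real classes or its scatter implementation. The sketch assumes that the relevant effect of scattering is to replace each wrapper with its payload before the model sees the batch.

```python
# Stand-in sketch: none of these names come from mmcv.
from typing import Any


class FakeDataContainer:
    """Hypothetical wrapper that holds data but defines no __iter__."""

    def __init__(self, data: Any) -> None:
        self.data = data


def unwrap(obj: Any) -> Any:
    """Recursively replace every FakeDataContainer with its payload."""
    if isinstance(obj, FakeDataContainer):
        return unwrap(obj.data)
    if isinstance(obj, dict):
        return {key: unwrap(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(unwrap(value) for value in obj)
    return obj


def train_step(gt_masks: Any) -> int:
    """Mimics the failing loop in the traceback: iterates over gt_masks."""
    return sum(len(mask) for mask in gt_masks)


batch = {"gt_masks": FakeDataContainer([[1, 2], [3]])}

# Without unwrapping, the wrapper reaches the loop and iteration fails.
try:
    train_step(**batch)
    error_message = ""
except TypeError as err:
    error_message = str(err)

# With unwrapping first, the loop receives a plain list.
result = train_step(**unwrap(batch))
```

In this sketch the first call fails with the same kind of `TypeError` as in the traceback, and the second call succeeds because the wrapper is removed before `train_step` runs. Whether mmcv's own scatter behaves identically for every field on CPU is something the maintainers would need to confirm.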

env

HAOCHENYE commented 1 year ago

Thanks for your feedback, and I think this modification is reasonable, would you mind posting a PR to fix this?

PatrickkZ commented 1 year ago

> Thanks for your feedback, and I think this modification is reasonable, would you mind posting a PR to fix this?

sure