open-mmlab / mmselfsup

OpenMMLab Self-Supervised Learning Toolbox and Benchmark
https://mmselfsup.readthedocs.io/en/latest/
Apache License 2.0

[Bug] RuntimeError: each data_itement in list of batch should be of equal size #729

Closed Acuhe closed 1 year ago

Acuhe commented 1 year ago

Branch

1.x branch (1.x version, such as v1.0.0rc2, or dev-1.x branch)

Prerequisite

Environment

System environment:
    sys.platform: linux
    Python: 3.9.16 (main, Dec 7 2022, 01:11:51) [GCC 9.4.0]
    CUDA available: True
    numpy_random_seed: 1816210116
    GPU 0: Tesla T4
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.8, V11.8.89
    GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    PyTorch: 1.13.1+cu116
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 1

Describe the bug

The model and loaded state dict do not match exactly
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
RuntimeError: each data_itement in list of batch should be of equal size

I followed the instructions for evaluating a pre-trained model from the model zoo on the VOC detection task. I can't figure out the source of the RuntimeError. Could you tell me how to check the size of the "data_itement" mentioned in the message? In addition, the loaded DenseCL model does not match the detection model exactly.

Reproduces the problem - code sample

%%writefile /content/mmselfsup/configs/benchmarks/mmdetection/voc0712/faster_rcnn_r50_c4_ms-3k_voc0712_test.py
_base_ = 'faster-rcnn_r50-c4_ms-24k_voc0712.py'

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001))

optim_wrapper = dict(optimizer=dict(lr=0.02 * (1 / 8)))

data_root = 'data/VOCdevkit/'
train_dataloader = dict(
    sampler=dict(type='InfiniteSampler', shuffle=True),
    dataset=dict(
        _delete_=True,
        type='ConcatDataset',
        ignore_keys=['dataset_type'],
        datasets=[
            dict(
                type='VOCDataset',
                data_root=data_root,
                ann_file='VOC2007/ImageSets/Main/trainval.txt',
                data_prefix=dict(sub_data_root='VOC2007/'),
                filter_cfg=dict(filter_empty_gt=True, min_size=32),
            ),
            dict(
                type='VOCDataset',
                data_root=data_root,
                ann_file='VOC2012/ImageSets/Main/trainval.txt',
                data_prefix=dict(sub_data_root='VOC2012/'),
                filter_cfg=dict(filter_empty_gt=True, min_size=32),
            )
        ]))

custom_imports = dict(
    imports=['mmselfsup.models.utils.res_layer_extra_norm_mine'],  # modified extranorm
    allow_failed_imports=False)

!bash tools/benchmarks/mmdetection/mim_dist_train_c4.sh \
    configs/benchmarks/mmdetection/voc0712/faster_rcnn_r50_c4_ms-3k_voc0712_test.py \
    /content/mmselfsup/checkpoints/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth \
    1

Reproduces the problem - command or script

No response

Reproduces the problem - error message

03/27 11:06:00 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
03/27 11:06:00 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
03/27 11:06:05 - mmengine - INFO - load backbone. in model from: /content/mmselfsup/checkpoints/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth
Loads checkpoint by local backend from path: /content/mmselfsup/checkpoints/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth
03/27 11:06:06 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: layer4.0.conv1.weight, layer4.0.bn1.weight, layer4.0.bn1.bias, layer4.0.bn1.running_mean, layer4.0.bn1.running_var, layer4.0.bn1.num_batches_tracked, layer4.0.conv2.weight, layer4.0.bn2.weight, layer4.0.bn2.bias, layer4.0.bn2.running_mean, layer4.0.bn2.running_var, layer4.0.bn2.num_batches_tracked, layer4.0.conv3.weight, layer4.0.bn3.weight, layer4.0.bn3.bias, layer4.0.bn3.running_mean, layer4.0.bn3.running_var, layer4.0.bn3.num_batches_tracked, layer4.0.downsample.0.weight, layer4.0.downsample.1.weight, layer4.0.downsample.1.bias, layer4.0.downsample.1.running_mean, layer4.0.downsample.1.running_var, layer4.0.downsample.1.num_batches_tracked, layer4.1.conv1.weight, layer4.1.bn1.weight, layer4.1.bn1.bias, layer4.1.bn1.running_mean, layer4.1.bn1.running_var, layer4.1.bn1.num_batches_tracked, layer4.1.conv2.weight, layer4.1.bn2.weight, layer4.1.bn2.bias, layer4.1.bn2.running_mean, layer4.1.bn2.running_var, layer4.1.bn2.num_batches_tracked, layer4.1.conv3.weight, layer4.1.bn3.weight, layer4.1.bn3.bias, layer4.1.bn3.running_mean, layer4.1.bn3.running_var, layer4.1.bn3.num_batches_tracked, layer4.2.conv1.weight, layer4.2.bn1.weight, layer4.2.bn1.bias, layer4.2.bn1.running_mean, layer4.2.bn1.running_var, layer4.2.bn1.num_batches_tracked, layer4.2.conv2.weight, layer4.2.bn2.weight, layer4.2.bn2.bias, layer4.2.bn2.running_mean, layer4.2.bn2.running_var, layer4.2.bn2.num_batches_tracked, layer4.2.conv3.weight, layer4.2.bn3.weight, layer4.2.bn3.bias, layer4.2.bn3.running_mean, layer4.2.bn3.running_var, layer4.2.bn3.num_batches_tracked

03/27 11:06:06 - mmengine - INFO - load backbone. in model from: /content/mmselfsup/checkpoints/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth
Loads checkpoint by local backend from path: /content/mmselfsup/checkpoints/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth
03/27 11:06:06 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: conv1.weight, bn1.weight, bn1.bias, bn1.running_mean, bn1.running_var, bn1.num_batches_tracked, layer1.0.conv1.weight, layer1.0.bn1.weight, layer1.0.bn1.bias, layer1.0.bn1.running_mean, layer1.0.bn1.running_var, layer1.0.bn1.num_batches_tracked, layer1.0.conv2.weight, layer1.0.bn2.weight, layer1.0.bn2.bias, layer1.0.bn2.running_mean, layer1.0.bn2.running_var, layer1.0.bn2.num_batches_tracked, layer1.0.conv3.weight, layer1.0.bn3.weight, layer1.0.bn3.bias, layer1.0.bn3.running_mean, layer1.0.bn3.running_var, layer1.0.bn3.num_batches_tracked, layer1.0.downsample.0.weight, layer1.0.downsample.1.weight, layer1.0.downsample.1.bias, layer1.0.downsample.1.running_mean, layer1.0.downsample.1.running_var, layer1.0.downsample.1.num_batches_tracked, layer1.1.conv1.weight, layer1.1.bn1.weight, layer1.1.bn1.bias, layer1.1.bn1.running_mean, layer1.1.bn1.running_var, layer1.1.bn1.num_batches_tracked, layer1.1.conv2.weight, layer1.1.bn2.weight, layer1.1.bn2.bias, layer1.1.bn2.running_mean, layer1.1.bn2.running_var, layer1.1.bn2.num_batches_tracked, layer1.1.conv3.weight, layer1.1.bn3.weight, layer1.1.bn3.bias, layer1.1.bn3.running_mean, layer1.1.bn3.running_var, layer1.1.bn3.num_batches_tracked, layer1.2.conv1.weight, layer1.2.bn1.weight, layer1.2.bn1.bias, layer1.2.bn1.running_mean, layer1.2.bn1.running_var, layer1.2.bn1.num_batches_tracked, layer1.2.conv2.weight, layer1.2.bn2.weight, layer1.2.bn2.bias, layer1.2.bn2.running_mean, layer1.2.bn2.running_var, layer1.2.bn2.num_batches_tracked, layer1.2.conv3.weight, layer1.2.bn3.weight, layer1.2.bn3.bias, layer1.2.bn3.running_mean, layer1.2.bn3.running_var, layer1.2.bn3.num_batches_tracked, layer2.0.conv1.weight, layer2.0.bn1.weight, layer2.0.bn1.bias, layer2.0.bn1.running_mean, layer2.0.bn1.running_var, layer2.0.bn1.num_batches_tracked, layer2.0.conv2.weight, layer2.0.bn2.weight, layer2.0.bn2.bias, layer2.0.bn2.running_mean, layer2.0.bn2.running_var, layer2.0.bn2.num_batches_tracked, layer2.0.conv3.weight, layer2.0.bn3.weight, layer2.0.bn3.bias, layer2.0.bn3.running_mean, layer2.0.bn3.running_var, layer2.0.bn3.num_batches_tracked, layer2.0.downsample.0.weight, layer2.0.downsample.1.weight, layer2.0.downsample.1.bias, layer2.0.downsample.1.running_mean, layer2.0.downsample.1.running_var, layer2.0.downsample.1.num_batches_tracked, layer2.1.conv1.weight, layer2.1.bn1.weight, layer2.1.bn1.bias, layer2.1.bn1.running_mean, layer2.1.bn1.running_var, layer2.1.bn1.num_batches_tracked, layer2.1.conv2.weight, layer2.1.bn2.weight, layer2.1.bn2.bias, layer2.1.bn2.running_mean, layer2.1.bn2.running_var, layer2.1.bn2.num_batches_tracked, layer2.1.conv3.weight, layer2.1.bn3.weight, layer2.1.bn3.bias, layer2.1.bn3.running_mean, layer2.1.bn3.running_var, layer2.1.bn3.num_batches_tracked, layer2.2.conv1.weight, layer2.2.bn1.weight, layer2.2.bn1.bias, layer2.2.bn1.running_mean, layer2.2.bn1.running_var, layer2.2.bn1.num_batches_tracked, layer2.2.conv2.weight, layer2.2.bn2.weight, layer2.2.bn2.bias, layer2.2.bn2.running_mean, layer2.2.bn2.running_var, layer2.2.bn2.num_batches_tracked, layer2.2.conv3.weight, layer2.2.bn3.weight, layer2.2.bn3.bias, layer2.2.bn3.running_mean, layer2.2.bn3.running_var, layer2.2.bn3.num_batches_tracked, layer2.3.conv1.weight, layer2.3.bn1.weight, layer2.3.bn1.bias, layer2.3.bn1.running_mean, layer2.3.bn1.running_var, layer2.3.bn1.num_batches_tracked, layer2.3.conv2.weight, layer2.3.bn2.weight, layer2.3.bn2.bias, layer2.3.bn2.running_mean, layer2.3.bn2.running_var, layer2.3.bn2.num_batches_tracked, 
layer2.3.conv3.weight, layer2.3.bn3.weight, layer2.3.bn3.bias, layer2.3.bn3.running_mean, layer2.3.bn3.running_var, layer2.3.bn3.num_batches_tracked, layer3.0.conv1.weight, layer3.0.bn1.weight, layer3.0.bn1.bias, layer3.0.bn1.running_mean, layer3.0.bn1.running_var, layer3.0.bn1.num_batches_tracked, layer3.0.conv2.weight, layer3.0.bn2.weight, layer3.0.bn2.bias, layer3.0.bn2.running_mean, layer3.0.bn2.running_var, layer3.0.bn2.num_batches_tracked, layer3.0.conv3.weight, layer3.0.bn3.weight, layer3.0.bn3.bias, layer3.0.bn3.running_mean, layer3.0.bn3.running_var, layer3.0.bn3.num_batches_tracked, layer3.0.downsample.0.weight, layer3.0.downsample.1.weight, layer3.0.downsample.1.bias, layer3.0.downsample.1.running_mean, layer3.0.downsample.1.running_var, layer3.0.downsample.1.num_batches_tracked, layer3.1.conv1.weight, layer3.1.bn1.weight, layer3.1.bn1.bias, layer3.1.bn1.running_mean, layer3.1.bn1.running_var, layer3.1.bn1.num_batches_tracked, layer3.1.conv2.weight, layer3.1.bn2.weight, layer3.1.bn2.bias, layer3.1.bn2.running_mean, layer3.1.bn2.running_var, layer3.1.bn2.num_batches_tracked, layer3.1.conv3.weight, layer3.1.bn3.weight, layer3.1.bn3.bias, layer3.1.bn3.running_mean, layer3.1.bn3.running_var, layer3.1.bn3.num_batches_tracked, layer3.2.conv1.weight, layer3.2.bn1.weight, layer3.2.bn1.bias, layer3.2.bn1.running_mean, layer3.2.bn1.running_var, layer3.2.bn1.num_batches_tracked, layer3.2.conv2.weight, layer3.2.bn2.weight, layer3.2.bn2.bias, layer3.2.bn2.running_mean, layer3.2.bn2.running_var, layer3.2.bn2.num_batches_tracked, layer3.2.conv3.weight, layer3.2.bn3.weight, layer3.2.bn3.bias, layer3.2.bn3.running_mean, layer3.2.bn3.running_var, layer3.2.bn3.num_batches_tracked, layer3.3.conv1.weight, layer3.3.bn1.weight, layer3.3.bn1.bias, layer3.3.bn1.running_mean, layer3.3.bn1.running_var, layer3.3.bn1.num_batches_tracked, layer3.3.conv2.weight, layer3.3.bn2.weight, layer3.3.bn2.bias, layer3.3.bn2.running_mean, layer3.3.bn2.running_var, layer3.3.bn2.num_batches_tracked, layer3.3.conv3.weight, layer3.3.bn3.weight, layer3.3.bn3.bias, layer3.3.bn3.running_mean, layer3.3.bn3.running_var, layer3.3.bn3.num_batches_tracked, layer3.4.conv1.weight, layer3.4.bn1.weight, layer3.4.bn1.bias, layer3.4.bn1.running_mean, layer3.4.bn1.running_var, layer3.4.bn1.num_batches_tracked, layer3.4.conv2.weight, layer3.4.bn2.weight, layer3.4.bn2.bias, layer3.4.bn2.running_mean, layer3.4.bn2.running_var, layer3.4.bn2.num_batches_tracked, layer3.4.conv3.weight, layer3.4.bn3.weight, layer3.4.bn3.bias, layer3.4.bn3.running_mean, layer3.4.bn3.running_var, layer3.4.bn3.num_batches_tracked, layer3.5.conv1.weight, layer3.5.bn1.weight, layer3.5.bn1.bias, layer3.5.bn1.running_mean, layer3.5.bn1.running_var, layer3.5.bn1.num_batches_tracked, layer3.5.conv2.weight, layer3.5.bn2.weight, layer3.5.bn2.bias, layer3.5.bn2.running_mean, layer3.5.bn2.running_var, layer3.5.bn2.num_batches_tracked, layer3.5.conv3.weight, layer3.5.bn3.weight, layer3.5.bn3.bias, layer3.5.bn3.running_mean, layer3.5.bn3.running_var, layer3.5.bn3.num_batches_tracked

missing keys in source state_dict: norm.weight, norm.bias, norm.running_mean, norm.running_var

03/27 11:06:06 - mmengine - INFO - Checkpoints will be saved to /content/mmselfsup/work_dirs/benchmarks/mmdetection/voc0712/faster_rcnn_r50_c4_ms-3k_voc0712_test/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/mmdet/.mim/tools/train.py", line 124, in <module>
    main()
  File "/usr/local/lib/python3.9/dist-packages/mmdet/.mim/tools/train.py", line 120, in main
    runner.train()
  File "/usr/local/lib/python3.9/dist-packages/mmengine/runner/runner.py", line 1701, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.9/dist-packages/mmengine/runner/loops.py", line 277, in run
    data_batch = next(self.dataloader_iterator)
  File "/usr/local/lib/python3.9/dist-packages/mmengine/runner/loops.py", line 164, in __next__
    data = next(self._iterator)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.9/dist-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.9/dist-packages/mmengine/dataset/utils.py", line 94, in pseudo_collate
    return data_item_type({
  File "/usr/local/lib/python3.9/dist-packages/mmengine/dataset/utils.py", line 95, in <dictcomp>
    key: pseudo_collate([d[key] for d in data_batch])
  File "/usr/local/lib/python3.9/dist-packages/mmengine/dataset/utils.py", line 78, in pseudo_collate
    raise RuntimeError(
RuntimeError: each data_itement in list of batch should be of equal size

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1831) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.9/dist-packages/mmdet/.mim/tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-27_11:06:09
  host      : 3c4ddff49514
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1831)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/usr/local/bin/mim", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/mim/commands/train.py", line 100, in cli
    is_success, msg = train(
  File "/usr/local/lib/python3.9/dist-packages/mim/commands/train.py", line 261, in train
    ret = subprocess.check_call(
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)

Additional information

No response

fangyixiao18 commented 1 year ago

Since the detection model is structured somewhat differently from the self-supervised model, the checkpoint is loaded into the model in separate parts, so a warning is printed saying the state dict does not match exactly. It is not an error (see the sketch below).
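For reference, the "load backbone. in model from: ..." lines in the log come from loading the checkpoint with a key prefix, roughly as in the sketch below. This is only an illustration of the mechanism; the field names and the exact place it appears in the benchmark config may differ.

```python
# Sketch only: load the DenseCL checkpoint into one sub-module by stripping the
# 'backbone.' prefix from the state-dict keys. Keys belonging to other
# sub-modules are reported as "unexpected" in the warning and simply skipped.
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='/content/mmselfsup/checkpoints/'
            'densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth',
            prefix='backbone.')))
```

So the warning just lists the keys that a given sub-module does not use, which is expected here.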

For the RuntimeError, you could set a breakpoint to inspect the data items right before the error is raised (see the sketch below). You could also open an issue in mmdet for more detailed help; we will check it as soon as possible as well.
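One minimal way to do that inspection, as a sketch rather than an official recipe: rebuild the training dataset from the config and print the structure of a few raw samples before any collation, since `pseudo_collate` requires every sample in a batch to have the same structure. It assumes you run it from the mmselfsup working directory with the VOC data in place, and that mmdet 3.x's `register_all_modules` helper is available; the config path is the one from this issue.

```python
# Sketch: inspect raw dataset samples before pseudo_collate ever sees them.
from mmdet.utils import register_all_modules
from mmdet.registry import DATASETS
from mmengine.config import Config

register_all_modules()  # register mmdet datasets/transforms and set the scope

cfg = Config.fromfile(
    'configs/benchmarks/mmdetection/voc0712/'
    'faster_rcnn_r50_c4_ms-3k_voc0712_test.py')
dataset = DATASETS.build(cfg.train_dataloader.dataset)

for i in range(4):
    sample = dataset[i]
    print(f'sample {i}:')
    for key, value in sample.items():
        # Lists/tuples whose length differs between samples are what trigger
        # "each data_itement in list of batch should be of equal size".
        if isinstance(value, (list, tuple)):
            print(f'  {key}: {type(value).__name__} of length {len(value)}')
        else:
            print(f'  {key}: {type(value).__name__}')
```

Alternatively, setting `num_workers=0` (and `persistent_workers=False`) in `train_dataloader` should make the error occur in the main process instead of a worker, so a plain `pdb` breakpoint placed near the collate call works.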

Acuhe commented 1 year ago

Sorry to bother you, but I don't know how to "set a breakpoint to check the data_itement before raising the error" from the config file. I'd like to print the size of the input data in the dataloader, but I have no idea which variable actually stores the batch information or the loaded data... I would really appreciate some brief guidance! @fangyixiao18