open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Concat 2 different types of datasets for train, val and test · Issue #8890

Open Evil1m opened 1 year ago

Evil1m commented 1 year ago

Prerequisite

🐞 Describe the bug

I have 2 datasets: Dataset_A has 15 classes in COCO format, and Dataset_B has 20 classes in VOC format.

I followed the format described in the note section of https://mmdetection.readthedocs.io/en/latest/tutorials/customize_dataset.html#concatenate-dataset to create a concatenated dataset. The config looks like this:

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1024, 1024), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 1024),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
dota_train_dict = dict(
    type='RepeatDataset',
    times=1,
    dataset=dict(
        type='CocoDataset',
        ann_file='ann_path/train_ann.json',
        img_prefix='img_prefix/images/',
        pipeline=train_pipeline))
dota_val_dict = dict(
    type='CocoDataset',
    ann_file='ann_path/val_ann.json',
    img_prefix='img_prefix/images/',
    pipeline=test_pipeline)
dota_test_dict = dict(
    type='CocoDataset',
    ann_file='ann_path/val_ann.json',
    img_prefix='img_prefix/images/',
    pipeline=test_pipeline)
mar20_train_dict = dict(
    type='RepeatDataset',
    times=3,
    dataset=dict(
        type='VOCDataset',
        ann_file='xxxxxxxxxx/ImageSets/Main/train.txt',
        img_prefix='xxxxxxxxxxxxxxxx/JPEGImage/',
        pipeline=train_pipeline))
# mar20_val_dict = dict(
#     type='VOCDataset',
#     ann_file='xxxxxxxxxx/ImageSets/Main/train.txt',
#     img_prefix='xxxxxxxxxxxxxxxx/JPEGImage/',
#     pipeline=test_pipeline)
# mar20_test_dict = dict(
#     type='VOCDataset',
#     ann_file='xxxxxxxxxx/ImageSets/Main/train.txt',
#     img_prefix='xxxxxxxxxxxxxxxx/JPEGImage/',
#     pipeline=test_pipeline)
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=2,
    train=[dota_train_dict, mar20_train_dict],
    val=dota_val_dict,
    test=dota_test_dict)
evaluation = dict(interval=1, metric='bbox')

I tried to train this concatenated dataset with a standard Faster R-CNN R50-FPN network, so I set num_classes to A+B, which is 35 here (screenshot).
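For reference, in an mmdet 2.x config that override typically sits on the model dict; a minimal sketch (the base config path is an assumption, not the poster's exact file):

_base_ = '../_base_/models/faster_rcnn_r50_fpn.py'  # hypothetical base config
model = dict(
    roi_head=dict(
        bbox_head=dict(
            num_classes=35)))  # 15 COCO classes + 20 VOC classes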

Then I changed the CLASSES tuple in both /mmdet/datasets/coco.py and /mmdet/datasets/voc.py to the merged list (screenshot), which also has num = 35.
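Presumably the edit amounts to merging the two class tuples; a sketch of the resulting 35-name tuple (the full names are spelled out in CASE3 below):

# Sketch of the CLASSES edit in mmdet/datasets/coco.py (and similarly voc.py).
dota_names = ('plane', 'baseball-diamond', 'bridge', 'ground-track-field',
              'small-vehicle', 'large-vehicle', 'ship', 'tennis-court',
              'basketball-court', 'storage-tank', 'soccer-ball-field',
              'roundabout', 'harbor', 'swimming-pool', 'helicopter')
mar20_names = tuple(f'A{i}' for i in range(1, 21))  # 'A1' .. 'A20'
CLASSES = dota_names + mar20_names
assert len(CLASSES) == 35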

CASE1: when I started training, everything was normal, but I got an error once the first training epoch and the first validation pass finished.

2022-09-27 15:04:44,246 - mmdet - INFO - Epoch [1][1850/1898]   lr: 1.000e-04, eta: 3:05:32, time: 0.528, data_time: 0.022, memory: 8168, loss_rpn_cls: 0.1822, loss_rpn_bbox: 0.0940, loss_cls: 0.3737, acc: 91.3897, loss_bbox: 0.2509, loss: 0.9008
2022-09-27 15:05:16,011 - mmdet - INFO - Saving checkpoint at 1 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 10131/10130, 42.3 task/s, elapsed: 240s, ETA:     0s

Traceback (most recent call last):
  File "/home/xxxxx/xxxxx/mmdetection/./tools/train.py", line 185, in <module>
    main()
  File "/home/xxxxx/xxxxx/mmdetection/./tools/train.py", line 174, in main
    train_detector(
  File "/home/xxxxx/xxxxx/mmdetection/mmdet/apis/train.py", line 203, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/xxxxx/anaconda3/envs/openmmlab/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/xxxxx/anaconda3/envs/openmmlab/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/xxxxx/anaconda3/envs/openmmlab/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/xxxxx/anaconda3/envs/openmmlab/lib/python3.9/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/xxxxx/xxxxx/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 123, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/home/xxxxx/anaconda3/envs/openmmlab/lib/python3.9/site-packages/mmcv/runner/hooks/evaluation.py", line 361, in evaluate
    eval_res = self.dataloader.dataset.evaluate(
  File "/home/xxxxx/xxxxx/mmdetection/mmdet/datasets/coco.py", line 445, in evaluate
    result_files, tmp_dir = self.format_results(results, jsonfile_prefix)
  File "/home/xxxxx/xxxxx/mmdetection/mmdet/datasets/coco.py", line 390, in format_results
    result_files = self.results2json(results, jsonfile_prefix)
  File "/home/xxxxx/xxxxx/mmdetection/mmdet/datasets/coco.py", line 322, in results2json
    json_results = self._det2json(results)
  File "/home/xxxxx/xxxxx/mmdetection/mmdet/datasets/coco.py", line 259, in _det2json
    data['category_id'] = self.cat_ids[label]
IndexError: list index out of range

The list index goes out of range here.
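The out-of-range index is explainable: CocoDataset builds cat_ids from the categories listed in the annotation file, so for Dataset_A it has only 15 entries, while the 35-way head can predict label indices up to 34. A minimal illustration (not the actual mmdet code):

cat_ids = list(range(15))      # what CocoDataset loads from val_ann.json's categories
label = 27                     # a prediction index from the 35-class head
category_id = cat_ids[label]   # IndexError: list index out of range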

CASE2: so I changed the Faster R-CNN R50-FPN num_classes to 15, equal to Dataset_A's class count, and at the same time changed the CLASSES in /mmdet/datasets/coco.py to include only Dataset_A's classes and the CLASSES in /mmdet/datasets/voc.py to include only Dataset_B's classes.

But I still got an error:

Done (t=0.94s)
creating index...
Done (t=0.96s)
creating index...
index created!
index created!
2022-09-27 15:40:25,563 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:194: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [0,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:194: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [1,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:194: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [2,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/home/shuzhilian/anaconda3/envs/openmmlab/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 14 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
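The assert is consistent with the setup: with num_classes=15, VOCDataset still produces ground-truth labels up to 19, so the classification loss sees targets outside [0, n_classes). A minimal CPU reproduction of the same mistake (hedged; on CPU it surfaces as an IndexError rather than a device-side assert):

import torch
import torch.nn.functional as F

logits = torch.randn(1, 16)   # 15 foreground classes + 1 background slot
target = torch.tensor([19])   # a VOC label index past the 15-class range
loss = F.cross_entropy(logits, target)  # IndexError: Target 19 is out of bounds.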

CASE3: so I tried it this way, with num_classes still set to 35 and a classes= override passed to each dataset:

train_class = ('plane', 'baseball-diamond', 'bridge', 'ground-track-field',
               'small-vehicle', 'large-vehicle', 'ship', 'tennis-court',
               'basketball-court', 'storage-tank', 'soccer-ball-field',
               'roundabout', 'harbor', 'swimming-pool', 'helicopter',
               'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10',
               'A11', 'A12', 'A13', 'A14', 'A15', 'A16', 'A17', 'A18', 'A19',
               'A20')
dota_class = ('plane', 'baseball-diamond', 'bridge', 'ground-track-field',
              'small-vehicle', 'large-vehicle', 'ship', 'tennis-court',
              'basketball-court', 'storage-tank', 'soccer-ball-field',
              'roundabout', 'harbor', 'swimming-pool', 'helicopter')
mar20_class = ('A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10',
               'A11', 'A12', 'A13', 'A14', 'A15', 'A16', 'A17', 'A18', 'A19',
               'A20')
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1024, 1024), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 1024),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
dota_train_dict = dict(
    type='RepeatDataset',
    times=1,
    dataset=dict(
        type='CocoDataset',
        classes=train_class,
        ann_file='ann_path/train_ann.json',
        img_prefix='img_prefix/images/',
        pipeline=train_pipeline))
dota_val_dict = dict(
    type='CocoDataset',
    classes=dota_class,
    ann_file='ann_path/val_ann.json',
    img_prefix='img_prefix/images/',
    pipeline=test_pipeline)
dota_test_dict = dict(
    type='CocoDataset',
    classes=dota_class,
    ann_file='ann_path/val_ann.json',
    img_prefix='img_prefix/images/',
    pipeline=test_pipeline)
mar20_train_dict = dict(
    type='RepeatDataset',
    times=3,
    dataset=dict(
        type='VOCDataset',
        classes=train_class,
        ann_file='xxxxxxxxxx/ImageSets/Main/train.txt',
        img_prefix='xxxxxxxxxxxxxxxx/JPEGImage/',
        pipeline=train_pipeline))
# mar20_val_dict = dict(
#     type='VOCDataset',
#     ann_file='xxxxxxxxxx/ImageSets/Main/train.txt',
#     img_prefix='xxxxxxxxxxxxxxxx/JPEGImage/',
#     pipeline=test_pipeline)
# mar20_test_dict = dict(
#     type='VOCDataset',
#     ann_file='xxxxxxxxxx/ImageSets/Main/train.txt',
#     img_prefix='xxxxxxxxxxxxxxxx/JPEGImage/',
#     pipeline=test_pipeline)
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=2,
    train=[dota_train_dict,mar20_train_dict],
    val=dota_val_dict,
    test=dota_test_dict
    )
evaluation = dict(interval=1, metric='bbox')

It shows the same problem as CASE1.

I really wonder how to correctly concatenate two different types of datasets for normal training, validation and testing. A second question: it seems that I cannot validate and test these two different types of datasets together after each epoch using the config below:

mar20_val_dict = dict(
    type='VOCDataset',
    ann_file='xxxxxxxxxx/ImageSets/Main/train.txt',
    img_prefix='xxxxxxxxxxxxxxxx/JPEGImage/',
    pipeline=test_pipeline)
mar20_test_dict = dict(
    type='VOCDataset',
    ann_file='xxxxxxxxxx/ImageSets/Main/train.txt',
    img_prefix='xxxxxxxxxxxxxxxx/JPEGImage/',
    pipeline=test_pipeline)
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=2,
    train=[dota_train_dict, mar20_train_dict],
    val=[dota_val_dict, mar20_val_dict],
    test=[dota_test_dict, mar20_test_dict])
evaluation = dict(interval=1, metric='bbox')
evaluation = dict(interval=1, metric='mAP')  # note: this assignment overrides the 'bbox' line above
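One syntax that does exist in mmdet 2.x is the registered ConcatDataset wrapper with a separate_eval flag; a hedged sketch of the val split expressed that way (whether it works across a COCO/VOC mix is exactly the open question here):

# Hedged sketch, not a confirmed fix: express val as one ConcatDataset
# instead of a Python list of configs.
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=2,
    train=[dota_train_dict, mar20_train_dict],
    val=dict(
        type='ConcatDataset',
        datasets=[dota_val_dict, mar20_val_dict],
        separate_eval=True),  # evaluate each constituent dataset on its own
    test=dota_test_dict)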

Environment

fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
CUDA available: True
GPU 0,2,3,4,6,7: TITAN RTX
GPU 1,5: Tesla P40
CUDA_HOME: :/usr/local/cuda-10.2
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~19.10) 7.5.0
PyTorch: 1.10.0+cu102
PyTorch compiling details: PyTorch built with:

TorchVision: 0.11.1+cu102
OpenCV: 4.5.4
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMDetection: 2.19.0+

Additional information

No response

hhaAndroid commented 1 year ago

@Evil1m This problem is a bit complicated and not so easy to deal with. I am thinking about it. Thanks for the feedback!

jacoverster commented 1 year ago

Hi, any feedback on this issue? I have the same problem combining two datasets with one class each. It works when I set num_classes = 1, but then my model ends up with only one class. If I set num_classes = 2, I get the same IndexError.

jacoverster commented 1 year ago

dota_val_dict = dict(
    type='CocoDataset',
    ann_file='ann_path/val_ann.json',
    img_prefix='img_prefix/images/',
    pipeline=test_pipeline)

OK, I seem to have found my issue: when concatenating the two datasets' validation splits, I needed to make sure both categories were listed in the val.json file of each dataset, even when a dataset contains no instances of one class. For example:

"categories": [
        {
            "id": 0,
            "name": "Class from dataset A"
        },
        {
            "id": 1,
            "name": "Class from dataset B"
        }
    ],
...
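For anyone scripting that fix, a small hedged sketch (file paths are hypothetical) that rewrites a dataset's val.json so both splits share the same category list:

import json

# Hypothetical path; adjust to your dataset layout.
with open('dataset_a/val.json') as f:
    ann = json.load(f)

# List the full, shared category set in this file, even if this
# particular dataset contains no instances of one of the classes.
ann['categories'] = [
    {'id': 0, 'name': 'Class from dataset A'},
    {'id': 1, 'name': 'Class from dataset B'},
]

with open('dataset_a/val.json', 'w') as f:
    json.dump(ann, f)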