open-mmlab / mmtracking

OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), Video Instance Segmentation (VIS) with a unified framework.
https://mmtracking.readthedocs.io/en/latest/
Apache License 2.0

Running test.py script on a custom COCOVideo Dataset fails #279

Open · a-pru opened this issue 2 years ago

a-pru commented 2 years ago

Hi,

I'm trying to use MMTrack on a custom dataset organized as a COCOVideo dataset, as shown in the documentation. But when running the tools/test.py script I get an error, because the "results" variable in "/mmdetection/mmdet/datasets/pipelines/loading.py" is a list instead of a dict, so loading the dataset somehow fails...

Error traceback:

Original Traceback (most recent call last):
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/databricks/driver/mmdetection/mmdet/datasets/custom.py", line 193, in __getitem__
    return self.prepare_test_img(idx)
  File "/databricks/driver/mmtracking/mmtrack/datasets/coco_video_dataset.py", line 293, in prepare_test_img
    return self.prepare_data(idx)
  File "/databricks/driver/mmtracking/mmtrack/datasets/coco_video_dataset.py", line 269, in prepare_data
    return self.pipeline(results)
  File "/databricks/driver/mmdetection/mmdet/datasets/pipelines/compose.py", line 41, in __call__
    data = t(data)
  File "/databricks/driver/mmdetection/mmdet/datasets/pipelines/loading.py", line 59, in __call__
    if results['img_prefix'] is not None:
TypeError: list indices must be integers or slices, not str

Dataset JSON (only a shortened version):

{
   "info": {
      "description": "Dataset",
      "version": "0.1",
      "date_created": "2021/09/23"
   },
   "categories": [
      {
         "id": 1,
         "name": "car",
         "supercategory": null
      },
      {
         "id": 2,
         "name": "person",
         "supercategory": null
      }
   ],
   "images": [
      {
         "file_name": "2021-02-12-13-53-43_00000.png",
         "id": "2021-02-12-13-53-43_00000",
         "width": 1280,
         "height": 720,
         "video_id": 0,
         "frame_id": 0
      },
      {
         "file_name": "2021-02-12-13-53-43_00001.png",
         "id": "2021-02-12-13-53-43_00001",
         "width": 1280,
         "height": 720,
         "video_id": 0,
         "frame_id": 1
      }
   ],
   "annotations": [
      {
         "area": 61936,
         "iscrowd": 0,
         "bbox": [
            202,
            18,
            158,
            392
         ],
         "category_id": 1,
         "ignore": 0,
         "segmentation": [ 
         ],
         "image_id": "2021-02-12-13-53-43_00000",
         "video_id": 0,
         "id": 4,
         "instance_id": 1,
         "is_vid_train_frame":  true
      },
      {
         "area": 161036,
         "iscrowd": 0,
         "bbox": [
            963,
            0,
            317,
            508
         ],
         "category_id": 1,
         "ignore": 0,
         "segmentation": [ 
         ],
         "image_id": "2021-02-12-13-53-43_00000",
         "video_id": 0,
         "id": 5,
         "instance_id": 2,
         "is_vid_train_frame":  true
      }
   ],
   "videos": [
      {
         "id": 0,
         "name": "2021-02-12-13-53-43"
      }
   ]
}

I tried to use Tracktor with a standard ReID network and a custom detector which I trained previously using MMDet. Config:

model = dict(
    detector=dict(
        type='FasterRCNN',
        backbone=dict(
            type='ResNet',
            depth=101,
            num_stages=4,
            out_indices=(0, 1, 2, 3),
            frozen_stages=1,
            norm_cfg=dict(type='BN', requires_grad=True),
            norm_eval=True,
            style='pytorch'),
        neck=dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],
            out_channels=256,
            num_outs=5),
        rpn_head=dict(
            type='RPNHead',
            in_channels=256,
            feat_channels=256,
            anchor_generator=dict(
                type='AnchorGenerator',
                scales=[8],
                ratios=[0.5, 1.0, 2.0],
                strides=[4, 8, 16, 32, 64]),
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[1.0, 1.0, 1.0, 1.0]),
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
        roi_head=dict(
            type='StandardRoIHead',
            bbox_roi_extractor=dict(
                type='SingleRoIExtractor',
                roi_layer=dict(
                    type='RoIAlign', output_size=7, sampling_ratio=0),
                out_channels=256,
                featmap_strides=[4, 8, 16, 32]),
            bbox_head=dict(
                type='Shared2FCBBoxHead',
                in_channels=256,
                fc_out_channels=1024,
                roi_feat_size=7,
                num_classes=2,
                bbox_coder=dict(
                    type='DeltaXYWHBBoxCoder',
                    target_means=[0.0, 0.0, 0.0, 0.0],
                    target_stds=[0.1, 0.1, 0.2, 0.2]),
                reg_class_agnostic=False,
                loss_cls=dict(
                    type='CrossEntropyLoss',
                    use_sigmoid=False,
                    loss_weight=1.0),
                loss_bbox=dict(type='L1Loss', loss_weight=1.0))),
        train_cfg=dict(
            rpn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.7,
                    neg_iou_thr=0.3,
                    min_pos_iou=0.3,
                    match_low_quality=True,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=256,
                    pos_fraction=0.5,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=False),
                allowed_border=-1,
                pos_weight=-1,
                debug=False),
            rpn_proposal=dict(
                nms_pre=2000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.5,
                    neg_iou_thr=0.5,
                    min_pos_iou=0.5,
                    match_low_quality=False,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=512,
                    pos_fraction=0.25,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=True),
                pos_weight=-1,
                debug=False)),
        test_cfg=dict(
            rpn=dict(
                nms_pre=1000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                score_thr=0.05,
                nms=dict(type='nms', iou_threshold=0.5),
                max_per_img=100)),
        init_cfg=dict(
            type='Pretrained',
            checkpoint='/tmp/data/dataset/detector.pth')),
    type='Tracktor',
    reid=dict(
        type='BaseReID',
        backbone=dict(
            type='ResNet',
            depth=50,
            num_stages=4,
            out_indices=(3, ),
            style='pytorch'),
        neck=dict(type='GlobalAveragePooling', kernel_size=(8, 4), stride=1),
        head=dict(
            type='LinearReIDHead',
            num_fcs=1,
            in_channels=2048,
            fc_channels=1024,
            out_channels=128,
            num_classes=1705,
            loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
            loss_pairwise=dict(
                type='TripletLoss', margin=0.3, loss_weight=1.0),
            norm_cfg=dict(type='BN1d'),
            act_cfg=dict(type='ReLU')),
        init_cfg=dict(
            type='Pretrained',
            checkpoint=
            'https://download.openmmlab.com/mmtracking/mot/reid/reid_r50_6e_mot20_20210803_212426-c83b1c01.pth'
        )),
    motion=dict(
        type='CameraMotionCompensation',
        warp_mode='cv2.MOTION_EUCLIDEAN',
        num_iters=100,
        stop_eps=1e-05),
    tracker=dict(
        type='TracktorTracker',
        obj_score_thr=0.5,
        regression=dict(
            obj_score_thr=0.5,
            nms=dict(type='nms', iou_threshold=0.6),
            match_iou_thr=0.3),
        reid=dict(
            num_samples=10,
            img_scale=(256, 128),
            img_norm_cfg=None,
            match_score_thr=2.0,
            match_iou_thr=0.2),
        momentums=None,
        num_frames_retain=10))
dataset_type = 'CocoVideoDataset'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]
test_pipeline = [
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CocoVideoDataset',
        ann_file='/tmp/data/tracking_dataset/tracking_train.json',
        img_prefix='/tmp/data/tracking_dataset/train',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        classes=('car', 'person')),
    val=dict(
        type='CocoVideoDataset',
        ann_file='/tmp/data/tracking_dataset/tracking_val.json',
        img_prefix='/tmp/data/tracking_dataset/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        classes=('car', 'person')),
    test=dict(
        type='CocoVideoDataset',
        ann_file='/tmp/data/tracking_dataset/tracking_test.json',
        img_prefix='/tmp/data/tracking_dataset/test',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        classes=('car', 'person')))
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
checkpoint_config = dict(interval=2)
log_config = dict(interval=5, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
resume_from = None
load_from = None
workflow = [('train', 1)]
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=0.01,
    step=[6])
total_epochs = 20
evaluation = dict(metric=['bbox', 'track'], interval=2)
search_metrics = ['MOTA', 'IDF1', 'FN', 'FP', 'IDs', 'MT', 'ML']
classes = ('car', 'person')
work_dir = '/tmp/output_mmt'

Environment (I also tested the most recent commit on the main branch of mmtrack; same problem):

TorchVision: 0.8.2
OpenCV: 4.5.3
MMCV: 1.3.8
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.1
MMTracking: 0.7.0+

Is there maybe an error in my config or dataset JSON? Or could there be a bug in the code? Any help is much appreciated, thank you!

Kind regards, Alexander

GT9505 commented 2 years ago

The error is telling you that results is a list rather than a dict, which means a list was sent to the data pipeline. In MOT testing, we usually send a dict to the data pipeline by setting ref_img_sampler in the CocoVideoDataset to None. Therefore, you need to set ref_img_sampler to None, as shown here.
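
For the config above, a minimal sketch of the test-dataset entry with this setting (only the ref_img_sampler key is new; pipeline and paths stay as in the original config):

data = dict(
    test=dict(
        type='CocoVideoDataset',
        ann_file='/tmp/data/tracking_dataset/tracking_test.json',
        img_prefix='/tmp/data/tracking_dataset/test',
        ref_img_sampler=None,  # send a single dict (not a list) to the test pipeline
        pipeline=[...],        # unchanged test pipeline from the config above
        classes=('car', 'person')))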

a-pru commented 2 years ago

Thank you for your quick response, that indeed solved the problem. I'm now stuck on another problem, this time in regress_tracks() (tracktor_tracker.py).

  File "mmtracking/tools/test.py", line 191, in <module>
    main()
  File "mmtracking/tools/test.py", line 160, in main
    show_score_thr=args.show_score_thr)
  File "/databricks/driver/mmtracking/mmtrack/apis/test.py", line 46, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 42, in forward
    return super().forward(*inputs, **kwargs)
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 97, in new_func
    return old_func(*args, **kwargs)
  File "/databricks/driver/mmtracking/mmtrack/models/mot/base.py", line 135, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/databricks/driver/mmtracking/mmtrack/models/mot/base.py", line 112, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/databricks/driver/mmtracking/mmtrack/models/mot/tracktor.py", line 145, in simple_test
    **kwargs)
  File "/databricks/conda/envs/databricks-ml-gpu/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 184, in new_func
    return old_func(*args, **kwargs)
  File "/databricks/driver/mmtracking/mmtrack/models/mot/trackers/tracktor_tracker.py", line 152, in track
    feats, img_metas, model.detector, frame_id, rescale)
  File "/databricks/driver/mmtracking/mmtrack/models/mot/trackers/tracktor_tracker.py", line 78, in regress_tracks
    ids = ids[valid_inds]
IndexError: index 2 is out of bounds for dimension 0 with size 2

ids: tensor([0, 1]) valid_inds: tensor([0, 2, 3, 1], device='cuda:0')

As far as I understand, this means that in the previous frame two objects were detected (with ids 0 and 1) and in the current frame four objects were detected. This should not lead to an error... or did I misunderstand something again?

GT9505 commented 2 years ago

In the regress_tracks() function, the bboxes of the previous frame are propagated to the current frame.

Please check whether the first dimensions of bboxes and ids are the same. I guess that bboxes and ids may not be matched.

a-pru commented 2 years ago

Yes, the first dimensions of bboxes and ids are equal...

bboxes: torch.Size([2, 4])
ids: torch.Size([2])

GT9505 commented 2 years ago

Then there is something wrong in multiclass_nms(). You can check inside the function.

a-pru commented 2 years ago

If I return keep instead of inds[keep] in multiclass_nms() (Line 93), the tracking with Tracktor works reasonably well. But it's still unclear to me whether this is a bug or whether I'm doing something wrong?

GT9505 commented 2 years ago

What is the score_thr in multiclass_nms()?

a-pru commented 2 years ago

Originally it is 0 (see the function call here), but this leads to errors when an image contains very low-scoring bounding boxes. Using self.regression['obj_score_thr'] instead solved this issue for me.

So to solve my issue I did the following: (1) in multiclass_nms() (Line 93), changed inds[keep] to keep; (2) in regress_tracks() (Line 75), changed 0 to self.regression['obj_score_thr']. Without these two changes I was not able to run Tracktor on my custom dataset, which also contains, e.g., low-scoring bounding boxes. A sketch of change (2) follows below.
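
Paraphrased, change (2) looks roughly like this (a sketch of the multiclass_nms call in regress_tracks(); multiclass_nms is mmdet.core.post_processing.multiclass_nms, and the surrounding variable names are paraphrased from mmtracking 0.7, not copied verbatim):

# Sketch only -- not the exact source of tracktor_tracker.py.
track_bboxes, track_labels, valid_inds = multiclass_nms(
    track_bboxes,
    track_scores,
    self.regression['obj_score_thr'],  # change (2): originally 0
    nms_cfg,                           # the rcnn NMS config passed in here
    return_inds=True)
ids = ids[valid_inds]                  # Line 78, where the IndexError was raised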

GT9505 commented 2 years ago

The inds[keep] can't be changed to keep, since that may introduce some id switches. Could you try changing only the 0 to self.regression['obj_score_thr'], in order to filter low-scoring boxes, and see whether Tracktor works well?

TheRealPoseidon commented 2 years ago

The inds[keep] can't be changed to keep, since that may introduce some id switches. Could you try changing only the 0 to self.regression['obj_score_thr'], in order to filter low-scoring boxes, and see whether Tracktor works well?

Hi GT, I also met this problem, and I believe it's a bug in /mmtrack/models/mot/trackers/tracktor_tracker.py. It only occurs when tracking multi-class targets. Here is an example. Suppose we want to track two classes of targets, pedestrian and car, and in this video only three cars are recorded. Before multiclass_nms, track_bboxes[0] might be a tensor with 3 rows (3 bboxes) and 8 columns (bbox predictions for the 2 classes), and ids might be a tensor with 3 components.

ids = tensor([0,1,2])

Inside multiclass_nms(), the bboxes of all classes are rearranged and reshaped. Therefore, after performing multiclass_nms, valid_inds will be tensor([1, 3, 5]), which obviously contains indices out of range. Best regards, Po
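
To make the index arithmetic concrete, a toy reconstruction (assuming mmdet's flattening order, where the flattened row of box b and class c is b * num_classes + c):

import torch

num_bboxes, num_classes = 3, 2   # three boxes; classes: pedestrian (0), car (1)
ids = torch.tensor([0, 1, 2])

# multiclass_nms reshapes [num_bboxes, num_classes, 4] into
# [num_bboxes * num_classes, 4], so each box contributes one row per class.
# All three boxes are cars (class index 1), hence the surviving flat indices:
valid_inds = torch.tensor([b * num_classes + 1 for b in range(num_bboxes)])
print(valid_inds)   # tensor([1, 3, 5])
# ids[valid_inds]   # IndexError: index 3 is out of bounds for dimension 0 with size 3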

a-pru commented 2 years ago

Hi Po, thanks for your explanation; I'm indeed also dealing with a multi-class tracking problem, and I agree with your findings.

So valid_mask is an array of shape [num_bboxes*num_classes x 1] with the structure [bbox1_class1, bbox1_class2, bbox2_class1, ...]; hence the values in inds lie in the range [0, num_bboxes*num_classes).

To map valid_inds back to the indices of the detected bounding boxes, one could add valid_inds = torch.floor_divide(valid_inds, num_classes) before line 78 in regress_tracks()?
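
With the toy numbers from above, this mapping recovers the original box indices (a quick check, not mmtracking code):

import torch

valid_inds = torch.tensor([1, 3, 5])      # flattened indices from multiclass_nms
print(torch.floor_divide(valid_inds, 2))  # tensor([0, 1, 2]) -> original box indices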

TheRealPoseidon commented 2 years ago

Hi Po, thanks for your explanation; I'm indeed also dealing with a multi-class tracking problem, and I agree with your findings.

So valid_mask is an array of shape [num_bboxes*num_classes x 1] with the structure [bbox1_class1, bbox1_class2, bbox2_class1, ...]; hence the values in inds lie in the range [0, num_bboxes*num_classes).

To map valid_inds back to the indices of the detected bounding boxes, one could add valid_inds = torch.floor_divide(valid_inds, num_classes) before line 78 in regress_tracks()?

Totally agree with you; I used a similar approach.

GT9505 commented 2 years ago

Hi @TheRealPoseidon. There is indeed a bug in tracktor_tracker when tracking multi-class targets. The reason is that one proposal may generate multiple detection boxes after the detector and NMS, as you pointed out.

@a-pru @TheRealPoseidon Adding valid_inds = torch.floor_divide(valid_inds, num_classes) before line 78 in regress_tracks() is a workaround for the bug. However, it may introduce multiple objects with the same id in the current frame if two or more boxes belonging to the same proposal are kept after NMS. Therefore, you need to pick only one box per repeated id when this happens; a sketch of that extra step follows.
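
A minimal sketch of that deduplication step (dedup_ids is a hypothetical helper, not part of mmtracking; it assumes the last column of the NMS output holds the score):

import torch

def dedup_ids(ids, scores):
    # Keep only the highest-scoring box per track id (hypothetical helper).
    keep = torch.zeros_like(ids, dtype=torch.bool)
    for uid in torch.unique(ids):
        idx = (ids == uid).nonzero(as_tuple=True)[0]
        keep[idx[scores[idx].argmax()]] = True
    return keep

# After: valid_inds = torch.floor_divide(valid_inds, num_classes); ids = ids[valid_inds]
# keep = dedup_ids(ids, track_bboxes[:, -1])
# ids, track_bboxes, track_labels = ids[keep], track_bboxes[keep], track_labels[keep]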

meikorol commented 1 year ago

Hi, could you tell me how to make a custom COCOVideo dataset? What is the dataset structure: only JPGs of the frames plus a JSON, or video snippets like VID?

RoyAn2386 commented 1 year ago

Hi, could you tell me how to make a custom COCOVideo dataset? What is the dataset structure: only JPGs of the frames plus a JSON, or video snippets like VID?

Hi, did you finish training with your custom data?