lingyunwu14 / STFT

Spatial-Temporal Feature Transformation for Video Object Detection, MICCAI2021

Training and testing on VID dataset #1

Closed: KennithLi closed this issue 2 years ago

KennithLi commented 2 years ago

Thank you for sharing the code! There is some code for the VID dataset in ./data/datasets/vid.py and stft_core/config/path_catalog.py. Have you trained and tested the model on the VID dataset? How does it perform? It seems to perform quite badly when I use the following config:

MODEL:
  VID:
    ENABLE: True
    METHOD: "stft"
    STFT:
      MIN_OFFSET: -9
      MAX_OFFSET: 9
      TRAIN_REF_NUM: 2
      TEST_REF_NUM: 10
  META_ARCHITECTURE: "GeneralizedRCNNSTFT"
  WEIGHT: "catalog://ImageNetPretrained/MSRA/R-50"
  RPN_ONLY: True
  FCOS_ON: True
  STFT_ON: True
  BACKBONE:
    CONV_BODY: "R-50-FPN-RETINANET"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
    STAGE_WITH_DCN: (False, True, True, True)
    WITH_MODULATED_DCN: False
    DEFORMABLE_GROUPS: 1
    STAGE_WITH_GCB: (False, True, True, True)
  RETINANET:
    USE_C5: False
  FCOS:
    NUM_CLASSES: 31
    FPN_STRIDES: [8, 16, 32, 64, 128]
    INFERENCE_TH: 0.05
    NMS_TH: 0.6
    PRE_NMS_TOP_N: 1000
    NORM_REG_TARGETS: True
    CENTERNESS_ON_REG: True
    CENTER_SAMPLING_RADIUS: 1.5
    IOU_LOSS_TYPE: "giou"
  STFT:
    OFFSET_WEIGHT_STD: 0.01
    IOU_THRESH: 0.1
    BBOX_STD: [0.5, 0.5, 0.5, 0.5]
    REG_BETA: 0.11
DATASETS:
  TRAIN: ("VID_train_15frames",)
  TEST: ("VID_val_videos",)
INPUT:
  MIN_SIZE_TRAIN: (800,)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
DATALOADER:
  SIZE_DIVISIBILITY: 32
  NUM_WORKERS: 4
SOLVER:
  BASE_LR: 0.0005
  WEIGHT_DECAY: 0.0001
  IMS_PER_BATCH: 3
  WARMUP_METHOD: "linear"
  WARMUP_ITERS: 500
  CHECKPOINT_PERIOD: 125
  TEST_PERIOD: 125
  MAX_ITER: 6000
  LR_TYPE: "step"
  GAMMA: 0.5
  STEPS: (4000, 5000, 5500)
TEST:
  IMS_PER_BATCH: 3
  DETECTIONS_PER_IMG: 300

The evaluation results after 125 iterations are as follows:

AP50 | motion=   all = 0.0015
Category AP:
airplane        : 0.0099
antelope        : 0.0035
bear            : 0.0018
bicycle         : 0.0000
bird            : 0.0012
bus             : 0.0000
car             : 0.0012
cattle          : 0.0000
dog             : 0.0068
domestic_cat    : 0.0002
elephant        : 0.0019
fox             : 0.0000
giant_panda     : 0.0015
hamster         : 0.0083
horse           : 0.0007
lion            : 0.0000
lizard          : 0.0000
monkey          : 0.0002
motorcycle      : 0.0012
rabbit          : 0.0018
red_panda       : 0.0003
sheep           : 0.0000
snake           : 0.0000
squirrel        : 0.0004
tiger           : 0.0012
train           : 0.0007
turtle          : 0.0002
watercraft      : 0.0006
whale           : 0.0000
zebra           : 0.0024

Can you provide some suggestions?

lingyunwu14 commented 2 years ago

Hi, sorry for the late reply. We did run the experiment on the VID dataset. Using ResNet-50 as the backbone and FCOS as the baseline, STFT reaches 76.4%. The "DATASETS" and "SOLVER" sections in the above config can be kept consistent with "configs/BASE_RCNN_4gpu.yaml". The other adjustments are as follows (a merged sketch is shown after the snippet):

MODEL:
  VID:
    STFT:
      MIN_OFFSET: -18
      MAX_OFFSET: 18
  STFT:
    OFFSET_WEIGHT_STD: 0.1
    PRED_CONV_KERNEL: 1
    IOU_THRESH: 0.6
    REG_BETA: 0.0
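
Applying these overrides to the earlier config, the relevant MODEL sections would look as follows. This is only a sketch assembled from the two snippets in this thread: PRED_CONV_KERNEL is a new key placed alongside the existing MODEL.STFT entries, and every field not listed here stays as in the original config.

MODEL:
  VID:
    ENABLE: True
    METHOD: "stft"
    STFT:
      MIN_OFFSET: -18
      MAX_OFFSET: 18
      TRAIN_REF_NUM: 2
      TEST_REF_NUM: 10
  STFT:
    OFFSET_WEIGHT_STD: 0.1
    PRED_CONV_KERNEL: 1
    IOU_THRESH: 0.6
    BBOX_STD: [0.5, 0.5, 0.5, 0.5]
    REG_BETA: 0.0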

Although 76.4% does not exceed the SOTA on VID, STFT still shows a larger improvement over its image-based baseline FCOS, 8.5% (67.9% -> 76.4%), than other video-based methods, e.g. FGFA with 3.4% (70.6% -> 74.0%) and RDN with 4.4% (71.8% -> 76.2%). In addition, STFT has the fewest model parameters, 43M, compared to FGFA (89M) and RDN (53M). I don't have enough time to tune it, and VID is very different from endoscopic video data. If you have some free time and are interested in it, I look forward to your contribution.