sming256 / OpenTAD

OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.
Apache License 2.0

epic-kitchens100 eval is nan #10

Closed: rixejzvdl649 closed this issue 2 months ago

rixejzvdl649 commented 2 months ago

#===========================================================================================================
annotation_path = "train_code/OpenTAD/data/epic_kitchens-100/annotations/epic_kitchens_verb.json"
class_map = "train_code/OpenTAD/data/epic_kitchens-100/annotations/category_idx_verb.txt"
data_path = "data/video_dataset/EPIC-Kitchens100/epic_kitchens_100_30fps_512x288/"
block_list = None

window_size = 768*8
scale_factor = 1
chunk_num = window_size * scale_factor // 16
# 768*8/16 = 384 chunks, since VideoMAE takes 16 frames as input

# ==================================================
#Video Preprocessing Sliding Window
#Frame Stride 2
#Frame Number 768×8
# ==================================================

dataset = dict(
    train=dict(
        type="EpicKitchensSlidingDataset",
        ann_file=annotation_path,
        subset_name="training",
        block_list=block_list,
        class_map=class_map,
        data_path=data_path,
        filter_gt=False,
        # ==================================================
        #Video Preprocessing Sliding Window
        #Frame Stride 2
        #Frame Number 768×8
        # ==================================================
        feature_stride=2,
        sample_stride=1,

        fps=30,

        offset_frames=8,

        window_size=window_size,
        window_overlap_ratio=0.5,
        # ==================================================
        pipeline=[
            dict(type="PrepareVideoInfo", format="mp4"),
            dict(type="mmaction.DecordInit", num_threads=4),
            dict(type="LoadFrames", 
                 num_clips=1, 
                 method="sliding_window", 
                 scale_factor=scale_factor),
            dict(type="mmaction.DecordDecode"),
            #================================================================
            #Frame Resolution 160×160
            #RandomResizedCrop + Flip + ImgAug + ColorJitter
            #================================================================
            dict(type="mmaction.Resize", scale=(-1, 182)),
            dict(type="mmaction.RandomResizedCrop"),
            dict(type="mmaction.Resize", scale=(160, 160), keep_ratio=False),
            dict(type="mmaction.Flip", flip_ratio=0.5),
            dict(type="mmaction.ImgAug", transforms="default"),
            dict(type="mmaction.ColorJitter"),
            dict(type="mmaction.FormatShape", input_format="NCTHW"),
            dict(type="ConvertToTensor", keys=["imgs", "gt_segments", "gt_labels"]),
            dict(type="Collect", inputs="imgs", keys=["masks", "gt_segments", "gt_labels"]),
        ],
    ),
    val=dict(
        type="EpicKitchensSlidingDataset",
        ann_file=annotation_path,
        subset_name="val",
        block_list=block_list,
        class_map=class_map,
        data_path=data_path,
        filter_gt=False,
        # ==================================================

        feature_stride=2,
        sample_stride=1,

        fps=30,

        offset_frames=8,

        window_size=window_size,
        window_overlap_ratio=0.5,
        # ==================================================
        pipeline=[
            dict(type="PrepareVideoInfo", format="mp4"),
            dict(type="mmaction.DecordInit", num_threads=4),
            dict(type="LoadFrames", 
                 num_clips=1, 
                 method="sliding_window", 
                 scale_factor=scale_factor),
            dict(type="mmaction.DecordDecode"),
            dict(type="mmaction.Resize", scale=(-1, 160)),
            dict(type="mmaction.CenterCrop", crop_size=160),
            dict(type="mmaction.FormatShape", input_format="NCTHW"),
            dict(type="ConvertToTensor", keys=["imgs", "gt_segments", "gt_labels"]),
            dict(type="Collect", inputs="imgs", keys=["masks", "gt_segments", "gt_labels"]),
        ],
    ),
    test=dict(
        type="EpicKitchensSlidingDataset",
        ann_file=annotation_path,
        subset_name="val",
        block_list=block_list,
        class_map=class_map,
        data_path=data_path,
        filter_gt=False,
        # ==================================================
        test_mode=True,

        feature_stride=2,
        sample_stride=1,

        fps=30,

        offset_frames=8,

        window_size=window_size,
        window_overlap_ratio=0.5,
        # ==================================================
        pipeline=[
            dict(type="PrepareVideoInfo", format="mp4"),
            dict(type="mmaction.DecordInit", num_threads=4),
            dict(type="LoadFrames", 
                 num_clips=1, 
                 method="sliding_window", 
                 scale_factor=scale_factor),
            dict(type="mmaction.DecordDecode"),
            dict(type="mmaction.Resize", scale=(-1, 160)),
            dict(type="mmaction.CenterCrop", crop_size=160),
            dict(type="mmaction.FormatShape", input_format="NCTHW"),
            dict(type="ConvertToTensor", keys=["imgs"]),
            dict(type="Collect", inputs="imgs", keys=["masks"]),
        ],
    ),
)

evaluation = dict(
    type="mAP",
    subset="validation",
    tiou_thresholds=[0.3, 0.4, 0.5, 0.6, 0.7],
    ground_truth_filename=annotation_path,
)
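# `subset` is compared against the subset labels stored in ground_truth_filename,
# so the two must agree for any ground-truth instances to be loaded.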

#===========================================================================================================
_base_ = [
    "/mnt2/ninghuayang/train_code/OpenTAD/configs/_base_/models/actionformer.py",
]
model = dict(
    backbone=dict(
        type="mmaction.Recognizer3D",
        backbone=dict(
            type="VisionTransformerAdapter",
            img_size=224,
            patch_size=16,
            embed_dims=1024,
            depth=24,
            num_heads=16,
            mlp_ratio=4,
            qkv_bias=True,
            num_frames=16,
            drop_path_rate=0.1,
            norm_cfg=dict(type="LN", eps=1e-6),
            return_feat_map=True,
            with_cp=True,  # enable activation checkpointing
            total_frames=window_size * scale_factor,
            adapter_index=list(range(24)),
        ),
        data_preprocessor=dict(
            type="mmaction.ActionDataPreprocessor",
            mean=[123.675, 116.28, 103.53],
            std=[58.395, 57.12, 57.375],
            format_shape="NCTHW",
        ),
        custom=dict(
            pretrain="pretrained/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-229dbb03.pth",
            pre_processing_pipeline=[
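                # The Rearrange below splits the long input window into chunk_num
                # clips of 16 frames each and folds them into the batch dimension,
                # so the 16-frame VideoMAE backbone processes one clip at a time.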
                dict(type="Rearrange", 
                     keys=["frames"], 
                     ops="b n c (t1 t) h w -> (b t1) n c t h w", 
                     t1=chunk_num),
            ],
            post_processing_pipeline=[
                #=========================
                #Spatial Average Pooling
                #=========================
                dict(type="Reduce", 
                     keys=["feats"], 
                     ops="b n c t h w -> b c t", 
                     reduction="mean"),
                dict(type="Rearrange", 
                     keys=["feats"], 
                     ops="(b t1) c t -> b c (t1 t)", 
                     t1=chunk_num),
                #=========================
                #Feature Resize Length  768
                #=========================
                dict(type="Interpolate", 
                     keys=["feats"], 
                     size=768),
            ],
            norm_eval=False,  # also update the norm layers
            freeze_backbone=False,  # unfreeze the backbone
        ),
    ),
    projection=dict(
        in_channels=1024,
        max_seq_len=768,
        attn_cfg=dict(n_mha_win_size=9),
    ),
    rpn_head=dict(
        num_classes=97,
        prior_generator=dict(
            strides=[1, 2, 4, 8, 16, 32],
            regression_range=[(0, 4), (2, 8), (4, 16), (8, 32), (16, 64), (32, 10000)],
        ),
        loss_normalizer=250,
    ),
)

#=============
#Batch Size 2
#=============
solver = dict(
    train=dict(batch_size=2, num_workers=4),
    val=dict(batch_size=2, num_workers=2),
    test=dict(batch_size=2, num_workers=2),
    clip_grad_norm=1,
    amp=True,
    fp16_compress=True,
    static_graph=True,
    ema=True,
)

optimizer = dict(
    type="AdamW",
    lr=1e-4,
    weight_decay=0.05,
    paramwise=True,
    backbone=dict(
        lr=0,
        weight_decay=0,
        custom=[dict(name="adapter", lr=1e-4, weight_decay=0.05)],
        exclude=["backbone"],
    ),
)
scheduler = dict(type="LinearWarmupCosineAnnealingLR", 
                 #=============
                 #Warmup Epoch 5
                 #=============
                 warmup_epoch=5, 
                 max_epoch=20)

inference = dict(load_from_raw_predictions=False, save_raw_prediction=False)
post_processing = dict(
    pre_nms_topk=5000,
    nms=dict(
        use_soft_nms=True,
        sigma=0.4,
        max_seg_num=2000,
        iou_threshold=0,  # does not matter when use soft nms
        min_score=0.001,
        multiclass=True,
        voting_thresh=0.75,  #  set 0 to disable
    ),
    save_dict=False,
)

#=============
#Total Epoch 35
#=============
workflow = dict(
    logging_interval=10,
    checkpoint_interval=2,
    val_loss_interval=-1,
    val_eval_interval=2,
    val_start_epoch=2,
    end_epoch=60,
)

work_dir = "exps"

rixejzvdl649 commented 2 months ago
2024-05-11 09:45:57 Train INFO: Evaluation starts...
2024-05-11 09:45:58 Train INFO: Loaded annotations from validation subset.
2024-05-11 09:45:58 Train INFO: Number of ground truth instances: 0
2024-05-11 09:45:58 Train INFO: Number of predictions: 261981
2024-05-11 09:45:58 Train INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-05-11 09:45:58 Train INFO: Average-mAP:  nan (%)
2024-05-11 09:45:58 Train INFO: mAP at tIoU 0.30 is  nan%
2024-05-11 09:45:58 Train INFO: mAP at tIoU 0.40 is  nan%
2024-05-11 09:45:58 Train INFO: mAP at tIoU 0.50 is  nan%
2024-05-11 09:45:58 Train INFO: mAP at tIoU 0.60 is  nan%
2024-05-11 09:45:58 Train INFO: mAP at tIoU 0.70 is  nan%
2024-05-11 09:45:58 Train INFO: [Train]: Epoch 18 started
2024-05-11 09:47:32 Train INFO: [Train]: [018][00010/00637]  Loss=0.7536  cls_loss=0.4252  reg_loss=0.3284  lr_backbone=9.5e-05  lr_det=9.5e-05  mem=49632MB
2024-05-11 09:48:36 Train INFO: [Train]: [018][00020/00637]  Loss=0.6853  cls_loss=0.3880  reg_loss=0.2973  lr_backbone=9.5e-05  lr_det=9.5e-05  mem=49632MB
2024-05-11 09:49:40 Train INFO: [Train]: [018][00030/00637]  Loss=0.7370  cls_loss=0.4174  reg_loss=0.3196  lr_backbone=9.5e-05  lr_det=9.5e-05  mem=49632MB
2024-05-11 09:50:44 Train INFO: [Train]: [018][00040/00637]  Loss=0.7869  cls_loss=0.4448  reg_loss=0.3420  lr_backbone=9.5e-05  lr_det=9.5e-05  mem=49632MB
sming256 commented 2 months ago

I see that your ground-truth instance count is 0, which is not correct. Please check your annotation file and compare it with the released annotation and evaluation config.
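
A quick way to check is to count the ground-truth instances per subset directly from the annotation JSON and compare the subset names with the `subset` value used in the evaluation config. The sketch below assumes an ActivityNet-style layout (a top-level "database" dict with per-video "subset" and "annotations" fields); adjust the keys if your converted file differs.

import json
from collections import Counter

annotation_path = "train_code/OpenTAD/data/epic_kitchens-100/annotations/epic_kitchens_verb.json"

with open(annotation_path) as f:
    anno = json.load(f)

videos = Counter()
instances = Counter()
# assumed layout: {"database": {video_id: {"subset": ..., "annotations": [...]}}}
for video_id, info in anno.get("database", {}).items():
    subset = info.get("subset", "unknown")
    videos[subset] += 1
    instances[subset] += len(info.get("annotations", []))

print("videos per subset:   ", dict(videos))
print("instances per subset:", dict(instances))
# If no subset is literally named "validation" (the value used by `evaluation`),
# or its instance count is 0, the evaluator has nothing to match and every mAP
# comes out as nan, which is what the log above shows.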