taoyang1122 / adapt-image-models

[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition
Apache License 2.0

I would like to know what testing protocol the 88.9% on Diving48 is based on. #46

Open Changwei-Ouyang opened 11 months ago

Changwei-Ouyang commented 11 months ago

Regarding the 88.9% accuracy reported in the paper for ViT-B on the Diving48 dataset, I would like to know which testing protocol this result is based on.

With the following validation settings, testing vit_b_clip_32frame_diving48.pth gives 88.88%:

    val_pipeline = [
        dict(type='DecordInit'),
        dict(
            type='SampleFrames',
            clip_len=32,
            frame_interval=16,
            num_clips=1,
            frame_uniform=True,
            test_mode=True),
        dict(type='DecordDecode'),
        dict(type='Resize', scale=(-1, 256)),
        dict(type='CenterCrop', crop_size=224),
        dict(type='Flip', flip_ratio=0),
        dict(type='Normalize', **img_norm_cfg),
        dict(type='FormatShape', input_format='NCTHW'),
        dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
        dict(type='ToTensor', keys=['imgs'])
    ]

With the following test configuration, the result is lower than 88.9%:

    test_pipeline = [
        dict(type='DecordInit'),
        dict(
            type='SampleFrames',
            clip_len=32,
            frame_interval=16,
            num_clips=1,
            frame_uniform=True,
            test_mode=True),
        dict(type='DecordDecode'),
        dict(type='Resize', scale=(-1, 224)),
        dict(type='ThreeCrop', crop_size=224),
        dict(type='Flip', flip_ratio=0),
        dict(type='Normalize', **img_norm_cfg),
        dict(type='FormatShape', input_format='NCTHW'),
        dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
        dict(type='ToTensor', keys=['imgs'])
    ]

Moreover, the ThreeCrop operation does not match the 32×1×1 setting stated in the paper. I would therefore like to understand the testing protocol behind the reported 88.9% result.
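To make the ThreeCrop mismatch concrete, here is a minimal sketch that counts the views implied by each pipeline. It assumes 32×1×1 is read as frames × temporal clips × spatial crops (the ordering is my assumption), and `count_views` is a hypothetical helper written for this illustration, not part of MMAction2 or this repository. The pipelines are trimmed to the steps that affect the view count.

```python
def count_views(pipeline):
    """Return (frames, temporal clips, spatial crops) implied by a pipeline config.

    Hypothetical helper for illustration only; it inspects the config dicts
    and does not run any actual data loading.
    """
    frames, clips, crops = None, 1, 1
    for step in pipeline:
        if step["type"] == "SampleFrames":
            frames = step["clip_len"]
            clips = step.get("num_clips", 1)
        elif step["type"] == "ThreeCrop":
            # ThreeCrop yields three spatial crops per clip.
            crops = 3
        elif step["type"] == "CenterCrop":
            # CenterCrop yields a single spatial crop per clip.
            crops = 1
    return frames, clips, crops


# Only the steps relevant to view counting are kept from the configs above.
sample = dict(type="SampleFrames", clip_len=32, frame_interval=16,
              num_clips=1, frame_uniform=True, test_mode=True)

val_pipeline = [
    sample,
    dict(type="Resize", scale=(-1, 256)),
    dict(type="CenterCrop", crop_size=224),
]

test_pipeline = [
    sample,
    dict(type="Resize", scale=(-1, 224)),
    dict(type="ThreeCrop", crop_size=224),
]

print(count_views(val_pipeline))   # (32, 1, 1)
print(count_views(test_pipeline))  # (32, 1, 3)
```

Under that reading, only the CenterCrop validation pipeline corresponds to 32×1×1, while the ThreeCrop test pipeline corresponds to 32×1×3.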