Regarding the reported 88.9% accuracy of ViT-B on the Diving48 dataset in the paper, I would like to know the testing protocol on which this result is based.
val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=16,
        num_clips=1,
        frame_uniform=True,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
With the validation settings above, testing vit_b_clip_32frame_diving48.pth gives 88.88%, which matches the reported 88.9% after rounding.
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=16,
        num_clips=1,
        frame_uniform=True,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
With the test configuration above, the result is lower than 88.9%. Moreover, the ThreeCrop operation does not match the 32×1×1 setting stated in the paper. Could you therefore clarify which testing protocol the reported 88.9% is based on?
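To make the mismatch concrete, here is a minimal sketch of how I count the views each pipeline produces. It assumes that 32×1×1 means frames × temporal clips × spatial crops (the order of the last two may differ in the paper, but both are 1 either way). `count_views` and `CROPS_PER_TRANSFORM` are hypothetical helpers written for this question, not part of the codebase, and the pipelines are trimmed to the two steps that matter.

```python
# Hypothetical helper, not part of the repository: counts the views implied
# by a pipeline config written as a list of dicts.

# Assumed number of spatial crops produced per clip by each crop transform.
CROPS_PER_TRANSFORM = {'CenterCrop': 1, 'ThreeCrop': 3, 'TenCrop': 10}


def count_views(pipeline):
    """Return (clip_len, num_clips, num_crops) for a pipeline config."""
    clip_len, num_clips, num_crops = None, 1, 1
    for step in pipeline:
        if step['type'] == 'SampleFrames':
            clip_len = step['clip_len']
            num_clips = step.get('num_clips', 1)
        elif step['type'] in CROPS_PER_TRANSFORM:
            num_crops = CROPS_PER_TRANSFORM[step['type']]
    return clip_len, num_clips, num_crops


# Only the sampling and cropping steps of the two pipelines above.
val_like = [
    dict(type='SampleFrames', clip_len=32, frame_interval=16, num_clips=1),
    dict(type='CenterCrop', crop_size=224),
]
test_like = [
    dict(type='SampleFrames', clip_len=32, frame_interval=16, num_clips=1),
    dict(type='ThreeCrop', crop_size=224),
]

print(count_views(val_like))   # (32, 1, 1)
print(count_views(test_like))  # (32, 1, 3)
```

Under this reading, only val_pipeline corresponds to 32×1×1, while test_pipeline evaluates three crops per clip.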