taoyang1122 / adapt-image-models

[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition
Apache License 2.0

Inquiring about the number of views during test time. #31

Open Backdrop9019 opened 1 year ago

Backdrop9019 commented 1 year ago

Hello, thank you for the insightful research. In the paper, the number of views at test time is described as: Views = #frames × #temporal × #spatial. From what I understand, #temporal and #spatial are the numbers of temporal and spatial samples taken at test time. I'm not very familiar with mmaction, so I'm not sure which part of the config file to look at. How many views are there for the base-diving48 case?

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=16,
        num_clips=1,
        frame_uniform=True,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]

Is it 32x1x1? What does max_testing_views mean?

I tried looking into the mmaction documentation but couldn't grasp it, so I'm asking here.

Thank you.

taoyang1122 commented 1 year ago

Hi @Backdrop9019, 'num_clips' is #temporal, and dict(type='ThreeCrop', crop_size=224) means three spatial crops (its counterpart, 'CenterCrop', gives one spatial crop). So this is 32x3x1. I believe max_testing_views is used to control test-time memory cost. You may refer to https://mmaction2.readthedocs.io/en/0.x/faq.html
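
For anyone else reading the config, here is a minimal sketch of how the three factors can be read off a test pipeline. It does not import mmaction; `count_views`, `chunk_views` and the `SPATIAL_CROPS` table are hypothetical helpers written for this illustration, and only the two crop types mentioned in this thread are covered. The chunking function reflects an assumption about what max_testing_views does (forwarding the clips of one video in groups of at most that many to bound memory), which this thread does not confirm.

```python
# Hypothetical helpers for illustration only; not part of mmaction2.
# Spatial crop count per crop type (only the two types named in this thread).
SPATIAL_CROPS = {"CenterCrop": 1, "ThreeCrop": 3}


def count_views(pipeline):
    """Read #frames, #temporal and #spatial from a pipeline config list."""
    frames = temporal = None
    spatial = 1  # assume a single spatial view if no known crop step is present
    for step in pipeline:
        if step["type"] == "SampleFrames":
            frames = step["clip_len"]            # frames per clip
            temporal = step.get("num_clips", 1)  # temporal samples per video
        elif step["type"] in SPATIAL_CROPS:
            spatial = SPATIAL_CROPS[step["type"]]
    if frames is None:
        raise ValueError("pipeline has no SampleFrames step")
    return {"frames": frames, "temporal": temporal, "spatial": spatial}


def chunk_views(num_views, max_testing_views):
    """Assumed behaviour: split view indices into groups of limited size."""
    return [
        list(range(start, min(start + max_testing_views, num_views)))
        for start in range(0, num_views, max_testing_views)
    ]


# The steps from the config above that affect the view count.
test_pipeline = [
    dict(type="SampleFrames", clip_len=32, frame_interval=16, num_clips=1,
         frame_uniform=True, test_mode=True),
    dict(type="Resize", scale=(-1, 224)),
    dict(type="ThreeCrop", crop_size=224),
]

views = count_views(test_pipeline)
clips_per_video = views["temporal"] * views["spatial"]
print(views)                            # frames=32, temporal=1, spatial=3
print(clips_per_video)                  # 3 clips of 32 frames each
print(chunk_views(clips_per_video, 2))  # [[0, 1], [2]]
```

With the base-diving48 pipeline this gives 32 frames, 1 temporal clip and 3 spatial crops, so each video is evaluated as 3 clips of 32 frames; the product of the factors is the same in whichever order they are written.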