KunLiam opened this issue 1 year ago (status: Open)
The clip-averaging operation is performed in the base head, here. The result of multi-clip inference depends on how rich the temporal information is: if multiple clips share similar information, the improvement will be small.
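In case it helps to see the averaging in code, below is a minimal sketch of what the clip-averaging step in the base head amounts to; the tensor shapes and the softmax-before-mean ('prob') behaviour are assumptions based on a typical MMAction2 head configuration, not a copy of the actual implementation.

    import torch
    import torch.nn.functional as F

    def average_clips(cls_scores: torch.Tensor, num_clips: int) -> torch.Tensor:
        """Fold per-clip scores into one video-level prediction by averaging.

        cls_scores: (batch * num_clips, num_classes) logits, one row per
        sampled clip (crops from ThreeCrop are folded into the same axis).
        """
        num_classes = cls_scores.shape[-1]
        # Group the clips that belong to the same video.
        cls_scores = cls_scores.view(-1, num_clips, num_classes)
        # Softmax each clip first, then average the probabilities.
        probs = F.softmax(cls_scores, dim=-1)
        return probs.mean(dim=1)  # (batch, num_classes)

So with num_clips=10 the ten clip predictions are averaged rather than voted on, which also explains why the results for num_clips=10 and num_clips=1 can look similar when the clips carry similar temporal information.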
Thank you very much for your answer; I asked this question two weeks ago and no one answered, so I asked it again. I also want to ask: since a video may have one or two hundred frames, is it reliable to take only eight frames for testing?
Sorry for the late response. Generally speaking, multi-clip testing brings an improvement, but if the variation along the temporal dimension is small, a single clip of 8 frames with a frame interval of 8 already spans 64 frames, which is enough to infer the result.
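To make the 64-frame span concrete, here is the arithmetic for a single clip (a plain illustration, not the SampleFrames source):

    import numpy as np

    clip_len, frame_interval = 8, 8
    start = 0  # wherever the sampler places the clip inside the video
    frame_inds = start + np.arange(clip_len) * frame_interval
    print(frame_inds)  # [ 0  8 16 24 32 40 48 56]
    # The clip therefore covers a window of clip_len * frame_interval = 64 frames.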
You seem like an expert, because you've solved my training problems before.
I have a question about testing that I'd like to discuss with you. I am using the action recognition code for binary classification of cancer, but a test video may contain many background frames without cancer, so is it feasible to use only 8 frames for testing? What if all 8 sampled frames are background frames?
Thank you so much for answering my question!
I guess that multi-clip inference could be helpful for your project, since it avoids the case where all frames of a single clip are background.
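For reference, switching to multi-clip inference only requires raising num_clips in the test pipeline; the sketch below mirrors the pipeline discussed in this thread, and the exact numbers are placeholders you would tune for your data.

    test_pipeline = [
        # 10 clips of 8 frames sampled at different temporal positions, so at
        # least some clips are likely to land on non-background frames.
        dict(type='SampleFrames', clip_len=8, frame_interval=8,
             num_clips=10, test_mode=True),
        dict(type='RawFrameDecode', **file_client_args),
        dict(type='Resize', scale=(-1, 224)),
        dict(type='ThreeCrop', crop_size=224),
        dict(type='FormatShape', input_format='NCTHW'),
        dict(type='PackActionInputs')
    ]

The per-clip scores are then averaged in the head exactly as sketched earlier in the thread.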
Thank you very much for your answer. I was wondering: why not feed the whole video into the test? Is it because the model can only take 8 or 16 frames at a time?
For example, I changed the data processing in load.py:
import numpy as np
from mmcv.transforms import BaseTransform
from mmaction.registry import TRANSFORMS

# Registering the transform lets the pipeline resolve dict(type='SampleWholeVideo').
@TRANSFORMS.register_module()
class SampleWholeVideo(BaseTransform):
    """Treat every frame of the video as a single clip."""

    def transform(self, results):
        total_frames = results['total_frames']
        frame_inds = np.arange(total_frames)
        results['frame_inds'] = frame_inds.astype(np.int32)
        results['clip_len'] = total_frames
        results['frame_interval'] = 1
        results['num_clips'] = 1
        return results
Then, I modified the test_pipeline accordingly:
test_pipeline = [
    dict(type='SampleWholeVideo'),
    # dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=30, test_mode=True),
    dict(type='RawFrameDecode', **file_client_args),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
After this modification I ran a test and found that the output evaluation metrics seemed wrong: some metrics came out as 0, 1, nan, etc. So can model testing only be done with dict(type='SampleFrames', clip_len=..., frame_interval=..., num_clips=..., test_mode=True)?
Looking forward to your reply!
@cir7 Hi, thank you for your contribution! Recently I have wanted to finetune new models such as VideoMAE V2, Uniformer V2, etc. on my dataset, but I cannot find the training code for these models in MMAction2; they are only provided for testing. How can I solve this problem? I would appreciate it if you could help me! Looking forward to your reply!
Branch
main branch (1.x version, such as v1.0.0, or dev-1.x branch)
Prerequisite
Environment
I have installed all the required environments for the 1.x version.
Describe the bug
First, I list my test_pipeline:
test_pipeline = [
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=10, test_mode=True),
    dict(type='RawFrameDecode', **file_client_args),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='ThreeCrop', crop_size=256),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
Secondly, if num_clips=10 during testing, how is the prediction result computed? Are the 10 clips averaged, or is there a vote? I would like to read the internal test code; in which package can I find it?
Finally, I would appreciate it if you could answer my questions!
Reproduces the problem - code sample
No response
Reproduces the problem - command or script
No response
Reproduces the problem - error message
No response
Additional information
I tried to debug and found that the mmengine package is used during testing, but I couldn't find where the computation actually happens. I found the test_step function in the test_time_aug.py file, and the code is as follows:
However, I added a print statement there and found that nothing was printed when I ran the test. My test command is as follows:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 PORT=29501 tools/dist_test.sh configs/recognition/tpn/tpn-slowonly_imagenet-pretrained-r50_8xb8-8x8x1-150e_kinetics400-rgb.py work_dirs/tpn-slowonly_imagenet-pretrained-r50_8xb8-8x8x1-150e_kinetics400-rgb/best_acc_top1_epoch_63.pth 1
I also found that the test results with num_clips=10 and num_clips=1 were quite similar. Is averaging done somewhere here? Where is the specific code?
I hope you can help solve my problem, thank you!