KunLiam opened this issue 1 year ago (status: Open)
The clip-averaging operation is performed in the base head, here. The result of multi-clip inference depends on how rich the temporal information is: if multiple clips share similar information, the improvement will be small.
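In case it helps to see the averaging in code, below is a minimal sketch of what the clip-averaging step in the base head amounts to; the tensor shapes and the softmax-before-mean ('prob') behaviour are assumptions based on a typical MMAction2 head configuration, not a copy of the actual implementation.

    import torch
    import torch.nn.functional as F

    def average_clips(cls_scores: torch.Tensor, num_clips: int) -> torch.Tensor:
        """Fold per-clip scores into one video-level prediction by averaging.

        cls_scores: (batch * num_clips, num_classes) logits, one row per
        sampled clip (crops from ThreeCrop are folded into the same axis).
        """
        num_classes = cls_scores.shape[-1]
        # Group the clips that belong to the same video.
        cls_scores = cls_scores.view(-1, num_clips, num_classes)
        # Softmax each clip first, then average the probabilities.
        probs = F.softmax(cls_scores, dim=-1)
        return probs.mean(dim=1)  # (batch, num_classes)

So with num_clips=10 the ten clip predictions are averaged rather than voted on, which also explains why the results for num_clips=10 and num_clips=1 can look similar when the clips carry similar temporal information.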
Thank you very much for your answer; I asked this question two weeks ago and no one answered, so I asked it again. I also want to ask: since a video may have one or two hundred frames, is it reliable to take only eight frames for testing?
Sorry for the late response. Generally speaking, multi-clip testing brings an improvement, but if the variation along the temporal dimension is small, a single clip of 8 frames with a frame interval of 8 already spans 64 frames, which is enough to infer the result.
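To make the 64-frame span concrete, here is the arithmetic for a single clip (a plain illustration, not the SampleFrames source):

    import numpy as np

    clip_len, frame_interval = 8, 8
    start = 0  # wherever the sampler places the clip inside the video
    frame_inds = start + np.arange(clip_len) * frame_interval
    print(frame_inds)  # [ 0  8 16 24 32 40 48 56]
    # The clip therefore covers a window of clip_len * frame_interval = 64 frames.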
You seem like an expert, because you've solved my training problems before.
I have a question about testing that I'd like to discuss with you. I am using the action recognition code for binary classification of cancer, but a test video may contain many background frames without cancer, so is it feasible to use only 8 frames for testing? What if all 8 sampled frames are background frames?
Thank you so much for answering my question!
I guess that multi-clip inference could be helpful for your project, since it avoids the case where all frames of a single clip are background.
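For reference, switching to multi-clip inference only requires raising num_clips in the test pipeline; the sketch below mirrors the pipeline discussed in this thread, and the exact numbers are placeholders you would tune for your data.

    test_pipeline = [
        # 10 clips of 8 frames sampled at different temporal positions, so at
        # least some clips are likely to land on non-background frames.
        dict(type='SampleFrames', clip_len=8, frame_interval=8,
             num_clips=10, test_mode=True),
        dict(type='RawFrameDecode', **file_client_args),
        dict(type='Resize', scale=(-1, 224)),
        dict(type='ThreeCrop', crop_size=224),
        dict(type='FormatShape', input_format='NCTHW'),
        dict(type='PackActionInputs')
    ]

The per-clip scores are then averaged in the head exactly as sketched earlier in the thread.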
Thank you very much for your answer. I was wondering: why not feed the whole video into the test? Is it because the model can only take 8 or 16 frames at a time?
For example, I changed the data processing in load.py:
import numpy as np
from mmcv.transforms import BaseTransform
from mmaction.registry import TRANSFORMS

# Registering the transform lets the pipeline resolve dict(type='SampleWholeVideo').
@TRANSFORMS.register_module()
class SampleWholeVideo(BaseTransform):
    """Treat every frame of the video as a single clip."""

    def transform(self, results):
        total_frames = results['total_frames']
        frame_inds = np.arange(total_frames)
        results['frame_inds'] = frame_inds.astype(np.int32)
        results['clip_len'] = total_frames
        results['frame_interval'] = 1
        results['num_clips'] = 1
        return results
Then, I modified the test_pipeline accordingly:
test_pipeline = [
    dict(type='SampleWholeVideo'),
    # dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=30, test_mode=True),
    dict(type='RawFrameDecode', **file_client_args),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
After this modification I ran a test and found that the output evaluation metrics seemed wrong: some metrics came out as 0, 1, nan, etc. So can model testing only be done with dict(type='SampleFrames', clip_len=..., frame_interval=..., num_clips=..., test_mode=True)?
Looking forward to your reply!
@cir7 Hi, thank you for your contribution! Recently I have wanted to finetune new models such as VideoMAE V2, Uniformer V2, etc. on my dataset, but I cannot find the training code for these models in MMAction2; they are only provided for testing. How can I solve this problem? I would appreciate it if you could help me! Looking forward to your reply!
Branch
main branch (1.x version, such as v1.0.0, or dev-1.x branch)
Prerequisite
Environment
I have installed all the required environments for the 1.x version.
Describe the bug
First, I list my test_pipeline:
test_pipeline = [
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=10, test_mode=True),
    dict(type='RawFrameDecode', **file_client_args),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='ThreeCrop', crop_size=256),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
Secondly, if num_clips=10 during testing, how is the prediction result computed? Are the 10 clips averaged, or is there a vote? I would like to read the internal test code; in which package can I find it?
Finally, I would appreciate it if you could answer my questions!
Reproduces the problem - code sample
No response
Reproduces the problem - command or script
No response
Reproduces the problem - error message
No response
Additional information
I tried to debug and found that the mmengine package is used during testing, but I couldn't find where the computation actually happens. I found the test_step function in the test_time_aug.py file, and the code is as follows:
However, I added a print statement there and found that nothing was printed when I ran the test. My test command is as follows:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 PORT=29501 tools/dist_test.sh configs/recognition/tpn/tpn-slowonly_imagenet-pretrained-r50_8xb8-8x8x1-150e_kinetics400-rgb.py work_dirs/tpn-slowonly_imagenet-pretrained-r50_8xb8-8x8x1-150e_kinetics400-rgb/best_acc_top1_epoch_63.pth 1
I also found that the test results with num_clips=10 and num_clips=1 were quite similar. Is averaging done somewhere here? Where is the specific code?
I hope you can help solve my problem, thank you!