swathikirans / GSM

Gate-Shift Networks for Video Action Recognition - CVPR 2020

about the input #10

Usernamezhx closed 4 years ago

Usernamezhx commented 4 years ago

Thanks for your work. I want to know whether there is a theoretical basis for why the double input can improve accuracy.

swathikirans commented 4 years ago

I did not get the question. Can you explain?

Usernamezhx commented 4 years ago

For example:

    for image_segments_cv2 in videos:
        try:
            # Keep at most 8 frames (indices 1..8) when 10 or more are available.
            if len(image_segments_cv2) >= 10:
                image_segments_cv2 = image_segments_cv2[1:9]
            image_segments = [Image.fromarray(img) for img in image_segments_cv2]
            image_segments = transform(image_segments)
            process_data_final = [image_segments, image_segments]    # <-- here: double input
            process_data_final = torch.stack(process_data_final, 0)
            input_var = process_data_final.view(-1, 3, process_data_final.size(2), process_data_final.size(3))
            rst = net(input_var)

swathikirans commented 4 years ago

I am sorry, where is this code snippet from?

Usernamezhx commented 4 years ago

Sorry for the late reply. I just wanted to make sure. The snippet is based on the code here: https://github.com/swathikirans/GSM/blob/43e8ebad5cf1bf2aaca1674a753a89fcba416321/dataset.py#L151

swathikirans commented 4 years ago

This is equivalent to extracting two different sequences of frames from a video. The network predicts the action category for each of the two sequences separately, and the scores are averaged to obtain the final prediction. Since the two sequences contain different information (frames), the chance of predicting the correct category increases.
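
The idea above can be sketched in plain Python. This is only an illustration under stated assumptions, not the repository's code: the names `sample_clip_indices`, `average_clip_scores`, `predict`, and `score_clip` are hypothetical, the sampling scheme (one random frame per equal-length segment) is an assumed choice, and `score_clip` stands in for the network's per-class scores on one sequence.

```python
import random


def sample_clip_indices(num_frames, num_segments, rng):
    """Pick one random frame index from each of num_segments equal-length segments.

    Assumes num_frames >= num_segments. This sampling scheme is an assumption
    made for illustration.
    """
    seg_len = num_frames / num_segments
    indices = []
    for i in range(num_segments):
        idx = int(seg_len * i + rng.random() * seg_len)
        indices.append(min(idx, num_frames - 1))  # guard against float rounding
    return indices


def average_clip_scores(clip_scores):
    """Average per-class scores over clips (one list of class scores per clip)."""
    num_clips = len(clip_scores)
    return [sum(per_class) / num_clips for per_class in zip(*clip_scores)]


def predict(video_frames, score_clip, num_segments=8, num_clips=2, seed=0):
    """Score num_clips differently sampled sequences and average the scores.

    score_clip is a hypothetical stand-in for the network: it maps a list of
    frames to a list of per-class scores.
    """
    rng = random.Random(seed)
    clip_scores = []
    for _ in range(num_clips):
        indices = sample_clip_indices(len(video_frames), num_segments, rng)
        clip = [video_frames[i] for i in indices]
        clip_scores.append(score_clip(clip))
    averaged = average_clip_scores(clip_scores)
    label = max(range(len(averaged)), key=averaged.__getitem__)
    return label, averaged
```

The averaging only adds information when the sequences differ: if the same sequence were scored twice by a deterministic model, the averaged scores would simply equal the single-sequence scores.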

Usernamezhx commented 4 years ago

Thanks for your reply. I get it.