tgc1997 / RMN

IJCAI2020: Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Inference on custom raw video #6

Closed amil-rp-work closed 3 years ago

amil-rp-work commented 3 years ago

Hey @tgc1997, thanks for providing the implementation of such awesome work!

How does one go about using the pre-trained models for inference on raw custom videos?

tgc1997 commented 3 years ago

First, you need to extract the video features and then feed these features to the model.

amil-rp-work commented 3 years ago

About extracting video features: I have extracted I3D features for my video. Which layer of InceptionResNetV2 should be used for extracting frame features? Also, for BUTD, I referred to the issue, which says the script was modified. Is the new script sufficient for use?

tgc1997 commented 3 years ago
  1. IRV2 extractor:

    import torch
    import torch.nn as nn

    # InceptionResNetV2 and opt (with opt.IRV2_checkpoint) are not defined in this
    # snippet; they are assumed to be provided elsewhere.
    class AppearanceEncoder_inceptionresnetv2(nn.Module):
        def __init__(self):
            super(AppearanceEncoder_inceptionresnetv2, self).__init__()
            IRV2 = InceptionResNetV2(num_classes=1001)
            IRV2.load_state_dict(torch.load(opt.IRV2_checkpoint))
            modules = list(IRV2.children())[:-1]  # delete the last fc layer.
            self.IRV2 = nn.Sequential(*modules)

        def forward(self, images):
            """Extract feature vectors from input images."""
            with torch.no_grad():
                features = self.IRV2(images)
            # Flatten to (batch_size, feature_dim).
            features = features.reshape(features.size(0), -1)
            return features

    Image preprocessing:

    image -= np.array([0.5, 0.5, 0.5])
    image /= np.array([0.5, 0.5, 0.5])

  2. About BUTD: just feed your image to the pre-trained BUTD model and you will get 36 region features. What I modified is the preprocessing of MSRVTT and MSVD and the saving of the extracted features to an h5 file; the core code remains unchanged.
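
The image preprocessing from step 1 can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the frame is an RGB uint8 array that is first scaled to [0, 1] (the snippet above does not state the input range), and the model expects channels-first input. `preprocess_frame` is a hypothetical helper name, and resizing to the network's input size is left out.

```python
import numpy as np

def preprocess_frame(frame_uint8):
    """Normalize an RGB uint8 frame of shape (H, W, 3) to [-1, 1], channels first."""
    # Assumption: pixel values are scaled to [0, 1] before normalization.
    image = frame_uint8.astype(np.float32) / 255.0
    # Same mean/std normalization as in the snippet above.
    image -= np.array([0.5, 0.5, 0.5], dtype=np.float32)
    image /= np.array([0.5, 0.5, 0.5], dtype=np.float32)
    # HWC -> CHW, the layout PyTorch convolution layers expect.
    return np.transpose(image, (2, 0, 1))

# Example with a random stand-in frame (not real video data).
frame = np.random.randint(0, 256, size=(299, 299, 3), dtype=np.uint8)
x = preprocess_frame(frame)
```

The resulting array can be wrapped with `torch.from_numpy(x).unsqueeze(0)` to form a batch of one before passing it to the encoder.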

monic7 commented 3 years ago

Hi tgc

Great work! Thanks for sharing. Could you share your code for feature extraction using I3D, IRV2, and BUTD? I'm not sure how to extract these video features.

Thank you