microsoft / VideoX

VideoX: a collection of video cross-modal models

[XCLIP] How is FLOPs computed #66

Closed taoyang1122 closed 2 years ago

taoyang1122 commented 2 years ago

Thanks for the great work. I have some questions about how the FLOPs are computed.

  1. Do the FLOPs reported in Table 1 include the computation of the text encoder?
  2. Could you please release the script used to compute the FLOPs? It would be very helpful.

Thanks.

nbl97 commented 2 years ago

Thanks for your interest in our work ^^

  1. Yes, the FLOPs are reported on one (video, text) pair.
  2. We use fvcore to measure the FLOPs. The snippet looks roughly like this:
    import torch
    from fvcore.nn import FlopCountAnalysis

    model = xclip.load(...)
    F, H, W = 8, 224, 224  # example values (assumption); match your config
    video_tensor = torch.randn(1, F, 3, H, W)  # one video of F RGB frames
    text_tensor = torch.zeros(1, 77, dtype=torch.long)  # one tokenized sentence
    flops = FlopCountAnalysis(model, (video_tensor, text_tensor))
    print(flops.total())

taoyang1122 commented 2 years ago

Thanks for your reply.

  1. Would you mind explaining why the shape of text_tensor is [1, 77]?
  2. Do we need only one (video, text) pair to run inference on a video? I thought we would need to compute the similarity between the video and every possible label, then take the label with the highest similarity as the prediction. Is my understanding correct? Thanks for your help.

nbl97 commented 2 years ago

  1. We measure the FLOPs of the video encoder on one video and those of the text encoder on one sentence, so the first dimension is 1. CLIP's default maximum text length is 77 tokens, so the second dimension is 77.
  2. In practice, the text embeddings of all possible labels can be computed and stored in advance, so only the video needs to be encoded at inference time.
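
The caching idea in point 2 can be illustrated with a small, framework-free sketch. It is not the X-CLIP implementation: `CountingEncoder`, the label names, the clip IDs, and the 3-dimensional embeddings are all hypothetical stand-ins, chosen only to show that the text encoder runs once per label while each new video costs a single video-encoder pass plus a few dot products.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class CountingEncoder:
    """Hypothetical stand-in for a real encoder; records how often it is called."""

    def __init__(self, table):
        self.table = table
        self.calls = 0

    def __call__(self, key):
        self.calls += 1
        return normalize(self.table[key])


# Toy embeddings, made up for illustration (not real model outputs).
text_encoder = CountingEncoder({
    "archery": [1.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0],
    "juggling": [0.0, 0.0, 1.0],
})
video_encoder = CountingEncoder({
    "clip_a": [0.1, 0.9, 0.2],
    "clip_b": [0.8, 0.1, 0.3],
})

labels = ["archery", "bowling", "juggling"]

# Done once, ahead of time: one text-encoder pass per label.
cached_text = {label: text_encoder(label) for label in labels}


def classify(video_id):
    """Encode the video once, then compare against the cached label embeddings."""
    v = video_encoder(video_id)
    scores = {
        label: sum(a * b for a, b in zip(v, t))
        for label, t in cached_text.items()
    }
    return max(scores, key=scores.get)


predictions = [classify(video_id) for video_id in ["clip_a", "clip_b"]]
print(predictions)
print("text encoder calls:", text_encoder.calls)
print("video encoder calls:", video_encoder.calls)
```

Under this setup the text-encoder cost is paid once per label set rather than once per video, which is consistent with reporting FLOPs for a single (video, text) pair.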

Hope this can help you.

nbl97 commented 2 years ago

I'm closing this issue, but pls feel free to ping me if there are further questions.