Closed taoyang1122 closed 2 years ago
Thanks for your interest in our work ^^
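import torch
from fvcore.nn import FlopCountAnalysis

model = xclip.load(...)  # arguments elided in the original reply
# F, H, W = number of frames, frame height, frame width (set these for your config)
video_tensor = torch.randn(1, F, 3, H, W)  # dummy video of shape [1, F, 3, H, W]
text_tensor = torch.zeros(1, 77, dtype=torch.long)  # dummy input of shape [1, 77], assumed to be token IDs
flops = FlopCountAnalysis(model, (video_tensor, text_tensor))
print(flops.total())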
Thanks for your reply.
- Would you mind explaining why the shape of text_tensor is [1, 77]?
- Do we only need one (video, text) pair for inference on a video? I thought we might need to compute the similarity between the video and every candidate label, taking the label with the highest similarity as the prediction. Is my understanding correct? Thanks for your help.
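The inference scheme described in the second question can be sketched as follows. This is only an illustration of the asker's understanding, not a confirmed description of how X-CLIP runs inference: the `cosine` and `predict` helpers, the label names, and the 3-dimensional embeddings are all hypothetical toy values rather than model outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict(video_emb, label_embs):
    """Score one video embedding against every label embedding
    and return the best-scoring label along with all scores."""
    scores = {label: cosine(video_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (made up for illustration, not produced by any model).
video_emb = [0.9, 0.1, 0.0]
label_embs = {
    "archery": [1.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0],
    "surfing": [0.0, 0.0, 1.0],
}

best, scores = predict(video_emb, label_embs)
print(best)  # archery
```

If inference does work this way, the text side would be evaluated for every candidate label (for example a `[K, 77]` input for `K` labels), so a FLOPs figure measured with a single `[1, 77]` text input would cover only one (video, text) pair.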
Hope this can help you.
I'm closing this issue, but please feel free to ping me if you have further questions.
Thanks for the great work. I have some questions about how the FLOPs are computed.
Thanks.