Thank you for taking interest in our work!
The text data is the output of the CLIP tokenizer, and the video data is the set of frames sampled from the original video and resized to 224x224 in our code.
The dimensions of text_features and video_features are [number of texts, 512] and [number of videos, number of frames, 512], respectively, where 'number of texts' and 'number of videos' are both equal to the batch size in our code. The features are the embeddings of the texts and the video frames in the CLIP latent space.
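For reference, here is a minimal shape-checking sketch, assuming the OpenAI CLIP package with a ViT-B/32 backbone and PyTorch; the batch size, frame count, prompt text, and the dummy video tensor are purely illustrative and not taken from the repository:

```python
# Minimal sketch of the shapes described above (assumes the OpenAI CLIP
# package, ViT-B/32 backbone). Batch size and frame count are illustrative.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

batch_size, num_frames = 4, 12

# Text: CLIP tokenizer output, shape [number of texts, 77]
text_data = clip.tokenize(["a person playing guitar"] * batch_size).to(device)

# Video: dummy stand-in for frames sampled from the video and resized to
# 224x224, shape [number of videos, number of frames, 3, 224, 224]
video_data = torch.randn(batch_size, num_frames, 3, 224, 224, device=device)

with torch.no_grad():
    text_features = model.encode_text(text_data)           # [number of texts, 512]

    # Encode each frame as an image by flattening the frame dimension,
    # then restore it afterwards.
    frames = video_data.view(batch_size * num_frames, 3, 224, 224)
    frame_features = model.encode_image(frames)             # [videos * frames, 512]
    video_features = frame_features.view(batch_size, num_frames, -1)

print(text_features.shape)   # torch.Size([4, 512])
print(video_features.shape)  # torch.Size([4, 12, 512])
```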
Thank you very much for your reply. Thanks again!
Thank you very much for sharing the code. Because my device cannot run it, I would like to ask what the dimensions of the original text and video data are and what they mean:

text_data = data['text']
video_data = data['video']

Also, for

text_features = self.clip.encode_text(text_data)
video_features = self.clip.encode_image(video_data)

what are the resulting dimensions of text_features and video_features, and what do they represent? Looking forward to your reply.