microsoft / XPretrain

Multi-modality pre-training

Hi, how to understand the LF-hdvila-8m? #38

Open sunwhw opened 7 months ago

sunwhw commented 7 months ago

Is each line in 'lfvila8m_clipid.jsonl' a pair of video clips and a sentence? I also see a varying number of video clips per row. How were the video clips in 'lfvila8m_clipid.jsonl' split out from the original 'hdvila_clip_text_100m.jsonl'? Beyond selecting videos with more than 4 clips, as mentioned in the paper, are there any other details?
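For anyone who wants to check the row layout directly, here is a minimal sketch that counts how many top-level items each row holds. It assumes only that every non-empty line of the file is one valid JSON value; the actual schema of 'lfvila8m_clipid.jsonl' is not assumed, and the function name `summarize_jsonl` and its `max_rows` parameter are made up for illustration.

```python
import json
from collections import Counter


def summarize_jsonl(path, max_rows=None):
    """Return a Counter mapping "items per row" -> "number of rows".

    Hypothetical helper: it makes no assumption about the row schema
    beyond each non-empty line being a single JSON value.
    """
    sizes = Counter()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Stop once max_rows lines have been read (blank lines included).
            if max_rows is not None and i >= max_rows:
                break
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Lists and dicts report their length; any scalar counts as 1.
            n = len(record) if isinstance(record, (list, dict)) else 1
            sizes[n] += 1
    return sizes
```

Calling `summarize_jsonl("lfvila8m_clipid.jsonl", max_rows=1000)` on a local copy of the file would show the distribution of row sizes, which should make the varying number of clips per row visible.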

GXYM commented 6 months ago

Where can I find the annotation file containing the video captions, 'hdvila_clip_text_100m.jsonl'? Thanks.