microsoft / XPretrain

Multi-modality pre-training

Captions for HD-ViLA-100M #20

Closed (hanoonaR closed 1 year ago)

hanoonaR commented 1 year ago

Hi,

First, thank you for your interesting work.

Could you please share more information on how the captions for HD-VILA were generated using ASR? The paper explains that the ASR-generated captions are post-processed by an off-the-shelf punctuator. If you could provide access to the generated captions (as in CLIP-ViP), or more details on which ASR technology was used, that would be really helpful for using the dataset.

Thank you.

bei21 commented 1 year ago

@hanoonaR Please refer to #6 for the transcripts of HD-VILA-100M. Thanks.