Captions for HD-ViLA-100M

Hi,

First, thank you for your interesting work.

Could you please share more information on how the captions for HD-ViLA were generated using ASR? The paper explains that the ASR-generated captions are post-processed by an off-the-shelf punctuator. If you could provide access to the generated captions (as in CLIP-ViP), or more details on which ASR technology was used, that would be really helpful for using the dataset.

Thank you.