Hi,
First, thank you for your interesting work.
Could you please share more information on how the captions for HD-ViLA were generated using ASR? The paper explains that the ASR-generated captions are post-processed by an off-the-shelf punctuator. If you could provide access to the generated captions (as in CLIP-ViP), or more details on which ASR technology was used, that would be really helpful for using the dataset.
Thank you.