A method to generate visual scene graphs conditioned on images paired with iterative spoken utterances, produced alongside the VG-SPICE dataset.
[!NOTE] The code release is still being prepared.
The repository for generating the dataset can be found here.
[!NOTE] Generating the dataset requires first obtaining the Visual Genome dataset.
The pregenerated dataset and the cleaned VG-SPICE-C challenge test subset, as used in this paper, can be downloaded here.
The CHiME5 dataset used for noise augmentation can be downloaded here.
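Until the code release is available, the following is a minimal sketch of how a noise recording could be mixed into a spoken utterance at a chosen signal-to-noise ratio. It is not the pipeline used in the paper: the function name `mix_at_snr`, its parameters, and the random-crop strategy are all assumptions made for illustration, and loading the CHiME5 audio into arrays is left to the reader.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db, rng=None):
    """Hypothetical helper: add `noise` to `speech` at a target SNR in dB.

    Both inputs are 1-D arrays of samples at the same sample rate.
    """
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise if it is shorter than the utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)

    # Take a random crop of the noise with the same length as the speech.
    start = rng.integers(0, len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(segment ** 2)
    if speech_power == 0 or noise_power == 0:
        # Nothing sensible to scale against; return the speech unchanged.
        return speech.copy()

    # Choose the scale so that
    # speech_power / (scale**2 * noise_power) == 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * segment
```

Lower `snr_db` values produce noisier mixtures; passing a seeded `rng` makes the crop reproducible.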
If our code, dataset, or methodology is useful, please consider citing our work.
@misc{voas2024multimodal,
title={Multimodal Contextualized Semantic Parsing from Speech},
author={Jordan Voas and Raymond Mooney and David Harwath},
year={2024},
eprint={2406.06438},
archivePrefix={arXiv},
primaryClass={cs.CL}
}