AViD-SP

Paper link here

A method to generate visual scene graphs conditioned on images paired with iterative spoken utterances, produced alongside the VG-SPICE dataset.

[!NOTE] Currently working on code release preparation.

TODO

[ ] Upload code
[ ] Link dataset download.

Dataset

The repository for generating the dataset can be found here.

[!NOTE] Generating the dataset requires first obtaining the Visual Genome dataset.

The pregenerated dataset and the cleaned VG-SPICE-C challenge test subset, as used in the paper, can be downloaded here.

The CHiME5 dataset used for noise augmentation can be downloaded here.

Citation

If our code, dataset, or methodology is useful, please consider citing our work.

@misc{voas2024multimodal,
      title={Multimodal Contextualized Semantic Parsing from Speech}, 
      author={Jordan Voas and Raymond Mooney and David Harwath},
      year={2024},
      eprint={2406.06438},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
