jvoas655 / AViD-SP

A method to generate visual scene graphs conditioned on images paired with iterative spoken utterances, produced alongside the VG-SPICE dataset.
GNU General Public License v3.0

AViD-SP

Paper link here

[!NOTE] Currently working on code release preparation.

TODO

Dataset

The repository for generating the dataset can be found here.

[!NOTE] Generating the dataset requires first obtaining the Visual Genome dataset.

The pregenerated dataset and the cleaned VG-SPICE-C challenge test subset, as used in the paper, can be downloaded here.

The CHiME5 dataset used for noise augmentation can be downloaded here.

Citation

If our code, dataset, or methodology is useful, please consider citing our work.

@misc{voas2024multimodal,
      title={Multimodal Contextualized Semantic Parsing from Speech}, 
      author={Jordan Voas and Raymond Mooney and David Harwath},
      year={2024},
      eprint={2406.06438},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}