# Welcome to Griffon

This is the official repo of the Griffon series (v1 & v2). Griffon is the first high-resolution (over 1K) LVLM capable of localizing everything you are interested in and describing the region you specify. The latest version supports visual-language co-referring: you can refer to a target with either an image or a text description. Griffon achieves excellent performance in referring expression comprehension (REC), object detection, object counting, visual/phrase grounding, and referring expression generation (REG).

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

📕Paper 🌀Usage 🤗Model

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

📕Paper

Griffon-G with More General, More Tasks, and Better Performance!

Coming in a few days!

News

[x] 2024.07.01 🔥Griffon has been accepted to ECCV 2024.
[x] 2024.03.15 🔥Griffon v2's paper has been released on 📕arXiv.
[x] 2024.03.11 🔥We are excited to announce the arrival of Griffon v2. Griffon v2 brings fine-grained perception performance to new heights with high-resolution, expert-level detection and counting, and supports visual-language co-referring. Take a look at our demo first. Paper, code, demos, and models will be released soon.
[x] 2023.12.13 🔥Ready to release the Language-prompted Localization Dataset on 🤗HuggingFace after final approval.
[x] 2023.12.06 🔥Released the inference code and model on 🤗HuggingFace.
[x] 2023.11.29 🔥Paper has been released on 📕arXiv.

What can Griffon do now?

Griffon v2 can now perform localization from free-form text inputs and from visual target inputs given as locally cropped images, supporting REC, object detection, object counting, visual/phrase grounding, and REG. More quantitative evaluation results can be found in our paper.

Acknowledgement

LLaVA provides the base codes and pre-trained models.
Shikra provides insight into how to organize datasets, along with some base processed annotations.
Llama provides the large language model.
volgachen provides the basic environment configuration.

Citation

If you find Griffon useful for your research and applications, please cite using this BibTeX:

@misc{zhan2023griffon,
      title={Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models}, 
      author={Yufei Zhan and Yousong Zhu and Zhiyang Chen and Fan Yang and Ming Tang and Jinqiao Wang},
      year={2023},
      eprint={2311.14552},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{zhan2024griffon,
      title={Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring}, 
      author={Yufei Zhan and Yousong Zhu and Hongyin Zhao and Fan Yang and Ming Tang and Jinqiao Wang},
      year={2024},
      eprint={2403.09333},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

The data and checkpoints are licensed for research use only. All of them are also restricted to uses that follow the license agreements of LLaVA, LLaMA, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.