microsoft / SoM

Set-of-Mark Prompting for GPT-4V and LMMs
MIT License

Set-of-Mark Visual Prompting for GPT-4V

:grapes: [Read our arXiv Paper]   :apple: [Project Page]

Jianwei Yang*⚑, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao

* Core Contributors      ⚑ Project Lead

Introduction

We present Set-of-Mark (SoM) prompting: simply overlaying a number of spatial and speakable marks on an image unleashes the visual grounding abilities of the strongest LMM, GPT-4V. Let's use visual prompting for vision!

method2_xyz
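
As a rough illustration of what overlaying marks involves, the sketch below assigns a numeric mark to each region mask and picks a pixel position for its label. It is a minimal, dependency-free sketch and not the repository's implementation: the names `mask_centroid` and `assign_marks` are hypothetical, masks are plain nested lists, and the centroid is only a simple stand-in for whatever placement rule the toolbox actually applies.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary mask: 1 = pixel belongs to the region


def mask_centroid(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) centroid of the nonzero pixels, floored to ints.

    Assumption: the centroid is an acceptable label position. For concave
    regions it can fall outside the mask, so a real tool may need another rule.
    """
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        raise ValueError("empty mask")
    n = len(coords)
    return (sum(r for r, _ in coords) // n, sum(c for _, c in coords) // n)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Map mark IDs 1..N (in input order) to the label position of each region."""
    return {i + 1: mask_centroid(m) for i, m in enumerate(masks)}


if __name__ == "__main__":
    square = [[1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 0, 0]]
    dot = [[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 1]]
    print(assign_marks([square, dot]))  # {1: (1, 1), 2: (3, 4)}
```

The returned dictionary is what a drawing step would consume: each key is the number to render, each value the pixel at which to render it.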

GPT-4V + SoM Demo

https://github.com/microsoft/SoM/assets/3894247/8f827871-7ebd-4a5e-bef5-861516c4427b

🔥 News

🔗 Fascinating Applications

Fascinating applications of SoM in GPT-4V:

🔗 Related Works

Our method compiles the following models to generate the set of marks:

We are standing on the shoulders of the giant GPT-4V (playground)!

:rocket: Quick Start

# install SEEM
pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
# install SAM
pip install git+https://github.com/facebookresearch/segment-anything.git
# install Semantic-SAM
pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
# install Deformable Convolution for Semantic-SAM
cd ops && bash make.sh && cd ..

# common error fix:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'

# download the pretrained checkpoints
sh download_ckpt.sh

# run the demo
python demo_som.py

And you will see this interface:

som_toolbox

Deploy to AWS

To deploy SoM to EC2 on AWS via Github Actions:

  1. Fork this repository and clone your fork to your local machine.
  2. Follow the instructions at the top of deploy.py.

:point_right: Comparing standard GPT-4V and its combination with SoM Prompting

teaser_github

:round_pushpin: SoM Toolbox for image partition

method3_xyz

Users can select the granularity of the generated masks and switch between automatic (top) and interactive (bottom) modes. A higher alpha blending value (0.4) is used for better visualization.
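
The alpha blending mentioned above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the toolbox's code: `blend_pixel` and `overlay_mask` are hypothetical names, pixels are RGB tuples in nested lists, and the blend is the standard convex combination `(1 - alpha) * pixel + alpha * color` applied only where the mask is set.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def blend_pixel(pixel: Pixel, color: Pixel, alpha: float = 0.4) -> Pixel:
    """Blend one RGB pixel toward `color`; alpha=0 keeps the pixel, alpha=1 replaces it."""
    return tuple(round((1 - alpha) * p + alpha * c) for p, c in zip(pixel, color))


def overlay_mask(image: List[List[Pixel]], mask: List[List[int]],
                 color: Pixel, alpha: float = 0.4) -> List[List[Pixel]]:
    """Return a copy of `image` with `color` blended in wherever `mask` is nonzero."""
    return [
        [blend_pixel(px, color, alpha) if m else px
         for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    gray = (100, 100, 100)
    image = [[gray, gray]]
    mask = [[1, 0]]  # only the left pixel belongs to the region
    print(overlay_mask(image, mask, color=(255, 0, 0)))
    # [[(162, 60, 60), (100, 100, 100)]]
```

A larger alpha makes each region's color dominate the underlying image, which is why a higher value makes the masks easier to see at the cost of hiding some image detail.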

:unicorn: Interleaved Prompt

SoM enables interleaved prompts which include textual and visual content. The visual content can be represented using the region indices.

Screenshot 2023-10-18 at 10 06 18
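
One way to picture an interleaved prompt is as a sequence of text pieces and region indices that gets rendered into a single string. The sketch below is only an assumption about how that could look: the bracket notation `[3]` and the helper names `render_prompt` and `referenced_marks` are hypothetical and not taken from the repository.

```python
import re
from typing import List, Sequence, Set, Union


def render_prompt(segments: Sequence[Union[str, int]], valid_marks: Set[int]) -> str:
    """Join text pieces (str) and region references (int mark IDs) into one prompt.

    Raises ValueError if a referenced mark was never drawn on the image.
    """
    parts = []
    for seg in segments:
        if isinstance(seg, int):
            if seg not in valid_marks:
                raise ValueError(f"mark {seg} is not drawn on the image")
            parts.append(f"[{seg}]")  # assumed notation for a region index
        else:
            parts.append(seg)
    return "".join(parts)


def referenced_marks(answer: str) -> List[int]:
    """Extract mark IDs written as [n] from a model answer, in order of appearance."""
    return [int(m) for m in re.findall(r"\[(\d+)\]", answer)]


if __name__ == "__main__":
    prompt = render_prompt(["What is on top of ", 2, " and left of ", 5, "?"], {1, 2, 5})
    print(prompt)  # What is on top of [2] and left of [5]?
    print(referenced_marks("The laptop is [2]; the lamp is [5]."))  # [2, 5]
```

Validating indices before sending the prompt avoids asking the model about a region that has no visible mark, and parsing indices out of the answer lets the caller map the response back to masks.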

:medal_military: Mark types used in SoM

method4_xyz

:volcano: Evaluation tasks examples

Screenshot 2023-10-18 at 10 12 18

Use case

:tulip: Grounded Reasoning and Cross-Image Reference

Screenshot 2023-10-18 at 10 10 41

Compared with GPT-4V without SoM, adding marks enables GPT-4V to ground its reasoning in the detailed contents of the image (left). Clear cross-image object references are observed on the right.

:camping: Problem Solving

Screenshot 2023-10-18 at 10 18 03

Case study on solving a CAPTCHA. Without SoM, GPT-4V gives a wrong answer with the wrong number of squares; after SoM prompting, it finds the correct squares by their corresponding marks.

:mountain_snow: Knowledge Sharing

Screenshot 2023-10-18 at 10 18 44

Case study on an image of a dish. With the original image, GPT-4V does not produce a grounded answer. With SoM prompting, GPT-4V not only names the ingredients but also maps them to the corresponding regions.

:mosque: Personalized Suggestion

Screenshot 2023-10-18 at 10 19 12

SoM-prompted GPT-4V gives very precise suggestions, while the original one fails and even hallucinates foods, e.g., soft drinks.

:blossom: Tool Usage Instruction

Screenshot 2023-10-18 at 10 19 39

Likewise, GPT-4V with SoM can provide thorough tool usage instructions, teaching users the function of each button on a controller. Note that this image is not fully labeled, yet GPT-4V can also provide information about the non-labeled buttons.

:sunflower: 2D Game Planning

Screenshot 2023-10-18 at 10 20 03

GPT-4V with SoM gives a reasonable suggestion on how to achieve a goal in a gaming scenario.

:mosque: Simulated Navigation

Screenshot 2023-10-18 at 10 21 24

:deciduous_tree: Results

We conduct experiments on various vision tasks to verify the effectiveness of SoM. Results show that GPT-4V + SoM outperforms specialist models on most vision tasks and is comparable to MaskDINO on COCO panoptic segmentation.

main_results

:black_nib: Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{yang2023setofmark,
      title={Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V}, 
      author={Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chunyuan Li and Jianfeng Gao},
      journal={arXiv preprint arXiv:2310.11441},
      year={2023},
}