
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Apache License 2.0

πŸ“· EVF-SAM

Early Vision-Language Fusion for Text-Prompted Segment Anything Model

[Yuxuan Zhang](https://github.com/CoderZhangYx)1,\*, [Tianheng Cheng](https://scholar.google.com/citations?user=PH8rJHYAAAAJ&hl=zh-CN)1,\*, Lei Liu2, Heng Liu2, Longjin Ran2, Xiaoxin Chen2, [Wenyu Liu](http://eic.hust.edu.cn/professor/liuwenyu)1, [Xinggang Wang](https://xwcv.github.io/)1,πŸ“§ 1 Huazhong University of Science and Technology, 2 vivo AI Lab (\* equal contribution, πŸ“§ corresponding author) [![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2406.20076) [![πŸ€— HuggingFace models](https://img.shields.io/badge/HuggingFaceπŸ€—-Models-orange)](https://huggingface.co/YxZhang/) [![πŸ€— HuggingFace Demo](https://img.shields.io/badge/EVF_SAM-πŸ€—_HF_Demo-orange)](https://huggingface.co/spaces/wondervictor/evf-sam) [![πŸ€— HuggingFace Demo](https://img.shields.io/badge/EVF_SAM_2-πŸ€—_HF_Demo-orange)](https://huggingface.co/spaces/wondervictor/evf-sam2) [![colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hustvl/EVF-SAM/blob/main/inference_image.ipynb)

News

We have extended EVF-SAM to the more powerful SAM-2. Besides improvements in image prediction, the new model also performs well on video prediction (powered by SAM-2). With only a simple image-level training process on RES datasets, EVF-SAM shows zero-shot text-prompted video segmentation capability. Try our code!

Highlight

Updates

Visualization

| Input text | Input image | Output |
| :-- | :-: | :-: |
| "zebra top left" | | |
| "a pizza with a yellow sign on top of it" | | |
| "the broccoli closest to the ketchup bottle" | | |
| "bus going to south common" | | |
| "3carrots in center with ice and greenn leaves" | | |

Installation

  1. Clone this repository.
  2. Install PyTorch for your CUDA version. Note that torch>=2.0.0 is required to use SAM-2, and torch>=2.2 is required to enable flash-attention. (We use torch==2.0.1 with CUDA 11.7 and it works fine; a consolidated sketch of these steps follows the list.)
  3. pip install -r requirements.txt
  4. If you want to use the video prediction function, run:
    cd model/segment_anything_2
    python setup.py build_ext --inplace
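
For convenience, here is a minimal end-to-end sketch of the steps above, assuming CUDA 11.7 and the torch==2.0.1 configuration reported to work; adjust the PyTorch versions and index URL for your setup.

```bash
# Consolidated installation sketch (assumes CUDA 11.7; adjust versions/index URL as needed).
git clone https://github.com/hustvl/EVF-SAM.git
cd EVF-SAM

# PyTorch for CUDA 11.7 (torch>=2.0.0 needed for SAM-2, torch>=2.2 for flash-attention).
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu117

# Project dependencies.
pip install -r requirements.txt

# Only needed for video prediction: build the SAM-2 extension in place.
cd model/segment_anything_2
python setup.py build_ext --inplace
cd ../..
```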

Weights

| Name | SAM | BEIT-3 | Params | Prompt Encoder & Mask Decoder | Reference Score |
| :-- | :-- | :-- | :-- | :-- | :-- |
| EVF-SAM2 | SAM-2-L | BEIT-3-L | 898M | freeze | 83.6 |
| EVF-SAM | SAM-H | BEIT-3-L | 1.32B | train | 83.7 |
| EVF-Effi-SAM-L | EfficientSAM-S | BEIT-3-L | 700M | train | 83.5 |
| EVF-Effi-SAM-B | EfficientSAM-T | BEIT-3-B | 232M | train | 80.0 |
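
The released checkpoints are hosted on the Hugging Face Hub (see the badge above). Below is a hedged sketch for downloading one locally with huggingface-cli; the repo id YxZhang/evf-sam2 matches the inference example further down, and the local directory name is an arbitrary choice.

```bash
# Download the EVF-SAM2 checkpoint from the Hugging Face Hub.
# The target directory ./checkpoints/evf-sam2 is an arbitrary choice.
pip install -U "huggingface_hub[cli]"
huggingface-cli download YxZhang/evf-sam2 --local-dir ./checkpoints/evf-sam2
```

Alternatively, as in the inference example below, you can pass the Hub repo id directly to --version.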

Inference

1. image prediction

python inference.py  \
  --version <path to evf-sam> \
  --precision='fp16' \
  --vis_save_path "<path to your output directory>" \
  --model_type <"ori" or "effi" or "sam2", depending on your loaded ckpt>   \
  --image_path <path to your input image> \
  --prompt <customized text prompt>

--load_in_8bit and --load_in_4bit are optional.
For example:

python inference.py  \
  --version YxZhang/evf-sam2 \
  --precision='fp16' \
  --vis_save_path "vis" \
  --model_type sam2   \
  --image_path "assets/zebra.jpg" \
  --prompt "zebra top left"

2. video prediction

First, slice the video into frames:

ffmpeg -i <your_video>.mp4 -q:v 2 -start_number 0 <frame_dir>/'%05d.jpg'

Then run:

python inference_video.py  \
  --version <path to evf-sam2> \
  --precision='fp16' \
  --vis_save_path "vis/" \
  --image_path <frame_dir>   \
  --prompt <customized text prompt>   \
  --model_type sam2

You can use frame2video.py to concatenate the predicted frames into a video.
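
If you prefer not to use frame2video.py, ffmpeg (already used above for slicing) can stitch the predicted frames back together. This is a hedged alternative: the 25 fps frame rate, the output name, and the vis/ input pattern are assumptions; match them to the original video and to the filenames actually written by inference_video.py.

```bash
# Hedged sketch: reassemble predicted frames into a video with ffmpeg.
# Frame rate, input pattern, and output name are assumptions; adjust to your data.
ffmpeg -framerate 25 -i vis/'%05d.jpg' -c:v libx264 -pix_fmt yuv420p output.mp4
```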

Demo

image demo

python demo.py <path to evf-sam>

video demo

python demo_video.py <path to evf-sam2>

Data preparation

Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF (saiapr_tc-12), and the COCO train2014 images

β”œβ”€β”€ dataset
β”‚   β”œβ”€β”€ refer_seg
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”‚   β”œβ”€β”€ saiapr_tc-12
β”‚   β”‚   β”‚   └── mscoco
β”‚   β”‚   β”‚       └── images
β”‚   β”‚   β”‚           └── train2014
β”‚   β”‚   β”œβ”€β”€ refclef
β”‚   β”‚   β”œβ”€β”€ refcoco
β”‚   β”‚   β”œβ”€β”€ refcoco+
β”‚   β”‚   └── refcocog
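
A small hedged sketch for creating this layout before dropping in the data; it only creates the folders, and the datasets themselves (refCLEF/refCOCO/refCOCO+/refCOCOg annotations, saiapr_tc-12, COCO train2014 images) must be downloaded from their official sources.

```bash
# Create the expected dataset layout (folders only; download the data separately).
mkdir -p dataset/refer_seg/images/saiapr_tc-12
mkdir -p dataset/refer_seg/images/mscoco/images/train2014
mkdir -p dataset/refer_seg/refclef dataset/refer_seg/refcoco \
         dataset/refer_seg/refcoco+ dataset/refer_seg/refcocog
```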

Evaluation

torchrun --standalone --nproc_per_node <num_gpus> eval.py   \
    --version <path to evf-sam> \
    --dataset_dir <path to your data root>   \
    --val_dataset "refcoco|unc|val" \
    --model_type <"ori" or "effi" or "sam2", depending on your loaded ckpt>
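
To evaluate one checkpoint on several splits in a row, you can loop over --val_dataset values. Only "refcoco|unc|val" is taken from the command above; the other split names follow the common refCOCO/LISA-style naming and are assumptions that may need adjusting to what eval.py actually accepts.

```bash
# Hedged sketch: evaluate on several referring-segmentation splits (8 GPUs assumed).
# Only "refcoco|unc|val" appears above; the other split names are assumptions.
for split in "refcoco|unc|val" "refcoco|unc|testA" "refcoco|unc|testB" \
             "refcoco+|unc|val" "refcocog|umd|val"; do
  torchrun --standalone --nproc_per_node 8 eval.py \
    --version YxZhang/evf-sam2 \
    --dataset_dir ./dataset \
    --val_dataset "$split" \
    --model_type sam2
done
```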

Acknowledgement

We borrow some code from LISA, unilm, SAM, EfficientSAM, and SAM-2.

Citation

@article{zhang2024evfsamearlyvisionlanguagefusion,
      title={EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model}, 
      author={Yuxuan Zhang and Tianheng Cheng and Rui Hu and Lei Liu and Heng Liu and Longjin Ran and Xiaoxin Chen and Wenyu Liu and Xinggang Wang},
      year={2024},
      eprint={2406.20076},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.20076}, 
}