hustvl / EVF-SAM

Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Apache License 2.0

Implement EVFSam with segment anything v2 #4

Closed by BeerMaster228 3 months ago

BeerMaster228 commented 3 months ago

Do you think it would be a simple task to integrate https://github.com/facebookresearch/segment-anything-2 instead of previous version? If help is needed, I can collaborate.

CoderZhangYx commented 3 months ago

Yes, we are currently working on that. The new checkpoint based on SAM-2 is being trained. We will release the model, inference code, and a demo as soon as possible. Thanks for your attention!

CoderZhangYx commented 3 months ago

We now support SAM-2! Try our code, and let us know if you have any questions using our model!

BeerMaster228 commented 3 months ago

> We have supported SAM-2! Try our code! Tell me if you have any questions using our model!

Thanks for the quick integration with SAM2! What would you recommend for the scenario where multiple objects in the image fit the textual description? Currently the model segments only one of them. My workaround is to inpaint over each segmented object and then prompt the model again until no matches remain, but this solution is rather slow and has many edge cases.
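The inpaint-and-reprompt workaround described above can be sketched as a loop. Note that `segment` and `inpaint` below are hypothetical toy stand-ins (the real calls would go to EVF-SAM and an inpainting model); only the loop structure is the point:

```python
import numpy as np

def segment(image, prompt):
    """Toy stand-in: return a mask for one 'instance' matching the prompt
    (here, a single matching pixel), or None if nothing matches."""
    ys, xs = np.nonzero(image == prompt)
    if ys.size == 0:
        return None
    mask = np.zeros(image.shape, dtype=bool)
    mask[ys[0], xs[0]] = True
    return mask

def inpaint(image, mask):
    """Toy stand-in: erase the segmented region so it cannot be found again."""
    out = image.copy()
    out[mask] = 0
    return out

def segment_all(image, prompt, max_rounds=20):
    """Segment one instance, inpaint it away, and re-prompt until done."""
    masks = []
    for _ in range(max_rounds):  # hard cap guards against non-termination
        mask = segment(image, prompt)
        if mask is None:  # nothing matching the prompt remains
            break
        masks.append(mask)
        image = inpaint(image, mask)
    return masks

# three "objects" labeled 7 in a toy label image
img = np.zeros((4, 4), dtype=int)
img[0, 0] = img[1, 2] = img[3, 3] = 7
print(len(segment_all(img, 7)))  # 3
```

In practice a minimum-area threshold on each returned mask is a useful extra stopping condition, since real segmentation models rarely return an empty mask outright.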

CoderZhangYx commented 3 months ago

> We have supported SAM-2! Try our code! Tell me if you have any questions using our model!
>
> Thanks for the quick integration with SAM2! What would you recommend for the scenario when there are multiple objects fitting the textual description on the image. Currently the model only segments one of them. I went with inpainting segmented objects, and the prompting the model again until there is no more. But this solution is rather slow, and has many edge cases.

An interesting implementation! Currently our model may not behave well on multi-object segmentation. This is because SAM is an instance-level segmentation model: during training, one interactive prompt is aligned with one instance mask. However, there are some multi-stage workarounds. For example, you can:

  1. call an LLM (GPT, Gemini, Qwen-VL, ...) to generate a separate text description for each object you want to segment in your image;
  2. use batch inference to generate separate masks and fuse them into one;
  3. call your inpainting model.
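Step 2 above (fusing per-prompt instance masks into one semantic mask) amounts to a pixel-wise union. A minimal sketch, assuming the batched inference has already produced one binary mask per generated description:

```python
import numpy as np

def fuse_masks(masks):
    """Fuse a batch of binary instance masks into one semantic mask.

    masks: iterable of HxW boolean (or 0/1) arrays, one per text prompt,
    e.g. one mask per object description generated in step 1.
    Returns a single HxW boolean mask covering all instances.
    """
    masks = [np.asarray(m, dtype=bool) for m in masks]
    fused = np.zeros_like(masks[0])
    for m in masks:
        fused |= m  # union: foreground if any instance covers the pixel
    return fused

# toy example: two 4x4 instance masks with two pixels each
a = np.zeros((4, 4), dtype=bool); a[0, :2] = True
b = np.zeros((4, 4), dtype=bool); b[3, 2:] = True
print(int(fuse_masks([a, b]).sum()))  # 4
```

Logical OR keeps every instance; if the masks may overlap and you instead want per-instance labels, assigning each mask a distinct integer id in a label map is the usual alternative.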

Furthermore, you may try some other terrific works that support semantic-level interactive segmentation!