hustvl / EVF-SAM

Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Apache License 2.0

segment multiple objects #24

Closed GinnyXiao closed 2 months ago

GinnyXiao commented 2 months ago

Thanks to the authors for this wonderful work! I was wondering whether EVF-SAM can segment multiple objects given a somewhat vague prompt. For instance, in the zebra example, if given the text prompt "zebra" (instead of "zebra top left"), would the model be able to produce all 3 zebra detections?

I did this experiment and here's the result I got: zebra_vis

Any thoughts on how to extend the model's ability to segment multiple objects? I understand that the task this paper is trying to solve is referring expression segmentation, and those datasets usually provide text-mask pairs that refer to only one object per image. I was wondering whether it is possible to extend the model to handle any kind of text prompt, including vague ones that correspond to multiple targets, since the VLM features generated by the BEiT-3 model should be good enough to handle this.

Thanks!

GinnyXiao commented 2 months ago

When I set the multimask_output variable to True in inference.py (using SAM 2),

pred_mask = model.inference(
        image_sam.unsqueeze(0),
        image_beit.unsqueeze(0),
        input_ids,
        resize_list=[resize_shape],
        original_size_list=original_size_list,
        multimask_output=True
    )

the model still gives only one prediction: zebra_vis_multi

CoderZhangYx commented 2 months ago
  1. In our next checkpoint release, we are going to support semantic-level segmentation. That is to say, given the prompt "zebra", one mask containing the segmentation of all zebras in the picture would be predicted. However, we haven't figured out a way to output separate masks for each instance.
  2. As for multimask_output, it is a setting of SAM that makes the model predict three candidate masks for a single prompt. We manually select the mask with the highest IoU score in our code, so you get only one predicted mask. https://github.com/hustvl/EVF-SAM/blob/781bee05fd94e6bb59c478bab44e6c51bf7959e5/model/evf_sam.py#L289-L292
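
For reference, the selection step described above typically looks like the sketch below. This is a minimal illustration, not the repository's exact code; the tensor names masks and iou_predictions are assumed placeholders for the SAM mask-decoder outputs.

import torch

# Assumed shapes: masks is (B, 3, H, W), iou_predictions is (B, 3).
# With multimask_output=True, SAM produces three candidate masks per prompt;
# keeping only the highest-scoring one yields a single prediction.
def select_best_mask(masks: torch.Tensor, iou_predictions: torch.Tensor) -> torch.Tensor:
    best_idx = torch.argmax(iou_predictions, dim=1)              # best candidate per image
    batch_idx = torch.arange(masks.shape[0], device=masks.device)
    return masks[batch_idx, best_idx]                            # (B, H, W)

# To inspect all three candidates instead, skip the selection and keep masks as-is.
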
GinnyXiao commented 2 months ago

@CoderZhangYx Thanks for the quick and detailed response!

I have a follow-up question regarding the multimask_output variable in SAM. Does SAM always predict three valid masks given one prompt? In the paper, the authors use multimask output to deal with ambiguous prompts:

The requirement of a “valid” mask simply means that even when a prompt is ambiguous and could refer to multiple objects (e.g., recall the shirt vs. person example, and see Fig. 3), the output should be a reasonable mask for at least one of those objects.

The number "three" is reasonable for dealing with an ambiguous visual prompt, e.g., a point. However, when dealing with a text prompt, for example a vague category label such as "pedestrian", "car", or "zebra", shouldn't the model be allowed to output more than three valid masks?

Thanks again!

CoderZhangYx commented 2 months ago

Once multimask_output is set to True, the model will be forced to predict 3 masks corresponding to the visual prompt.

For Referring Expression Segmentation datasets (e.g., RefCOCO), the text-region pairs are unambiguous. Each text prompt refers to only a unique element in the image, so there is no need to worry about the ambiguity problem when using RES datasets. If you want to employ semantic segmentation, instance segmentation, or any other segmentation datasets to build a unified model, some strategies need to be considered to solve the ambiguity problem. We will present our strategies in a future update of the arXiv paper.
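
If you need separate instance masks from a semantic-level prediction in the meantime, one rough workaround (not part of EVF-SAM, just an assumed post-processing step) is to split the binary mask into connected components; this only works when instances do not touch or overlap in the mask.

import numpy as np
from scipy import ndimage

# Hypothetical post-processing: split one binary "all zebras" mask into
# per-instance masks via connected-component labeling. min_area filters
# out small speckles; both names are illustrative, not from the repo.
def split_into_instances(binary_mask: np.ndarray, min_area: int = 100):
    labeled, num_components = ndimage.label(binary_mask.astype(np.uint8))
    instances = []
    for label_id in range(1, num_components + 1):
        instance = labeled == label_id
        if instance.sum() >= min_area:
            instances.append(instance)
    return instances
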

GinnyXiao commented 2 months ago

@CoderZhangYx Thanks for the reply! Looking forward to your updated manuscript!

CoderZhangYx commented 2 months ago

Our latest checkpoint release supports semantic-level segmentation and part segmentation.