When I set the `multimask_output` variable to `True` in inference.py (using SAM 2),
```python
pred_mask = model.inference(
    image_sam.unsqueeze(0),
    image_beit.unsqueeze(0),
    input_ids,
    resize_list=[resize_shape],
    original_size_list=original_size_list,
    multimask_output=True,
)
```
the model is still giving only one prediction:
`multimask_output` is a setting of SAM which enables the model to predict based on three different thresholds. We manually select the mask with the highest IoU score in our code, so you get only one predicted mask: https://github.com/hustvl/EVF-SAM/blob/781bee05fd94e6bb59c478bab44e6c51bf7959e5/model/evf_sam.py#L289-L292
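In other words, the selection boils down to something like the following (a minimal sketch, not the repo's exact code; it assumes the decoder returns mask logits of shape (B, 3, H, W) and IoU predictions of shape (B, 3) as in the original SAM, and the names are illustrative):

```python
import torch

# Minimal sketch: pick the single mask with the highest predicted IoU score.
# masks: (B, 3, H, W) mask logits; iou_predictions: (B, 3) predicted scores.
def select_best_mask(masks: torch.Tensor, iou_predictions: torch.Tensor) -> torch.Tensor:
    best = torch.argmax(iou_predictions, dim=1)        # (B,) index of the top-scoring mask
    return masks[torch.arange(masks.size(0)), best]    # (B, H, W)
```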
@CoderZhangYx Thanks for the quick and detailed response!

I have a follow-up question regarding the `multimask_output` variable in SAM. Does SAM always predict three valid masks given one prompt? In the paper, the authors use multimask output to deal with ambiguous prompts:
The requirement of a “valid” mask simply means that even when a prompt is ambiguous and could refer to multiple objects (e.g., recall the shirt vs. person example, and see Fig. 3), the output should be a reasonable mask for at least one of those objects.
The number "three" is reasonable for dealing with an ambiguous visual prompt, e.g. a point. However, when dealing with a text prompt, for example a vague category label such as "pedestrian", "car", or "zebra", the model should not be limited to outputting only three valid masks, should it?
Thanks again!
Once `multimask_output` is set to `True`, the model will be forced to predict 3 masks corresponding to the visual prompt.
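For context, inside the original segment-anything mask decoder the flag works roughly like this (paraphrased, so treat it as a sketch rather than the exact upstream implementation): the decoder always produces four candidate masks, and the flag selects either the single default mask or the other three.

```python
import torch

# Paraphrase of the selection step in SAM's mask decoder.
# masks: (B, 4, H, W) candidate masks; iou_pred: (B, 4) predicted IoU scores.
def select_masks(masks: torch.Tensor, iou_pred: torch.Tensor, multimask_output: bool):
    mask_slice = slice(1, None) if multimask_output else slice(0, 1)
    return masks[:, mask_slice, :, :], iou_pred[:, mask_slice]  # 3 masks or 1 mask
```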
For Referring Expression Segmentation datasets (e.g. RefCOCO), the text-region pairs are unambiguous: each text prompt refers to a unique element in the image, so there is no need to worry about the ambiguity problem when using RES datasets. If you want to employ semantic segmentation, instance segmentation, or any other segmentation datasets to build a unified model, some strategies may need to be considered to solve the ambiguity problem. We will present our strategies in a future update of our arXiv paper.
@CoderZhangYx Thanks for the reply! Looking forward to your updated manuscript!
Our latest checkpoint release supports semantic-level segmentation and part segmentation.
Thanks to the authors for this wonderful work! I was wondering if EVF-SAM is able to segment multiple objects given a somewhat vague prompt. For instance, in the zebra example, if given the text prompt "zebra" (instead of "zebra top left"), would the model be able to give all 3 zebra detections?
I did this experiment and here's the result I got:
Any thoughts on how to extend the model's ability to segment multiple objects? I understand that the task this paper is trying to solve is referring expression segmentation, and those datasets usually provide segment-text pairs referring to only one object per image. I was wondering if it's possible to extend the model to handle any kind of text prompt, including vague ones that could correspond to multiple targets, since the VLM features generated by the BEiT-3 model should be good enough to handle this; a naive sketch of what I mean is below.
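One naive direction, purely as an illustration (it assumes the decoder exposes per-mask IoU predictions as in the original SAM; none of these names come from the EVF-SAM code): keep every mask whose predicted score clears a threshold instead of taking only the argmax.

```python
import torch

# Hypothetical sketch: keep all sufficiently confident masks rather than the
# single best one, so a vague prompt like "zebra" could return several masks.
# masks: (3, H, W) mask logits; iou_predictions: (3,) predicted scores.
def select_confident_masks(masks, iou_predictions, threshold=0.8):
    keep = iou_predictions >= threshold   # boolean over the 3 candidates
    return masks[keep]                    # (K, H, W) with 0 <= K <= 3
```

Of course, with only three mask tokens this caps out at three objects per prompt, which is exactly the limitation raised above, so a real multi-instance extension would presumably need more than a different selection rule.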
Thanks!