MASA adapter usage with segmentation foundation models #31

ederev commented 3 months ago

Hello! Thank you for your great and interesting work. I have a question about using the MASA adapter with segmentation models. Your article states that you have "designed a universal MASA adapter which can work in tandem with foundational segmentation or detection models and enable them to track any detected objects". The provided demo script, demo/video_demo_with_text.py, supports (apart from the unified model) only detection + adapter, with SAM segmentation applied as post-processing to a video that already has tracks. This does not quite match the described design. So my question is: could you please provide more details on using the MASA adapter + segmentation? For example, how exactly should I use the demo script (if it is applicable), or could you share a code snippet? Did I understand the idea behind the inference correctly, with reference to Figure 3 (b) in https://arxiv.org/pdf/2406.04221 ?

Also, it is not clear which models are used, and how, in the comparison shown in Figure 12 (Qualitative Comparison between MASA and Deva).

Many thanks for considering my request.

siyuanliii commented 2 months ago

Thanks for the question! “MASA adapter + Segmentation” means that we use SAM as the base detection model. SAM outputs masks for every object in the scene, and MASA is responsible for associating them. DEVA also requires a pre-trained model to provide instance masks for multiple object tracking and segmentation tasks. Figure 12 shows the comparison between MASA and DEVA on BDD100K sequences: the same instance segmentation model (UNINEXT) provides the masks, and MASA and DEVA are then each used for association.
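To make "associating" concrete, here is a generic appearance-matching sketch. It is not MASA's actual matching rule: the cosine similarity, the greedy assignment, the `min_sim` threshold, and the function names are all assumptions chosen for illustration. It only shows the general idea of giving each new detection the ID of the existing track whose appearance embedding it most resembles.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def associate(track_embs, det_embs, min_sim=0.5):
    """Greedy appearance matching (illustrative only, not MASA's rule).

    track_embs: {track_id: embedding} for existing tracks
    det_embs:   list of embeddings, one per detection in the current frame
    Returns one track ID per detection; unmatched detections get new IDs.
    """
    # Score every (track, detection) pair and visit the best pairs first.
    pairs = sorted(
        ((cosine(t, d), tid, j)
         for tid, t in track_embs.items()
         for j, d in enumerate(det_embs)),
        reverse=True,
    )
    assigned = [None] * len(det_embs)
    used = set()
    for sim, tid, j in pairs:
        if sim < min_sim:
            break  # all remaining pairs are too dissimilar
        if tid in used or assigned[j] is not None:
            continue
        assigned[j] = tid
        used.add(tid)
    # Start a new track for every detection left unmatched.
    next_id = max(track_embs, default=-1) + 1
    for j in range(len(assigned)):
        if assigned[j] is None:
            assigned[j] = next_id
            next_id += 1
    return assigned

tracks = {0: [1.0, 0.0], 1: [0.0, 1.0]}
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
ids = associate(tracks, detections)
print(ids)
```

In this toy example the first two detections take over the IDs of the tracks they resemble, and the third starts a new track.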

ederev commented 2 months ago

Alright, thank you. As far as I know, DEVA uses SAM masks directly, whereas the MASA adapter makes no direct use of segmentation masks, only detection bboxes. So, if I understand correctly, with your current implementation of the MASA adapter for segmentation I should get segmentation masks first and after that convert them to bboxes to use as detection model. Am I right? But in that case it is not clear how this helps to improve segmentation (apart from ID association for the masks inscribed in the bboxes), since we are doing the same bbox tracking.

I would appreciate any guidance on this issue.
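For a model that does output instance masks directly, the mask-to-bbox conversion described above could look like the following. This is an illustrative sketch using NumPy, not code from the MASA repository; the function name `mask_to_bbox` and the inclusive pixel-coordinate convention are assumptions.

```python
import numpy as np

def mask_to_bbox(mask):
    """Return the tight (x1, y1, x2, y2) box around a binary mask.

    Coordinates are inclusive pixel indices. Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy mask: a filled rectangle covering rows 2..4 and columns 3..6.
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:7] = True
box = mask_to_bbox(mask)
print(box)
```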

siyuanliii commented 2 months ago

Thanks! "I should get segmentation masks first and after that convert them to bboxes to use as detection model." SAM is actually a prompt-driven model; it doesn't give you masks directly. The prompt can be bboxes or points. Thus, the actual order is: you get the bboxes first, then call SAM with them as prompts to get the masks. Those boxes can be tracked using MASA. MASA is a pure appearance model for association, so it is not intended to improve segmentation performance.
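The order described above (boxes first, then tracking, then box-prompted masks) can be sketched as a small loop. This is only a data-flow illustration under stated assumptions: `track_then_segment` and its three callables (`detect`, `associate`, `segment`) are hypothetical placeholders, not functions from the MASA or SAM codebases, and the toy lambdas stand in for real models.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def track_then_segment(
    frames: Iterable[Any],
    detect: Callable[[Any], List[Box]],
    associate: Callable[[Any, List[Box]], List[int]],
    segment: Callable[[Any, Box], Any],
) -> List[Dict[int, Any]]:
    """For each frame, return {track_id: mask}.

    All three callables are hypothetical placeholders:
      detect(frame)           -> boxes from any detector
      associate(frame, boxes) -> one track ID per box (the role MASA plays)
      segment(frame, box)     -> mask for a box prompt (the role SAM plays)
    """
    results = []
    for frame in frames:
        boxes = detect(frame)                # 1. get the boxes first
        track_ids = associate(frame, boxes)  # 2. track the boxes
        # 3. prompt the segmenter with each box; the mask inherits the box's ID
        results.append(
            {tid: segment(frame, box) for tid, box in zip(track_ids, boxes)}
        )
    return results

# Toy stand-ins, only to show the data flow.
frames = ["frame0", "frame1"]
detect = lambda frame: [(0, 0, 10, 10), (20, 20, 30, 30)]
associate = lambda frame, boxes: list(range(len(boxes)))
segment = lambda frame, box: f"mask{box}"

out = track_then_segment(frames, detect, associate, segment)
print(out[0][1])
```

Because the masks are produced from boxes that already carry track IDs, the segmentation quality depends on the segmenter alone, which is consistent with the point that the association step is not meant to improve it.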
