Add Matching Anything by Segmenting Anything (MASA) MOT tracking model

rolson24 commented 4 months ago

Model description

I know the transformers library has not included object tracking models in the past, but this one can either plug into any object detection model or be an end-to-end open world tracking model by using a backbone line Grounding-Dino, DETR or Sam, all of which are already implemented in transformers. It achieves state of the art on the Open-vocabulary MOT benchmark.

Open source status

[X] The model implementation is available
[X] The model weights are available

Provide useful links for the implementation

Authors: @siyuanliii, @lkeab, @martin-danelljan, @mattiasegu

Code: https://github.com/siyuanliii/masa/tree/main Weights: https://huggingface.co/dereksiyuanli/masa/tree/main Paper: https://arxiv.org/pdf/2406.04221

rolson24 commented 4 months ago

@qubvel

I am interested in contributing MASA to transformers. I'm wondering if you could give me some advice on how we should go about it. I read through the "How to add a model to transformers 🤗?" doc, but I am unsure if all the models should be added and if we should also the object tracking algorithm because the model alone is essentially just an appearance feature extractor.

qubvel commented 4 months ago

Hi, @rolson24, thanks for your interest! I would be glad to collaborate with you on this!

The model indeed looks interesting and can be a good complement for object detection/segmentation models in the library. However, we have a few concerns and risks regarding adding this model:

1) The model uses modulated deformable conv for masa adapter, thus we have to check if we can include custom cuda kernels and load them without issues with pytorch. MMdetection implementation probably would be a bit complicated to extract, however, there is another implementation here.

2) As far as I understand, there can be at least two model modifications, and both of them are not transformers actually:

with provided boxes: resnet50 backbone + masa adapter + tracking head -> not a transformer model, but can be used with any detection model for tracking
no provided boxes: detection model (e.g. grounding dino) for feature extraction and detection + masa adapter + tracking head -> we can reuse existing model from transformers, but other parts are also not transformers

3) It requires a Tracker, a new object that was not introduced previously in transformers, without Tracker the model is not that usable.

@rolson24 let me know if I missing something!

We discussed a bit with @amyeroberts and there might be an option to add a model on the hub as a standalone module, without including it directly in the library, but anyone would be able to use it with transformers and trust_remote_code=True.

Maybe @NielsRogge also has an opinion regarding this.

rolson24 commented 4 months ago

Ya that all makes sense. Seeing as many of the new modules are not actually transformer based, it would probably be better to do the standalone module, similar to what NVIDIA did with Mamba Vision. Maybe I will start by trying to write a version of the modulated deformable convolution layer. It seems like pytorch implemented the modulated deformable convolution in 2020: https://github.com/pytorch/vision/blob/3e60dbd590d1aef53e443d0d2dcb792a91afe481/torchvision/ops/deform_conv.py#L14 So it shouldn't be too bad to write.

qubvel commented 4 months ago

similar to what NVIDIA did with Mamba Vision

yes, exactly!

Maybe I will start by trying to write a version of the modulated deformable convolution layer.

Sounds great, let me know if you resolve this!

I checked Grounding DINO weight in transformers and provided by MASA repo, it looks like there are some differences in layernorm weights which cause small differences in feature logits passed to the masa adapter.

rolson24 commented 4 months ago

I think I have a pretty good implementation of the modulated deformable convolution block, but I had to modify it slightly because the original MASA adapter was performing an upsample after each convolution block instead of before which was causing the input to the DeformConv block to be too small. Somehow it was working with the MMCV implementation (it didn't have shape assertions), but the pytorch version had some error checking and was catching it. I fixed it to now do the correct operation, but the original MASA adapter was trained on the incorrect method and they are not equivalent operations, so the models will probably need to be fine-tuned or retrained with the new code.

huggingface / transformers