rolson24 opened 4 months ago
@qubvel
I am interested in contributing MASA to transformers. I'm wondering if you could give me some advice on how we should go about it. I read through the "How to add a model to transformers 🤗?" doc, but I am unsure whether all of the models should be added, and whether we should also include the object tracking algorithm, since the model alone is essentially just an appearance feature extractor.
Hi, @rolson24, thanks for your interest! I would be glad to collaborate with you on this!
The model indeed looks interesting and can be a good complement for object detection/segmentation models in the library. However, we have a few concerns and risks regarding adding this model:
1) The model uses modulated deformable convolution for the MASA adapter, so we have to check whether we can include custom CUDA kernels and load them with PyTorch without issues. The MMDetection implementation would probably be a bit complicated to extract; however, there is another implementation here.
2) As far as I understand, there can be at least two model modifications, and neither of them is actually transformer-based.
3) It requires a Tracker, a new kind of object that has not been introduced in transformers before; without the Tracker, the model is not that usable.
@rolson24 let me know if I'm missing something!
We discussed this a bit with @amyeroberts, and there might be an option to add the model on the Hub as a standalone module, without including it directly in the library, but anyone would still be able to use it with transformers and `trust_remote_code=True`.
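For reference, loading a Hub-hosted custom model generally looks something like this (the repo id below is just a placeholder, not an existing repo):

```python
from transformers import AutoModel

# Placeholder repo id -- an actual MASA repo on the Hub does not exist yet.
# trust_remote_code=True tells transformers to download and run the modeling
# code stored alongside the weights in the Hub repo.
model = AutoModel.from_pretrained("username/masa", trust_remote_code=True)
```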
Maybe @NielsRogge also has an opinion regarding this.
Yeah, that all makes sense. Seeing as many of the new modules are not actually transformer-based, it would probably be better to do the standalone module, similar to what NVIDIA did with Mamba Vision. Maybe I will start by trying to write a version of the modulated deformable convolution layer. It seems like PyTorch implemented modulated deformable convolution in torchvision back in 2020: https://github.com/pytorch/vision/blob/3e60dbd590d1aef53e443d0d2dcb792a91afe481/torchvision/ops/deform_conv.py#L14, so it shouldn't be too bad to write.
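For reference, a rough sketch of what I have in mind, built on torchvision's `deform_conv2d` with its `mask` argument (DCNv2-style). The layer layout and init choices here are mine, not taken from the MASA repo:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class ModulatedDeformConv2d(nn.Module):
    """Modulated deformable conv (DCNv2) on top of torchvision, no custom CUDA kernels."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.weight = nn.Parameter(
            torch.empty(out_channels, in_channels, kernel_size, kernel_size)
        )
        self.bias = nn.Parameter(torch.zeros(out_channels))
        nn.init.kaiming_uniform_(self.weight, a=1)
        # A plain conv predicts 2*k*k offset channels and k*k mask channels from
        # the input feature map (zero-init so the layer starts as a regular conv).
        self.offset_mask_conv = nn.Conv2d(
            in_channels, 3 * kernel_size * kernel_size,
            kernel_size=kernel_size, stride=stride, padding=padding,
        )
        nn.init.zeros_(self.offset_mask_conv.weight)
        nn.init.zeros_(self.offset_mask_conv.bias)

    def forward(self, x):
        k2 = self.kernel_size * self.kernel_size
        out = self.offset_mask_conv(x)
        offset, mask = out[:, : 2 * k2], torch.sigmoid(out[:, 2 * k2 :])
        return deform_conv2d(
            x, offset, self.weight, self.bias,
            stride=self.stride, padding=self.padding, mask=mask,
        )


# Quick shape check
layer = ModulatedDeformConv2d(256, 256)
print(layer(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])
```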
> similar to what NVIDIA did with Mamba Vision

yes, exactly!
> Maybe I will start by trying to write a version of the modulated deformable convolution layer.

Sounds great, let me know if you resolve this!
I checked the Grounding DINO weights in transformers against the ones provided by the MASA repo, and it looks like there are some differences in the layernorm weights, which cause small differences in the feature logits passed to the MASA adapter.
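For reference, the check I ran looks roughly like this (the file paths are placeholders, and it assumes both files are plain state dicts whose parameter names have already been mapped to the same convention):

```python
import torch

# Placeholder paths: transformers Grounding DINO weights vs. the ones shipped with MASA.
hf_state = torch.load("grounding_dino_transformers.pth", map_location="cpu")
masa_state = torch.load("grounding_dino_masa.pth", map_location="cpu")

for name, hf_param in hf_state.items():
    if "layer_norm" not in name and "layernorm" not in name:
        continue  # only compare layernorm parameters
    if name not in masa_state:
        print(f"missing in MASA checkpoint: {name}")
        continue
    diff = (hf_param - masa_state[name]).abs().max().item()
    if diff > 1e-6:
        print(f"{name}: max abs diff {diff:.3e}")
```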
I think I have a pretty good implementation of the modulated deformable convolution block, but I had to modify it slightly: the original MASA adapter performs the upsample after each convolution block instead of before, which causes the input to the DeformConv block to be too small. Somehow this worked with the MMCV implementation (it has no shape assertions), but the PyTorch version does some error checking and catches it. I fixed the code to do the operation in the correct order, but since the original MASA adapter was trained with the incorrect method and the two operations are not equivalent, the models will probably need to be fine-tuned or retrained with the new code.
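Roughly, the fix just moves the upsample so it happens before the deformable conv runs, at the target resolution. A simplified sketch (the block structure, norm, and activation are my own placeholders, and `conv` is meant to be something like the ModulatedDeformConv2d wrapper from my earlier comment):

```python
import torch.nn as nn


class AdapterUpBlock(nn.Module):
    """Corrected ordering: upsample first, then run the deformable conv, so the
    offsets/mask predicted inside it match the upsampled spatial size."""

    def __init__(self, conv: nn.Module, channels: int, scale_factor: int = 2):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=scale_factor, mode="nearest")
        self.conv = conv                        # e.g. a modulated deformable conv layer
        self.norm = nn.GroupNorm(32, channels)  # placeholder norm/activation choices
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.upsample(x)  # the fix: upsample *before* the conv block, not after
        return self.act(self.norm(self.conv(x)))
```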
Model description
I know the transformers library has not included object tracking models in the past, but this one can either plug into any object detection model or act as an end-to-end open-world tracking model by using a backbone like Grounding DINO, DETR, or SAM, all of which are already implemented in transformers. It achieves state-of-the-art results on the Open-Vocabulary MOT benchmark.
Open source status
Provide useful links for the implementation
Authors: @siyuanliii, @lkeab, @martin-danelljan, @mattiasegu
Code: https://github.com/siyuanliii/masa/tree/main
Weights: https://huggingface.co/dereksiyuanli/masa/tree/main
Paper: https://arxiv.org/pdf/2406.04221