keras-team / keras-cv

Industry-strength Computer Vision workflows with Keras

Add Grounding DINO #2114

Open innat opened 8 months ago

innat commented 8 months ago

Short Description

Zero-shot object detection model.

Papers

https://arxiv.org/abs/2303.05499

Existing Implementations

https://github.com/IDEA-Research/GroundingDINO

Other Information

Combined with (a) Stable Diffusion or (b) Segment Anything, among others, the range of possible applications is huge.
Prerequisites:
- image backbone: Swin Transformer
- text backbone: BERT

innat commented 5 months ago

TODO Components

[ ] Swin Transformer
[ ] Multi-scale Deformable Attention (references: official-gdino, mmcv, official)
[ ] DeformableTransformerEncoder/DecoderLayer
[ ] BiAttentionBlock (bidirectional MHA: text->image and image->text)
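
To make the BiAttentionBlock item more concrete, here is a minimal NumPy sketch of bidirectional cross-attention between image and text tokens. It is an illustration under assumptions, not the official Grounding DINO code: it uses a single head, shares one score matrix between both directions, and omits layer norm, dropout, and masking. The names `bi_attention`, `init_params`, and all parameter keys are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d_image, d_text, d_attn):
    # Hypothetical parameter layout, chosen only for this sketch.
    def w(shape):
        return rng.normal(scale=0.02, size=shape)

    return {
        "w_q_image": w((d_image, d_attn)),  # image tokens -> shared space
        "w_k_text": w((d_text, d_attn)),    # text tokens -> shared space
        "w_v_image": w((d_image, d_attn)),
        "w_v_text": w((d_text, d_attn)),
        "w_o_image": w((d_attn, d_image)),  # back to image width
        "w_o_text": w((d_attn, d_text)),    # back to text width
    }


def bi_attention(image_feats, text_feats, params):
    """Single-head bidirectional cross-attention (simplified sketch).

    image_feats: (num_image_tokens, d_image)
    text_feats:  (num_text_tokens, d_text)
    Returns the updated (image_feats, text_feats) with residual connections.
    """
    q_img = image_feats @ params["w_q_image"]
    k_txt = text_feats @ params["w_k_text"]
    v_img = image_feats @ params["w_v_image"]
    v_txt = text_feats @ params["w_v_text"]

    # One score matrix is reused for both directions.
    scores = q_img @ k_txt.T / np.sqrt(q_img.shape[-1])  # (Ni, Nt)

    attn_img_over_txt = softmax(scores, axis=-1)    # image tokens attend to text
    attn_txt_over_img = softmax(scores.T, axis=-1)  # text tokens attend to image

    image_update = attn_img_over_txt @ v_txt @ params["w_o_image"]
    text_update = attn_txt_over_img @ v_img @ params["w_o_text"]
    return image_feats + image_update, text_feats + text_update
```

A quick shape check: with 5 image tokens of width 8 and 3 text tokens of width 6, both outputs keep the shapes of their inputs, so the block can be stacked between encoder layers.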

tirthasheshpatel commented 5 months ago

@innat Are you planning/volunteering to work on this or any of the components?

I see you proposed #2319, which seems to replicate the Swin Transformer of the Grounding DINO implementation. Thanks for the PR!

This is next on my TODO list. Let me know if you'd like to take something up when you have time; I can help review and test! I can take the rest of the components and the weight transfer. BTW the list of components with references is very useful, thanks!

innat commented 5 months ago

@tirthasheshpatel https://github.com/keras-team/keras-cv/pull/2319 is about Video Swin modelling, whereas I think Grounding DINO (g-dino) needs the image Swin model, so this issue needs to progress first as a prerequisite of the current issue. Here is one reimplementation of the image Swin model in Keras 2.

The components listed above are only some of the high-level components of g-dino. Like DETR, it also has custom CUDA operations, which might complicate adding it. The other components can be added one by one initially. If you are currently working on it, please continue. If I can find some time, I will contribute the rest of the components. This kind of model (zero-shot detection) is quite useful and will surely add value to keras-cv.
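
On the custom CUDA operations: the core of multi-scale deformable attention can be expressed with plain array ops, at some cost in speed. The NumPy sketch below is only an illustration of that idea under stated assumptions: a single head, sampling offsets already given in normalized [0, 1] coordinates, zero padding outside the feature map, and pixel centers at half-integer positions. The function names `bilinear_sample` and `ms_deform_attn` are hypothetical and are not taken from any existing implementation.

```python
import numpy as np


def bilinear_sample(feature_map, xy):
    """Bilinearly sample feature_map (H, W, C) at normalized points xy (..., 2).

    xy holds (x, y) in [0, 1]; samples falling outside the map contribute zero.
    """
    H, W, C = feature_map.shape
    # Map normalized coordinates to pixel coordinates (pixel centers at i + 0.5).
    x = xy[..., 0] * W - 0.5
    y = xy[..., 1] * H - 0.5
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)

    out = np.zeros(xy.shape[:-1] + (C,), dtype=float)
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            # Weight of this corner shrinks linearly with distance to the point.
            w = (1.0 - np.abs(x - xi)) * (1.0 - np.abs(y - yi))
            valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            xi_c = np.clip(xi, 0, W - 1)
            yi_c = np.clip(yi, 0, H - 1)
            out += (w * valid)[..., None] * feature_map[yi_c, xi_c]
    return out


def ms_deform_attn(value_maps, reference_points, offsets, attn_weights):
    """Single-head multi-scale deformable attention (simplified sketch).

    value_maps:       list of L arrays, each (H_l, W_l, C)
    reference_points: (Q, 2) normalized (x, y) per query
    offsets:          (Q, L, K, 2) normalized sampling offsets
    attn_weights:     (Q, L, K), expected to sum to 1 over (L, K) per query
    Returns an array of shape (Q, C).
    """
    num_queries = reference_points.shape[0]
    channels = value_maps[0].shape[-1]
    out = np.zeros((num_queries, channels), dtype=float)
    for level, fmap in enumerate(value_maps):
        locations = reference_points[:, None, :] + offsets[:, level]  # (Q, K, 2)
        sampled = bilinear_sample(fmap, locations)                    # (Q, K, C)
        out += (attn_weights[:, level, :, None] * sampled).sum(axis=1)
    return out
```

In a real layer the offsets and attention weights would be predicted from the query features by learned projections; here they are passed in directly so the sampling step can be tested in isolation.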