microsoft / GLIP

Grounded Language-Image Pre-training
MIT License
2.15k stars · 190 forks

Try OV-DINO, a more powerful open-vocabulary detector. #172

Open wanghao9610 opened 1 month ago

wanghao9610 commented 1 month ago

Thanks for the awesome GLIP! I'd like to share our recent work, 🦖OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion.

We have released the evaluation, fine-tuning, and demo code in our project — feel free to try our model in your applications.

Project: https://wanghao9610.github.io/OV-DINO

Paper: https://arxiv.org/abs/2407.07844

Code: https://github.com/wanghao9610/OV-DINO

Demo: http://47.115.200.157:7860/

Everyone is welcome to try our model, and feel free to raise an issue if you encounter any problems.

crazness commented 1 month ago

How much data did you use to train the model, and what is the number of parameters?

wanghao9610 commented 1 month ago

@crazness OV-DINO is pre-trained on diverse data sources within a unified framework, including the O365, GoldG, and CC1M‡ datasets. The O365 and GoldG datasets are the same as those used in GLIP, while CC1M‡ contains only 1M image-text pairs — far fewer than GLIP's Cap4M / Cap24M — yet OV-DINO achieves better performance. OV-DINO has 166M parameters, while GLIP has 232M. You can find more details in our paper.