huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Object detection training/fine-tuning for Owl-vit/Owlv2 #33664

Open catalys1 opened 2 months ago

catalys1 commented 2 months ago

Feature request

Currently the OWL-ViT models support inference and CLIP-style contrastive pre-training, but they don't provide a way to train (or fine-tune) the detection part of the model. According to the paper, detection training uses a DETR-style bipartite-matching loss.
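For concreteness, here is a minimal sketch of what such a DETR-style loss could look like on top of `OwlViTForObjectDetection` outputs. The Hungarian matching over class/L1/GIoU costs follows the standard DETR recipe; the cost and loss weights (2.0/5.0/2.0) and the plain sigmoid-BCE classification term are assumptions for illustration (the paper uses a focal-loss variant), and `detection_loss` is a hypothetical helper, not an existing transformers API:

```python
# Sketch of a DETR-style bipartite-matching loss for OwlViT outputs.
# The 2.0/5.0/2.0 weights are common DETR settings, assumed here.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_convert, generalized_box_iou
from transformers import OwlViTForObjectDetection, OwlViTProcessor


def detection_loss(pred_logits, pred_boxes, target_labels, target_boxes):
    """DETR-style loss for a single image.

    pred_logits:   (num_preds, num_text_queries) per-box class logits
    pred_boxes:    (num_preds, 4) normalized (cx, cy, w, h)
    target_labels: (num_targets,) indices into the text queries
    target_boxes:  (num_targets, 4) normalized (cx, cy, w, h)
    """
    pred_xyxy = box_convert(pred_boxes, "cxcywh", "xyxy")
    tgt_xyxy = box_convert(target_boxes, "cxcywh", "xyxy")

    # Pairwise matching costs: classification + L1 + GIoU.
    cost_class = -pred_logits.sigmoid()[:, target_labels]
    cost_bbox = torch.cdist(pred_boxes, target_boxes, p=1)
    cost_giou = -generalized_box_iou(pred_xyxy, tgt_xyxy)
    cost = 2.0 * cost_class + 5.0 * cost_bbox + 2.0 * cost_giou

    # Hungarian matching: one predicted box per ground-truth box.
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    pred_idx = torch.as_tensor(pred_idx, dtype=torch.long)
    tgt_idx = torch.as_tensor(tgt_idx, dtype=torch.long)

    # Classification: matched boxes target their label, all others background.
    target_classes = torch.zeros_like(pred_logits)
    target_classes[pred_idx, target_labels[tgt_idx]] = 1.0
    loss_class = F.binary_cross_entropy_with_logits(pred_logits, target_classes)

    # Box regression on matched pairs only.
    loss_bbox = F.l1_loss(pred_boxes[pred_idx], target_boxes[tgt_idx])
    loss_giou = (
        1 - torch.diag(generalized_box_iou(pred_xyxy[pred_idx], tgt_xyxy[tgt_idx]))
    ).mean()
    return loss_class + 5.0 * loss_bbox + 2.0 * loss_giou


processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Dummy image and targets, just to show the wiring.
from PIL import Image
image = Image.new("RGB", (768, 768))
inputs = processor(text=[["a cat", "a dog"]], images=image, return_tensors="pt")
outputs = model(**inputs)

target_labels = torch.tensor([0])                    # first text query ("a cat")
target_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.3]])  # normalized cxcywh
loss = detection_loss(
    outputs.logits[0], outputs.pred_boxes[0], target_labels, target_boxes
)
loss.backward()
```

A real implementation would also need to batch the targets, handle images with no objects, and would probably want to mirror the loss utilities already used by the DETR-family models in the library.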

Motivation

It would be really awesome to be able to train or fine-tune one of these existing open-vocabulary object detection models.

Your contribution

I may be able to help with this, but I'm not sure to what extent at present.

LysandreJik commented 2 months ago

cc @molbap @qubvel

qubvel commented 2 months ago

Hi @catalys1, thank you so much for your feature request! We agree that the ability to fine-tune Owl-vit/Owlv2 would be a great addition. If you have the time and are interested in contributing, we would love to collaborate with you on this! Your help would be greatly appreciated 🤗

This PR might also be helpful.

daniel-bogdoll commented 1 month ago

@qubvel Could https://github.com/huggingface/transformers/pull/34057 be the solution here? I would also be happy to see the ability to fine-tune (some) ZeroShotObjectDetection models.

qubvel commented 1 month ago

Hi @daniel-bogdoll, I'm not sure the linked PR solves this issue; no object detection loss was added there. cc @yonigozlan