huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Object detection training/fine-tuning for Owl-vit/Owlv2 #33664

Open catalys1 opened 2 months ago

catalys1 commented 2 months ago

Feature request

Currently the OWL-ViT models support inference and CLIP-style contrastive pre-training, but they don't provide a way to train (or fine-tune) the detection part of the model. According to the paper, detection training uses a DETR-style bipartite-matching loss.
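For concreteness, here is a minimal sketch of what such a DETR-style loss could look like on top of `OwlViTForObjectDetection` outputs. The Hungarian matching over class/L1/GIoU costs follows the standard DETR recipe; the cost and loss weights (2.0/5.0/2.0) and the plain sigmoid-BCE classification term are assumptions for illustration (the paper uses a focal-loss variant), and `detection_loss` is a hypothetical helper, not an existing transformers API:

```python
# Sketch of a DETR-style bipartite-matching loss for OwlViT outputs.
# The 2.0/5.0/2.0 weights are common DETR settings, assumed here.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_convert, generalized_box_iou
from transformers import OwlViTForObjectDetection, OwlViTProcessor


def detection_loss(pred_logits, pred_boxes, target_labels, target_boxes):
    """DETR-style loss for a single image.

    pred_logits:   (num_preds, num_text_queries) per-box class logits
    pred_boxes:    (num_preds, 4) normalized (cx, cy, w, h)
    target_labels: (num_targets,) indices into the text queries
    target_boxes:  (num_targets, 4) normalized (cx, cy, w, h)
    """
    pred_xyxy = box_convert(pred_boxes, "cxcywh", "xyxy")
    tgt_xyxy = box_convert(target_boxes, "cxcywh", "xyxy")

    # Pairwise matching costs: classification + L1 + GIoU.
    cost_class = -pred_logits.sigmoid()[:, target_labels]
    cost_bbox = torch.cdist(pred_boxes, target_boxes, p=1)
    cost_giou = -generalized_box_iou(pred_xyxy, tgt_xyxy)
    cost = 2.0 * cost_class + 5.0 * cost_bbox + 2.0 * cost_giou

    # Hungarian matching: one predicted box per ground-truth box.
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    pred_idx = torch.as_tensor(pred_idx, dtype=torch.long)
    tgt_idx = torch.as_tensor(tgt_idx, dtype=torch.long)

    # Classification: matched boxes target their label, all others background.
    target_classes = torch.zeros_like(pred_logits)
    target_classes[pred_idx, target_labels[tgt_idx]] = 1.0
    loss_class = F.binary_cross_entropy_with_logits(pred_logits, target_classes)

    # Box regression on matched pairs only.
    loss_bbox = F.l1_loss(pred_boxes[pred_idx], target_boxes[tgt_idx])
    loss_giou = (
        1 - torch.diag(generalized_box_iou(pred_xyxy[pred_idx], tgt_xyxy[tgt_idx]))
    ).mean()
    return loss_class + 5.0 * loss_bbox + 2.0 * loss_giou


processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Dummy image and targets, just to show the wiring.
from PIL import Image
image = Image.new("RGB", (768, 768))
inputs = processor(text=[["a cat", "a dog"]], images=image, return_tensors="pt")
outputs = model(**inputs)

target_labels = torch.tensor([0])                    # first text query ("a cat")
target_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.3]])  # normalized cxcywh
loss = detection_loss(
    outputs.logits[0], outputs.pred_boxes[0], target_labels, target_boxes
)
loss.backward()
```

A real implementation would also need to batch the targets, handle images with no objects, and would probably want to mirror the loss utilities already used by the DETR-family models in the library.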

Motivation

It would be really awesome to be able to train or fine-tune one of these existing open-vocabulary object detection models.

Your contribution

I may be able to help with this, but I'm not sure to what extent at present.

LysandreJik commented 2 months ago

cc @molbap @qubvel

qubvel commented 2 months ago

Hi @catalys1, thank you so much for your feature request! We agree that the ability to fine-tune Owl-vit/Owlv2 would be a great addition. If you have the time and are interested in contributing, we would love to collaborate with you on this! Your help would be greatly appreciated 🤗

This PR might also be helpful.

daniel-bogdoll commented 1 month ago

@qubvel Could https://github.com/huggingface/transformers/pull/34057 be the solution here? I would also be happy to see the ability to fine-tune (some) ZeroShotObjectDetection models.

qubvel commented 1 month ago

Hi @daniel-bogdoll, I'm not sure the linked PR solves this issue; no object detection loss was added there. cc @yonigozlan