Open catalys1 opened 2 months ago
cc @molbap @qubvel
Hi @catalys1, thank you so much for your feature request! We agree that the ability to fine-tune Owl-vit/Owlv2 would be a great addition. If you have the time and are interested in contributing, we would love to collaborate with you on this! Your help would be greatly appreciated 🤗
This PR might also be helpful
@qubvel Is https://github.com/huggingface/transformers/pull/34057 maybe the solution here? I would also be happy to see the ability to finetune (some) ZeroShotObjectDetection models.
Hi @daniel-bogdoll, I'm not sure the linked PR solves the issue, since no object detection loss is added there. cc @yonigozlan
Feature request
Currently the OWL-ViT models support inference and CLIP-style contrastive pre-training, but they don't provide a way to train (or fine-tune) the detection part of the model. According to the paper, detection training is similar to DETR.
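For context, here is a rough sketch of the bipartite matching step that a DETR-style detection loss is built on. It is only an illustration, not code from `transformers` or from the paper. The function names (`l1_box_cost`, `match_cost`, `bipartite_match`) and the `box_weight` value are hypothetical, the GIoU term is left out, and a brute-force search over permutations stands in for the Hungarian algorithm, so it is only practical for a handful of boxes.

```python
from itertools import permutations


def l1_box_cost(pred_box, gt_box):
    # L1 distance between two boxes given as (cx, cy, w, h) in [0, 1].
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_cost(pred_probs, pred_box, gt_label, gt_box, box_weight=5.0):
    # Class cost is the negative predicted probability of the target class;
    # box cost is a weighted L1 distance. box_weight is an illustrative value.
    return -pred_probs[gt_label] + box_weight * l1_box_cost(pred_box, gt_box)


def bipartite_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    # Find the one-to-one assignment of predictions to ground-truth boxes
    # with the lowest total cost. Brute force replaces the Hungarian
    # algorithm here to keep the sketch dependency-free.
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    if n_pred < n_gt:
        raise ValueError("need at least as many predictions as targets")
    best_cost, best_assignment = float("inf"), ()
    for pred_indices in permutations(range(n_pred), n_gt):
        cost = sum(
            match_cost(pred_probs[p], pred_boxes[p], gt_labels[g], gt_boxes[g])
            for g, p in enumerate(pred_indices)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, pred_indices
    # Return (prediction index, ground-truth index) pairs, sorted by prediction.
    return sorted((p, g) for g, p in enumerate(best_assignment))


# Toy example: three predictions, two ground-truth objects.
pred_probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
pred_boxes = [[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1], [0.9, 0.9, 0.1, 0.1]]
gt_labels = [0, 1]
gt_boxes = [[0.1, 0.1, 0.1, 0.1], [0.5, 0.5, 0.2, 0.2]]
matches = bipartite_match(pred_probs, pred_boxes, gt_labels, gt_boxes)
print(matches)
```

Matched predictions would then get classification and box regression losses against their targets, and unmatched predictions would be trained as "no object". A real implementation would presumably work on batched tensors and use a proper assignment solver.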
Motivation
It would be really awesome to be able to train or fine-tune one of these already-existing open-vocabulary object detection models.
Your contribution
I may be able to help with this, but I'm not sure at present.