[OWL ViT] - Potential use as a traditional object detector (no queries)

stevebottos commented 1 year ago

I've been doing some of my own experimentation with fine-tuning OWL ViT for standard object detection tasks as opposed to a zero/one shot query-based task with pretty good results. I'm just taking the box predictor + vision transformer as-is, which isn't touched during fine-tuning at all, and training a small MLP class predictor on the 768x576 image features that come out of the transformer layer (the same input that is passed into the box predictor). I'm using cross entropy with hard negative mining and scaling as the loss function for the class prediction head.

The intuition is that since this is a massively pre-trained model it would take very little to tune to an object detection task. This actually has proven to be a solid approach with < 30 examples yielding decent enough precision for tasks like auto-labeling and whatnot, where a human is able to label the missing detections. Where I'm hitting snags though are as follows:

More data doesn't improve performance. I get the same results with < 30 examples as I would with 100+ in terms of class confidence. Sometime even worse. My guess is that since the bounding box predictor isn't being trained, and since in order to obtain the embedding at a position in the image to make a class prediction we're reliant on the bounding boxes to be accurate (take the highest iou between the predicted bounding box and the ground truth box, use that index to obtain the class embedding) the class predictions can only be so accurate, which leads to the next issue
Simply unfreezing the box predictor and computing smooth_l1 between predicted and truth bounding boxes actually hurts performance, slightly. I'm more familiar with computing box loss in, say, an SSD where you have static anchor boxes - but there are no such boxes for this model (I believe we bias toward the center of the patch that the box was computed from).

To summarize, I'm wondering if it's possible at all to modify the model such that it might reach close to SOTA performance in detection tasks with much less data and far shorter training times due to the large amount of data seen during pre-training, or if it's more likely that fully convolutional models still have an edge over transformers in purely detection-based tasks. If not, any ideas for how I might modify the model to align with this new-ish objective?

MarioJarrinV commented 1 year ago

please can you help me, I need to fine tune Owl Vit but i can´t fine any clear tutorial about it

EY4L commented 1 year ago

Hi @stevebottos a little documentation for you're code would be great to be able to understand how to fine tune the model

stevebottos commented 1 year ago

Hey @MarioJarrinV @EY4L , I'll be getting back to work on my repo starting today if you'd like to check back from time to time for updates. I'll get some setup scripts in there as well.

Update: @MarioJarrinV @EY4L check the repo now, I've left some docs

EY4L commented 1 year ago

@stevebottos Hi, thanks so much for sharing your code and the documentation. I managed to train the model.

Would be great to add a section for doing inference on the trained model.

I've started to work on this and did some refactoring so hopefully I can share my work with you to bring some benefit!

Thank you

bgoldfe2 commented 1 year ago

@EY4L @stevebottos I am having a tough time figuring out how to fine tune. It may be how I get a data set into COCO format and then into TFDS format, I could be doing that wrong. Can you provide links to any documentation or code that can help? I dowloaded COCO dataset and am trying to start from there just to understand how to get anything to work with fine-tuning on owl-vit. Thanks for any help.

bgoldfe2 commented 1 year ago

@stevebottos @EY4L I had some configuration issues that I fixed. I am training now, so looks good. @stevebottos great documentation and code! Thanks.

stevebottos commented 1 year ago

@bgoldfe2 if you'd like feel free to open up an issue on my repo and I'd be happy to help. I'm away from my computer until next week but after that I'll be happy to provide assistance.

By the way, @EY4L and Bruce, I've been facing some strange issues in which more data seems to actually hurt performance, and best results seem to come after 2-5 epochs with the accuracy plateauing after that. I plan to investigate later, it doesn't seem like over fitting, my guess is that the large pretraining might have had the model converge at some minima and re-initialization of some weights might help but I'm not sure. I plan to investigate later but if you guys experience the same behavior I'd like to hear about your findings!

EY4L commented 1 year ago

@stevebottos Thanks for sharing.

I will update, currently focusing on generating inference output so I have a complete pipeline for object detection.

Then will go back to improve accuracy, fine tune, etc...

Thanks

mjlm commented 1 year ago

@stevebottos thank you for sharing your ideas and code. In our experience, fine-tuning OWL-ViT without any modifications, end-to-end, works pretty well also in the closed-vocab and few-shot setting (10s to 100s of examples per class).

Some thoughts:

Adapting the OWL-ViT architecture for closed-vocabulary detection: It may be preferable to get rid of the text encoder for efficiency, but in terms of detection performance it doesn't seem to hurt even for closed-vocabulary applications. Simply use the class names as queries. No need to change the losses from the original code.
Freezing the backbone: Generally fine-tuning end-to-end performs better than only tuning the heads even in the fewshot regime. If you have little data, you might want to try very short training schedules, e.g. hundreds or thousands of steps.
OWLv2 checkpoints are now available, they have been pretrained for longer and may fine-tune better than OWLv1.

Aki1991 commented 3 months ago

Hi all,

While using the fine tune steps mentioned here, I am getting error Did not find decoder for lvis:1.3.0. Please specify decoders for all datasets in DECODERS. Can anyone help me how I can specify decoder for any dataset.

hansa15100 commented 2 months ago

same error I am also getting Did not find decoder for lvis:1.3.0. Please specify decoders for all datasets in DECODERS

Aki1991 commented 2 months ago

Hi @hansa15100, I was finally able to solve the error.

At scenic/scenic/projects/owl_vit/preprocessing/input_pipeline.py line no 65, there is a list of DECODERS. You have to change the version number of livs to 1.3.0. That's it. If you want to use different dataset, you just have to add the name of your dataset and which decoder you want to use. You have to make some changes as shown here.

google-research / scenic

[OWL ViT] - Potential use as a traditional object detector (no queries) #714