google-research / scenic
Scenic: A Jax Library for Computer Vision Research and Beyond

Adding OWL-ViT to HuggingFace Transformers #413

Open alaradirik opened 2 years ago

alaradirik commented 2 years ago

Hi,

I've implemented OWL-ViT in a fork of 🤗 HuggingFace Transformers, and we are planning to add it to the library soon (see https://github.com/huggingface/transformers/pull/17938). Here's a notebook that illustrates inference with it: https://colab.research.google.com/drive/1IMPWZcnlMy-tdnTDrUcOZU3oiGg-hTem?usp=sharing

I really like the simplicity of OWL-ViT, and there are so many potential use cases for open-vocabulary object detection, especially within the robotics community, so we are all excited to add it to Transformers.

As you may or may not know, each model on the HuggingFace hub has its own Git repository. For example, the OWL-ViT-base-patch32 checkpoint can be found here. If you check the "files and versions" tab, you can find the converted weights of the model. The model hub uses Git LFS (Large File Storage) so that Git can handle large files such as model weights. This means that any model has its own Git commit history!

A model card can also be added to the repo, which is just a README.
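For reference, once the converted weights live in a hub repository, loading them is a one-liner. A minimal sketch, using the google/owlvit-base-patch32 repo name that appears later in this thread:

from transformers import OwlViTProcessor, OwlViTForObjectDetection

# from_pretrained downloads the Git-LFS-tracked weights from the model's hub repository
# and caches them locally for subsequent runs
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")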

If you haven't done so, would you be interested in joining the Google organisation on the hub, such that we can store all model checkpoints there (rather than under my username)?

Let me know!

Kind regards,

Alara
ML Engineer @ HuggingFace

AlexeyG commented 2 years ago

Hi Alara,

Many thanks for doing the work of implementing OWL-ViT in HuggingFace. This is really cool and an exciting thing for us.

I joined the Google AI org on Hugging Face hub. Do I need to take any additional steps on my side at this stage?

Thank you, Alexey

alaradirik commented 2 years ago

Great, thank you Alexey! No need to take any additional steps, we will transfer the repo ownership to the Google AI org shortly.

Best, Alara

NielsRogge commented 2 years ago

The model is now available here: https://huggingface.co/docs/transformers/model_doc/owlvit.

We shared it on Twitter and LinkedIn and people seem to really like it :D

i2mironov commented 2 years ago

Is there any chance you'll implement the one-shot object detection part of the model in the Hugging Face version as well?

alaradirik commented 2 years ago

Hi @i2mironov, yes, we are currently working on it and you can expect to see it as a part of the HF OWL-ViT model very soon.

Hope this helps :) Alara
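For later readers: the image-guided (one-shot) detection API that eventually landed in transformers looks roughly like this. A minimal sketch based on the current docs; it requires a transformers version that includes image_guided_detection, and the query image here simply reuses the thread's COCO image for illustration (in practice it would be a separate image of the object you want to find):

import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)        # image to search in
query_image = Image.open(requests.get(url, stream=True).raw)  # image containing the query object

# No text queries here: the query image takes the place of the text prompts
inputs = processor(images=image, query_images=query_image, return_tensors="pt")

with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
print(results[0]["boxes"], results[0]["scores"])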

chtlp commented 2 years ago

Hi, great work on the HF model! We are looking forward to the one-shot object detection function as well, hopefully with a Gradio demo :)

tejas-gokhale commented 2 years ago

@alaradirik I'm trying to run OWL-ViT object detection on GPU. However, this results in the error:

Traceback (most recent call last):
  File "debug.py", line 16, in <module>
    outputs = model(**inputs)
  File "/data_2/data/tgokhale/tg_hf_tf_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data_2/data/tgokhale/tg_hf_tf_py38/lib/python3.8/site-packages/transformers/models/owlvit/modeling_owlvit.py", line 1373, in forward
    pred_boxes = self.box_predictor(image_feats, feature_map)
  File "/data_2/data/tgokhale/tg_hf_tf_py38/lib/python3.8/site-packages/transformers/models/owlvit/modeling_owlvit.py", line 1223, in box_predictor
    pred_boxes += self.compute_box_bias(feature_map)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

see the code below:

import requests
from PIL import Image
import torch

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model = model.to('cuda')

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
inputs = inputs.to('cuda')
outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

score_threshold = 0.1
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    if score >= score_threshold:
        print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")

Could you take a look? FYI, I am using Python 3.8.13 and transformers 4.21.3.

alaradirik commented 2 years ago

Hi @tejas-gokhale, we fixed this bug a few weeks ago with this PR, but the changes haven't been released to PyPI yet.

The next release is actually scheduled for today, so you can either update the package tomorrow or install transformers from source: pip install git+https://github.com/huggingface/transformers.git
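After installing from source, you can check that you picked up the fix by looking at the version string (a ".dev0" suffix indicates a source install rather than a PyPI release):

import transformers

# A dev-suffixed version (e.g. 4.23.0.dev0) means you are running a source install
print(transformers.__version__)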

tejas-gokhale commented 2 years ago

Thanks @alaradirik -- I can confirm that installing transformers from source (4.23.0.dev0) works!

zzh-tech commented 2 years ago

Great implementation! I'm looking forward to the one-shot object detection function too!

timothylimyl commented 1 year ago

> outputs = model(**inputs)
> 
> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
> target_sizes = torch.Tensor([image.size[::-1]])
> # Convert outputs (bounding boxes and class logits) to COCO API
> results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

@tejas-gokhale, how did your code manage to run? The outputs have not been moved back to the CPU.

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
    results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

I have already updated to transformers 4.24.0.
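One way to avoid this mismatch is to create target_sizes on the same device as the model outputs. A minimal sketch, assuming a CUDA device and continuing the snippet above (alternatively, you could move the outputs back to the CPU before post-processing):

# Keep post-processing inputs on the same device as the model outputs
target_sizes = torch.tensor([image.size[::-1]], device="cuda")
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)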

TracyYXChen commented 1 year ago

@alaradirik Hi Alara, thanks for the HuggingFace implementation! I noticed that it increases RAM consumption by about 0.6 GB per image, so if I run it on many images, e.g. 100 cat images, it will take 60 GB of RAM; any idea how to reduce the RAM usage? Thanks!

For example, if you run this on a laptop, the process will be killed soon:

import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
numImages = 100
urls = ["http://images.cocodataset.org/val2017/000000039769.jpg"] * numImages
cnt = 0
for url in urls:
    cnt += 1
    print(cnt)
    image = Image.open(requests.get(url, stream=True).raw)
    texts = [["a photo of a cat", "a photo of a dog"]]
    inputs = processor(text=texts, images=image, return_tensors="pt")
    outputs = model(**inputs)

    # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
    target_sizes = torch.Tensor([image.size[::-1]])
    # Convert outputs (bounding boxes and class logits) to COCO API
    results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

    i = 0  # Retrieve predictions for the first image for the corresponding text queries
    text = texts[i]
    boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

    score_threshold = 0.1
    for box, score, label in zip(boxes, scores, labels):
        box = [round(i, 2) for i in box.tolist()]
        if score >= score_threshold:
            print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")

alaradirik commented 1 year ago

Hi @TracyYXChen, happy to hear from people using it!

Could you try setting the model to evaluation mode and wrapping the forward pass in torch.no_grad()? Without it, each forward pass keeps its autograd graph in memory, which is why memory grows with every image.

model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model.eval()

with torch.no_grad():
    outputs = model(**inputs)

TracyYXChen commented 1 year ago

> Hi @TracyYXChen, happy to hear from people using it!
>
> Could you try setting the model to evaluation mode and wrapping the forward pass in torch.no_grad()?
>
> model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
> model.eval()
>
> with torch.no_grad():
>     outputs = model(**inputs)

Hi @alaradirik, thanks for your prompt reply! It works!

SeungyounShin commented 1 year ago

Do we have training code for the HF OWL-ViT model?

sheethalb commented 1 year ago

Hi, how can I use the OwlViTModel to fine-tune only the detection part on my custom dataset? Can I just do the following?

model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model.train()
...

Do we need the input data in the COCO API format before we pass it to the model for training?
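A minimal sketch of one way to restrict training to the detection part: freeze the CLIP backbone and leave only the heads trainable. The submodule name "owlvit." is an assumption based on the current transformers implementation, so double-check it against your version; also note that, as far as I can tell, OwlViTForObjectDetection does not compute a detection loss for you, so you would still need to implement one (e.g. a DETR-style matching loss) yourself:

import torch
from transformers import OwlViTForObjectDetection

model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Freeze the CLIP backbone so only the detection heads stay trainable
# (the "owlvit." prefix for the backbone submodule is an assumption; verify with model.named_parameters())
for name, param in model.named_parameters():
    if name.startswith("owlvit."):
        param.requires_grad = False

model.train()
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)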

alex-bene commented 1 year ago

Hello guys. I've been using OWL-ViT v1 for a while and saw the new v2 release. Since the OWL-ViT v2 checkpoints are drop-in replacements for v1, is it possible to see the v2 checkpoints uploaded to Hugging Face in the next few days?

NielsRogge commented 1 year ago

Hi folks! OWLv2 is now available: https://huggingface.co/models?other=owlv2
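Loading the v2 checkpoints mirrors the v1 API. A minimal sketch; the checkpoint name is one of the repos listed at the link above:

from transformers import Owlv2Processor, Owlv2ForObjectDetection

# Same usage pattern as OwlViTProcessor / OwlViTForObjectDetection, just with the v2 classes and checkpoints
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")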

mjlm commented 1 year ago

Thanks Niels!

phamnhuvu-dev commented 9 months ago

> Do we have training code for the HF OWL-ViT model?

https://github.com/alaradirik/transformers/blob/main/tests/models/owlvit/test_modeling_owlvit.py#L313-L320