alaradirik opened this issue 2 years ago
Hi Alara,
Many thanks for doing the work of implementing OWL-ViT in HuggingFace. This is really cool and an exciting thing for us.
I joined the Google AI org on Hugging Face hub. Do I need to take any additional steps on my side at this stage?
Thank you, Alexey
Great, thank you Alexey! No need to take any additional steps; we will transfer the repo ownership to the Google AI org shortly.
Best, Alara
The model is now available here: https://huggingface.co/docs/transformers/model_doc/owlvit.
We shared it on Twitter and LinkedIn and people seem to really like it :D
Is there any chance you'll implement the one-shot object detection part of the model in the Hugging Face version as well?
Hi @i2mironov, yes, we are currently working on it and you can expect to see it as a part of the HF OWL-ViT model very soon.
Hope this helps :) Alara
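For anyone finding this thread later, here is a minimal sketch of the image-guided (one-shot) detection API that later shipped in transformers; the method and argument names below assume a recent transformers release, and the query image URL is only an example:

import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model.eval()

# Image to search in, and a query image showing the object of interest
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
query_image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000001675.jpg", stream=True).raw)

# The processor returns pixel_values for the target image and query_pixel_values for the query image
inputs = processor(images=image, query_images=query_image, return_tensors="pt")

with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

# Rescale and filter the predicted boxes for the target image
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
for box, score in zip(results[0]["boxes"], results[0]["scores"]):
    print(f"Found a match with confidence {round(score.item(), 3)} at {[round(v, 2) for v in box.tolist()]}")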
Hi, great work on the HF model! We are looking forward to the one-shot object detection function as well, hopefully with a Gradio demo :)
@alaradirik I'm trying to run OWL-ViT object detection on GPU. However, this results in the error:
Traceback (most recent call last):
  File "debug.py", line 16, in <module>
    outputs = model(**inputs)
  File "/data_2/data/tgokhale/tg_hf_tf_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data_2/data/tgokhale/tg_hf_tf_py38/lib/python3.8/site-packages/transformers/models/owlvit/modeling_owlvit.py", line 1373, in forward
    pred_boxes = self.box_predictor(image_feats, feature_map)
  File "/data_2/data/tgokhale/tg_hf_tf_py38/lib/python3.8/site-packages/transformers/models/owlvit/modeling_owlvit.py", line 1223, in box_predictor
    pred_boxes += self.compute_box_bias(feature_map)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
See the code below:
import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model = model.to('cuda')
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
inputs = inputs.to('cuda')
outputs = model(**inputs)
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
i = 0 # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
score_threshold = 0.1
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    if score >= score_threshold:
        print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
Could you take a look? FYI, I am using Python 3.8.13 and transformers 4.21.3.
Hi @tejas-gokhale, we fixed this bug a few weeks ago with this PR, but the changes are not in a released PyPI package yet.
The next release is actually scheduled for today, so you can either update the package tomorrow or install transformers from source:
pip install git+https://github.com/huggingface/transformers.git
Thanks @alaradirik -- I can confirm that installing transformers from source (4.23.0.dev0) works!
Great implementation! I'm looking forward to the one-shot object detection function too!
> outputs = model(**inputs)
>
> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
> target_sizes = torch.Tensor([image.size[::-1]])
> # Convert outputs (bounding boxes and class logits) to COCO API
> results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
@tejas-gokhale, how did your code manage to run? The outputs have not been moved back to the CPU.
I still get
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
at the line
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
even though I have already updated to transformers 4.24.0.
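A workaround until the post-processing handles mixed devices: make sure the tensors passed to post_process live on the same device as the model outputs. A minimal sketch, reusing the variable names from the snippet above:

# Put target_sizes on the GPU so it matches outputs.pred_boxes
target_sizes = torch.tensor([image.size[::-1]], device="cuda")
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

# The returned tensors are then on the GPU; move them to the CPU before further use
boxes = results[0]["boxes"].cpu()
scores = results[0]["scores"].cpu()
labels = results[0]["labels"].cpu()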
@alaradirik Hi Alara, thanks for the HuggingFace implementation! I noticed that RAM consumption grows by about 0.6 GB per image, so if I run it on many images, e.g. 100 cat images, it ends up needing around 60 GB of RAM. Any idea how to reduce the memory usage? Thanks!
For example, if you run this on a laptop, the process will be killed soon:
import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
numImages = 100
urls = ["http://images.cocodataset.org/val2017/000000039769.jpg"] * numImages
cnt = 0
for url in urls:
    cnt += 1
    print(cnt)
    image = Image.open(requests.get(url, stream=True).raw)
    texts = [["a photo of a cat", "a photo of a dog"]]
    inputs = processor(text=texts, images=image, return_tensors="pt")
    outputs = model(**inputs)
    # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
    target_sizes = torch.Tensor([image.size[::-1]])
    # Convert outputs (bounding boxes and class logits) to COCO API
    results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
    i = 0  # Retrieve predictions for the first image for the corresponding text queries
    text = texts[i]
    boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
    score_threshold = 0.1
    for box, score, label in zip(boxes, scores, labels):
        box = [round(i, 2) for i in box.tolist()]
        if score >= score_threshold:
            print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
Hi @TracyYXChen, happy to hear from people using it!
Could you try setting the model to evaluation mode and wrapping inference in torch.no_grad()?
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
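If memory or throughput is still an issue, the processor also accepts a batch of images per call (with one list of text queries per image), so you can run the forward pass and post-processing on several images at once. A rough sketch, assuming the same queries for every image:

import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
images = [Image.open(requests.get(url, stream=True).raw) for _ in range(4)]
texts = [["a photo of a cat", "a photo of a dog"]] * len(images)  # one query list per image

inputs = processor(text=texts, images=images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.Tensor([img.size[::-1] for img in images])
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
for i, result in enumerate(results):
    for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
        if score >= 0.1:
            print(f"Image {i}: {texts[i][label]} ({round(score.item(), 3)}) at {[round(v, 2) for v in box.tolist()]}")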
Hi @alaradirik, thanks for your prompt reply! It works!
Do we have training code for the HF OWL-ViT?
Hi, how can I use the OwlViTModel to fine-tune only the detection part on my custom dataset? Can I just do the following?
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model.train()
..
..
Do we need the input data in the COCO API format before we pass it to the model for training?
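As far as I know, transformers does not ship a training script for OWL-ViT, and OwlViTForObjectDetection does not return a loss, so you would need to write your own criterion (typically a DETR-style Hungarian-matching loss over the predicted boxes and class logits); the data does not have to be in COCO format, only in whatever format your criterion expects. Below is a rough sketch of freezing the backbone and training just the detection heads; the attribute names are taken from modeling_owlvit.py, and the L1 loss on a dummy target is only a stand-in for a real matching loss:

import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Freeze the CLIP-style backbone; keep only the detection heads trainable
for param in model.owlvit.parameters():
    param.requires_grad = False
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters (class_head + box_head)")

optimizer = torch.optim.AdamW(trainable, lr=1e-5)
model.train()

# One toy step: logits have shape [batch, num_patches, num_queries] and pred_boxes
# [batch, num_patches, 4] in normalized (cx, cy, w, h) format. A real criterion would
# match predictions to ground-truth boxes (Hungarian matching, as in DETR) -- that part
# is not provided by transformers and is only stubbed here with a dummy target.
inputs = processor(text=[["a photo of a cat"]], images=Image.new("RGB", (640, 480)), return_tensors="pt")
outputs = model(**inputs)
dummy_target = torch.tensor([[0.5, 0.5, 0.2, 0.2]])  # placeholder ground-truth box
loss = torch.nn.functional.l1_loss(outputs.pred_boxes[0, :1], dummy_target)  # stand-in loss only
loss.backward()
optimizer.step()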
Hello guys. I've been using OWL-ViT v1 for a while and saw the new v2 release. Since the OWL-ViT v2 checkpoints are drop-in replacements for v1, is it possible to see the v2 checkpoints uploaded to the Hugging Face hub in the next few days?
Hi folks! OWLv2 is now available: https://huggingface.co/models?other=owlv2
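For reference, a minimal usage sketch with the new classes (assuming a transformers version that includes OWLv2; the checkpoint name is one of those listed on the hub):

import requests
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
model.eval()

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)
for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
    print(f"Detected {texts[0][label]} with confidence {round(score.item(), 3)} at {[round(v, 2) for v in box.tolist()]}")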
Thanks Niels!
Do we have training code for the HF OWL-ViT?
Hi,
I've implemented OWL-ViT as a fork of 🤗 HuggingFace Transformers, and we are planning to add it to the library soon (see https://github.com/huggingface/transformers/pull/17938). Here's a notebook that illustrates inference with it: https://colab.research.google.com/drive/1IMPWZcnlMy-tdnTDrUcOZU3oiGg-hTem?usp=sharing
I really like the simplicity of OWL-ViT, and there are so many potential use cases for open-vocabulary object detection, especially within the robotics community, so we are all excited to add it to transformers.
As you may or may not know, each model on the HuggingFace hub has its own Git repository. For example, the OWL-ViT-base-patch32 checkpoint can be found here. If you check the "files and versions" tab, you can find the converted weights of the model. The model hub uses Git LFS (large file storage) to handle large files such as model weights with Git. This means that every model has its own Git commit history!
A model card can also be added to the repo, which is just a README.
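If you just want to pull the raw files (weights, config, model card) rather than load the model, here is a small sketch using the huggingface_hub client; the repo id below is the one used elsewhere in this thread:

from huggingface_hub import snapshot_download

# Downloads the whole model repo to the local cache and returns the local path
local_dir = snapshot_download(repo_id="google/owlvit-base-patch32")
print(local_dir)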
If you haven't done so already, would you be interested in joining the Google organisation on the hub, so that we can store all model checkpoints there (rather than under my username)?
Let me know!
Kind regards,
Alara
ML Engineer @ HuggingFace