google-research / scenic

Scenic: A Jax Library for Computer Vision Research and Beyond
Apache License 2.0

For OWL-ViT, is there a demo that shows how to use image patches as queries for one-shot detection? #325

Closed Edwardmark closed 2 years ago

Edwardmark commented 2 years ago

Hi, thanks for your great work. The text zero-shot demo is amazing. For OWL-ViT, is there a demo that shows how to use image patches as queries for one-shot detection? Thanks.

mjlm commented 2 years ago

Hi, we're actively working on this demo and will let you know when it's available, hopefully some time next week.

Edwardmark commented 2 years ago

@mjlm Also, what prompts are used in the COCO evaluation? The paper says it uses the seven best prompts, so what are those seven text prompts? Thanks.

AlexeyG commented 2 years ago

The prompts can be found in the CLIP repository. During inference, we used the 7 ensembling prompts from the Colab.
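
For reference, the seven ensembling templates from the CLIP prompt-engineering Colab are reproduced below (copied from the openai/CLIP repository; please verify against that notebook before relying on them):

```python
# The 7 "best" prompt templates from the CLIP prompt-engineering notebook;
# `{}` is replaced by the class name.
CLIP_SEVEN_BEST_TEMPLATES = [
    'itap of a {}.',
    'a bad photo of the {}.',
    'a origami {}.',
    'a photo of the large {}.',
    'a {} in a video game.',
    'art of the {}.',
    'a photo of the small {}.',
]

# Example: text queries for the class "zebra".
zebra_queries = [t.format('zebra') for t in CLIP_SEVEN_BEST_TEMPLATES]
```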

stevebottos commented 2 years ago

Is this still in the works? I've also been interested in seeing how image queries could be used.

xishanhan commented 2 years ago

> Hi, we're actively working on this demo and will let you know when it's available, hopefully some time next week.

Hi, is this one-shot detection demo finished? I'm also very interested in it and would like to try it.

mjlm commented 2 years ago

We're still working on this and will let you know here when the demo is ready. I re-opened the issue to keep track.

xishanhan commented 2 years ago

> We're still working on this and will let you know here when the demo is ready. I re-opened the issue to keep track.

That would be very nice, thank you!

mjlm commented 2 years ago

We just added a Playground Colab with an interactive demo of both text-conditioned and image-conditioned detection:

- OWL-ViT text inference demo
- OWL-ViT image inference demo

The underlying code illustrates how to extract an embedding for a given image patch, specifically here: https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/inference.py#L110-L131
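
In case the link is hard to follow, the idea behind that snippet is roughly the following (a simplified sketch, not the actual inference.py code; the function and argument names here are illustrative): run the model on the query image, then pick the predicted box that best overlaps the user-specified query box and reuse its class embedding as the image-conditioned query.

```python
import jax.numpy as jnp

def box_iou(query_box, pred_boxes):
  """IoU between one [x0, y0, x1, y1] box and an array of such boxes."""
  x0 = jnp.maximum(query_box[0], pred_boxes[:, 0])
  y0 = jnp.maximum(query_box[1], pred_boxes[:, 1])
  x1 = jnp.minimum(query_box[2], pred_boxes[:, 2])
  y1 = jnp.minimum(query_box[3], pred_boxes[:, 3])
  inter = jnp.clip(x1 - x0, 0) * jnp.clip(y1 - y0, 0)
  area_q = (query_box[2] - query_box[0]) * (query_box[3] - query_box[1])
  area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
  return inter / (area_q + area_p - inter)

def get_image_query_embedding(class_embeddings, pred_boxes, query_box):
  """Selects the class embedding whose predicted box best matches `query_box`.

  Args:
    class_embeddings: [num_tokens, dim] per-token class embeddings of the
      query image.
    pred_boxes: [num_tokens, 4] predicted boxes for the same tokens.
    query_box: [4] user-drawn box around the object of interest.

  Returns:
    [dim] embedding to use as the query for image-conditioned detection.
  """
  ious = box_iou(jnp.asarray(query_box), jnp.asarray(pred_boxes))
  return class_embeddings[jnp.argmax(ious)]
```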

Let us know if you have any questions!

xishanhan commented 2 years ago

> We just added a Playground Colab with an interactive demo of both text-conditioned and image-conditioned detection:
>
> - OWL-ViT text inference demo
> - OWL-ViT image inference demo
>
> The underlying code illustrates how to extract an embedding for a given image patch, specifically here: https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/inference.py#L110-L131
>
> Let us know if you have any questions!

Thanks for your reply! I don't have any problems now.

BIGBALLON commented 1 year ago

Hi, @mjlm , thanks for your great work!

I wonder if there are any plans to implement multi-query image-conditioned detection.

A single query image often cannot capture all the features of an object, and using multiple query images to represent it can yield better results.

Thanks again!

mjlm commented 1 year ago

You can simply average the embeddings of multiple boxes to get a query embedding. This is how we implemented few-shot (i.e. more than one-shot) detection in the paper.

#890 will add example code for image-conditioned detection to the Colab. The example shows how to get a `query_embedding` from the `class_embeddings` of the source (query) image. If you have e.g. two query embeddings representing the same object, you can simply compute `two_shot_query_embedding = (query_embedding_1 + query_embedding_2) / 2`. This simple method worked for us. Another option would be to keep the embeddings separate but map them to the same class after classification.
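
A minimal sketch of that averaging step (the helper name is illustrative; the inputs are assumed to be per-box query embeddings obtained as above):

```python
import jax.numpy as jnp

def combine_query_embeddings(query_embeddings):
  """Averages several query embeddings of the same object into one.

  Args:
    query_embeddings: list of [dim] arrays, one per query box.

  Returns:
    [dim] array to use as the few-shot query embedding.
  """
  return jnp.mean(jnp.stack(query_embeddings), axis=0)

# Two-shot example, equivalent to (query_embedding_1 + query_embedding_2) / 2:
# two_shot_query_embedding = combine_query_embeddings(
#     [query_embedding_1, query_embedding_2])
```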