Hi, we're actively working on this demo and will let you know when it's available, hopefully some time next week.
@mjlm Also, which prompts are used in the COCO evaluation? The paper says it uses the seven best prompts, so what are those seven text prompts? Thanks.
The prompts can be found in the CLIP repository. During inference, we used the 7 ensembling prompts from the Colab.
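In case it helps, here is a minimal sketch of how prompt ensembling like this typically works: each class name is inserted into several templates, the resulting text embeddings are normalized and averaged, and the average is used as the class query. The template list and the encode_text function below are illustrative placeholders, not the exact ones from the CLIP repository.

```python
import numpy as np

# Illustrative templates only; the actual 7 ensembling prompts are listed
# in the CLIP repository / Colab referenced above.
TEMPLATES = [
    "a photo of a {}.",
    "a photo of the large {}.",
    "a photo of the small {}.",
]

def ensemble_text_query(class_name, encode_text):
    """Average the normalized text embeddings over all prompt templates.

    `encode_text` is a placeholder for whatever text encoder you use
    (e.g. the OWL-ViT text tower); it should map a string to a 1-D embedding.
    """
    embeddings = []
    for template in TEMPLATES:
        emb = np.asarray(encode_text(template.format(class_name)))
        embeddings.append(emb / np.linalg.norm(emb))  # normalize each prompt embedding
    query = np.mean(embeddings, axis=0)               # ensemble by simple averaging
    return query / np.linalg.norm(query)              # re-normalize the ensembled query
```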
Is this still in the works? I've been interested in seeing how image input queries could be used as well.
Hi, we're actively working on this demo and will let you know when it's available, hopefully some time next week.
Hi, is this one-shot detection demo finished? I'm also very interested in it and would like to try it.
We're still working on this and will let you know here when the demo is ready. I re-opened the issue to keep track.
That would be very nice, thank you!
We just added a Playground Colab with an interactive demo of both text-conditioned and image-conditioned detection:
The underlying code illustrates how to extract an embedding for a given image patch, specifically here: https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/inference.py#L110-L131
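In case it helps to see the idea in isolation, here is a rough, self-contained sketch of what that patch-embedding extraction amounts to: pick the predicted box that best matches the user-drawn query box, take its class embedding as the query, and score target-image boxes by similarity to it. The pred_boxes / class_embeddings arrays and the IoU-based selection heuristic are simplified placeholders; see the linked inference.py for the actual implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x0, y0, x1, y1) format."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def get_query_embedding(pred_boxes, class_embeddings, query_box):
    """Select the predicted box that best overlaps the user-drawn query box
    and use its class embedding as the image-conditioned query."""
    best = int(np.argmax([iou(b, query_box) for b in pred_boxes]))
    emb = class_embeddings[best]
    return emb / np.linalg.norm(emb)

def score_target_boxes(target_class_embeddings, query_embedding):
    """Cosine similarity between each target-image box embedding and the query."""
    embs = target_class_embeddings / np.linalg.norm(
        target_class_embeddings, axis=-1, keepdims=True)
    return embs @ query_embedding
```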
Let us know if you have any questions!
Thanks for your reply! I don't have any further questions now.
Hi @mjlm, thanks for your great work!
I wonder if there are any plans to implement multi-query image-conditioned detection.
A single query image is often unable to capture all the features of an object, and using multiple query images to represent it can yield better results.
Thanks again!
You can simply average the embeddings of multiple boxes to get a query embedding. This is how we implemented few-shot (i.e. more than one-shot) detection in the paper.
In the Colab, the query_embedding is extracted from the class_embeddings of the source (query) image. If you have e.g. two query embeddings representing the same object, you can simply do two_shot_query_embedding = (query_embedding_1 + query_embedding_2) / 2. This simple method worked for us. Another option would be to keep the embeddings separate, but map them to the same class after classification.
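To make the averaging concrete, here is a small sketch that generalizes it to N query embeddings. The normalization steps are an assumption on my part; adjust them to match how you compute similarities downstream.

```python
import numpy as np

def few_shot_query_embedding(query_embeddings):
    """Average several per-box query embeddings into a single few-shot query.

    `query_embeddings` is a list or array of shape (n_queries, embed_dim),
    e.g. the class_embeddings of the boxes selected in each query image.
    """
    embs = np.asarray(query_embeddings)
    embs = embs / np.linalg.norm(embs, axis=-1, keepdims=True)  # normalize each query
    query = embs.mean(axis=0)                                   # simple average
    return query / np.linalg.norm(query)                        # re-normalize the result

# Example with two queries (equivalent to the two-shot formula above):
# two_shot = few_shot_query_embedding([query_embedding_1, query_embedding_2])
```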
Hi, thanks for your great work; the text zero-shot demo is amazing. For OWL-ViT, is there a demo that shows how to use an image patch as a query to do one-shot detection? Thanks.