Using this for finding top k most similar images in the clusters.

zaiyan-alam commented 1 month ago

I have a large unlabeled image dataset. My application is: given an input image, find the k most similar images after clusters have been formed on the dataset. Could you please let me know how to do this? I work at FAANG and want to experiment with it for image-to-image similarity in a project. I went through the evaluation.py and run_turtle.py scripts, but I am unsure how to do this for a single input image. Thanks!

agadetsky commented 1 month ago

Dear @zaiyan-alam

If I understand your problem correctly, you want to find the top-k most similar images within the cluster that the test sample is assigned to. First, you need to train the task encoder on the given dataset. If you don't have ground truth labels, you can check https://github.com/mlbio-epfl/turtle/issues/1. Next, assuming you have TURTLE 1-space, i.e., only a single embedding space was used to train TURTLE, you can build one similarity-search index per cluster using https://github.com/facebookresearch/faiss or a similar efficient library. Once the indices are built, you have two things: an index for each cluster containing the training samples assigned to it, and a task encoder that can assign a test sample to a particular cluster. Combining the two, you first assign the test sample to a cluster and then query that cluster's index to find the top-k similar images within it.
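The two steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not code from the TURTLE repository: it uses a brute-force cosine-similarity search from the Python standard library in place of a faiss index, and the names `build_cluster_indices`, `top_k_in_cluster`, `embeddings`, and `cluster_ids` are hypothetical. It assumes you already have one embedding per training image and the cluster each image was assigned to by the trained task encoder, and that the same task encoder has predicted `query_cluster` for the test image.

```python
import heapq
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_cluster_indices(cluster_ids):
    """Map each cluster id to the list of training sample ids assigned to it.

    `cluster_ids[i]` is assumed to be the cluster predicted by the trained
    task encoder for training sample i.
    """
    indices = defaultdict(list)
    for sample_id, cluster in enumerate(cluster_ids):
        indices[cluster].append(sample_id)
    return dict(indices)


def top_k_in_cluster(query, query_cluster, embeddings, indices, k):
    """Return up to k (sample_id, similarity) pairs, most similar first,
    searching only the training samples in the query's cluster."""
    candidates = indices.get(query_cluster, [])
    scored = ((sid, cosine(query, embeddings[sid])) for sid in candidates)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data: five 2-D training embeddings and their (assumed) cluster assignments.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]]
cluster_ids = [0, 0, 1, 1, 0]
indices = build_cluster_indices(cluster_ids)

# The query embedding and its cluster would come from the same encoders
# used at training time; here they are hard-coded for illustration.
query, query_cluster = [1.0, 0.05], 0
neighbours = top_k_in_cluster(query, query_cluster, embeddings, indices, k=2)
```

For a large dataset, the brute-force loop inside `top_k_in_cluster` is the part you would replace with a per-cluster faiss index. The cluster lookup that precedes it stays the same.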

Let me know if these clarifications helped to solve your issue.

Best, Artyom

zaiyan-alam commented 4 weeks ago

Thanks!
