rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License
2.25k stars, 203 forks

Directly get image from indices #377

Open Maxwells-Demons opened 3 months ago

Maxwells-Demons commented 3 months ago

Hi, Rom. I have downloaded laion400m and launched KnnService with the following arguments:

    indices_paths="indices_paths_ViTL14.json"
    clip_model="ViT-L/14"
    enable_hdf5=False
    enable_faiss_memory_mapping=True
    columns_to_return = ["url", "image_path", "caption", "NSFW"]
    reorder_metadata_by_ivf_index=False
    enable_mclip_option=True
    use_jit=False
    use_arrow=True
    provide_safety_model=False
    provide_violence_detector=False
    provide_aesthetic_embeddings=True

    clip_resources = load_clip_indices(
        indices_paths=indices_paths,
        clip_options=ClipOptions(
            indice_folder="",
            clip_model=clip_model,
            enable_hdf5=enable_hdf5,
            enable_faiss_memory_mapping=enable_faiss_memory_mapping,
            columns_to_return=columns_to_return,
            reorder_metadata_by_ivf_index=reorder_metadata_by_ivf_index,
            enable_mclip_option=enable_mclip_option,
            use_jit=use_jit,
            use_arrow=use_arrow,
            provide_safety_model=provide_safety_model,
            provide_violence_detector=provide_violence_detector,
            provide_aesthetic_embeddings=provide_aesthetic_embeddings,
        ),
    )
    knnservice = KnnService(clip_resources=clip_resources)

In clip_retrieval/clip_back.py, in KnnService.query (line 466), there is this call:

    results = self.map_to_metadata(
        indices, distances, num_images, clip_resource.metadata_provider, clip_resource.columns_to_return
    )

and the results contain only 'url' and 'caption', for example:

    [{'url': 'https://s3.us-west-2.amazonaws.com/prod.retreat.guru/images/16212/medium/photo%20%280000000E%29.JPG',
      'caption': 'Soul Safari Holistic Retreats'}]

But I noticed that indices is a list like:

    [193396883, 169693704, 226852546, 94594796, 10774506, 139003161, 3917167, 217605597, 191966779, 197146260]

As you mentioned in other issues before, the URLs are gradually becoming unavailable, so I think it would be more reliable to load the images from the files I have downloaded. So my question is: how can I get the image data directly from these indices?

rom1504 commented 3 months ago

Hi, you can replace the URLs in the metadata with URLs pointing to an HTTP service (e.g. nginx) where you host the images you have downloaded.
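
A minimal Python sketch of that suggestion: rewrite each metadata row's url so that it points at a local HTTP server instead of the original site. The base URL http://localhost:8080/images/, the .jpg extension, and the helper names to_local_url and rewrite_urls are assumptions for illustration; the real layout depends on how the images were downloaded and how the server (e.g. nginx) is configured.

```python
from urllib.parse import quote

# Hypothetical base URL of the HTTP service (e.g. nginx) hosting the downloaded images.
LOCAL_BASE_URL = "http://localhost:8080/images/"


def to_local_url(image_path: str, base_url: str = LOCAL_BASE_URL) -> str:
    """Build a URL on the local server from a metadata image_path key.

    Assumption: each image is served as <base_url>/<image_path>.jpg.
    """
    return base_url.rstrip("/") + "/" + quote(image_path) + ".jpg"


def rewrite_urls(rows: list[dict]) -> list[dict]:
    """Return copies of the metadata rows with 'url' replaced by a local URL."""
    return [{**row, "url": to_local_url(row["image_path"])} for row in rows]


# Made-up row for illustration; the original list is left unchanged.
rows = [{"image_path": "000000007", "url": "https://example.com/a.jpg", "caption": "example"}]
local_rows = rewrite_urls(rows)
```

The rewritten rows could then be written back to the metadata files before starting the service, so that query results carry URLs that resolve locally.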

Maxwells-Demons commented 3 months ago

Sorry, I'm not very familiar with this. Could you please explain it in more detail, e.g. what "urls pointing to an http service where you host the images" means?

BTW, I noticed that the DataFrame loaded from a parquet file has an image_path column:

           image_path                                            caption      NSFW  ...  original_height                                               exif                                             sha256
    0       000000007  Ben Affleck Could Be Latest Addition To <em>Th...  UNLIKELY  ...              320                                                 {}  6561021576f886c0334b06955cea13e973101f296e0280...
    1       000000015  60 Pcs Table Decorations Supplies Moana Themed...  UNLIKELY  ...              200                                                 {}  2432d4ca862e078d911e9becdd7aa7bd85e5832ec5e44f...
    2       000000001  Silverline Air Framing Nailer 90mm 10 - 12 Gau...  UNLIKELY  ...              225                                                 {}  b453f327a45b2b734772d8b38d12c1a441b0d69ceb458e...
    3       000000049                Mini girls green crochet floral top  UNLIKELY  ...              300                                                 {}  0ba5c4d3842b670ec67a95227121c84944d73436b95fcf...
    4       000000075  HARRY CHAPIN - Soundstage: An Evening With Har...  UNLIKELY  ...              200                                                 {}  1cc2add844cdab60decf867ba4242e88fa95b814e6799b...
    ...

However, image_path does not appear in the retrieved metadata (only caption and url are returned) when I start KnnService with the Arrow file. If I use the parquet files directly (use_arrow=False), image_path is present in the retrieved metadata, but it does not match the indices:

    >>> metas[0]['image_path'], indices[0]
    ('194406653', 193396883)

rom1504 commented 3 months ago

Can you explain from the beginning what you are trying to do? I don't follow.

Maxwells-Demons commented 3 months ago

First, I downloaded laion400m and launched KnnService with use_arrow=True. The retrieved metadata contains only 'url' and 'caption', and in the query function I noticed a list called indices. Second, I launched KnnService with use_arrow=False (i.e. using the metadata parquet files). Now the metadata contains 'image_path', but its value differs from the corresponding entry in indices.

So my question is: is it possible to load image data locally from an index number? If so, which is the correct index? Besides, you mentioned:

"replace the urls in the metadata by urls pointing to an http service (eg nginx) where you host the images you have downloaded"

Could you please explain this in more detail?

Thank you so much for taking the time to answer my question. I apologize for any misunderstanding I caused.
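
One possible reading of the mismatch described above is that the values in indices are row positions in the metadata, while image_path is a separate key stored in each row. The Python sketch below illustrates that reading only; it is an assumption that this thread does not confirm, the rows are made up, and image_paths_for is a hypothetical helper.

```python
def image_paths_for(indices, metadata_rows):
    """Map index positions to the image_path stored in the matching row.

    Assumption: each index is the row position in metadata_rows, in the same
    order as the embeddings used to build the index.
    """
    return [metadata_rows[i]["image_path"] for i in indices]


# Made-up rows: the row position and the image_path key differ, as in the thread.
metadata_rows = [
    {"image_path": "000000007", "caption": "a"},
    {"image_path": "000000015", "caption": "b"},
    {"image_path": "000000001", "caption": "c"},
]
paths = image_paths_for([2, 0], metadata_rows)
```

Under this reading, the index number selects the row and the row's image_path identifies the downloaded file.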

rom1504 commented 3 months ago

Where would you like the jpeg bytes to appear after querying the index exactly? Is it in the browser or some other place?

Maxwells-Demons commented 3 months ago

Where would you like the jpeg bytes to appear after querying the index exactly? Is it in the browser or some other place?

In a Python script, since I'm going to do retrieval-augmented generation.
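
For that use case, a minimal Python sketch of reading image bytes from local files by image_path. It assumes the images were saved as tar shards in a webdataset-style layout where the first five characters of the key name the shard (00000.tar) and the member is named <image_path>.jpg. The shard-naming rule, the directory, and the helper read_image_bytes are assumptions to check against the actual download; none of this is part of clip-retrieval.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def read_image_bytes(image_path: str, shard_dir: str, shard_digits: int = 5) -> bytes:
    """Return the raw JPEG bytes for one image_path key.

    Assumed layout: <shard_dir>/<first shard_digits chars of key>.tar containing
    a member named <image_path>.jpg.
    """
    shard = Path(shard_dir) / f"{image_path[:shard_digits]}.tar"
    with tarfile.open(shard) as tar:
        # extractfile raises KeyError if the member is missing.
        member = tar.extractfile(f"{image_path}.jpg")
        if member is None:
            raise FileNotFoundError(f"{image_path}.jpg is not a regular file in {shard}")
        return member.read()


# Demo with a tiny shard built on the fly (fake bytes, not a real JPEG).
with tempfile.TemporaryDirectory() as tmp:
    payload = b"fake-jpeg-bytes"
    info = tarfile.TarInfo(name="000000007.jpg")
    info.size = len(payload)
    with tarfile.open(Path(tmp) / "00000.tar", "w") as tar:
        tar.addfile(info, io.BytesIO(payload))
    data = read_image_bytes("000000007", tmp)
```

Opening a shard for every lookup scans the tar headers each time, so for many lookups it may be worth extracting the shards once or caching open archives.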