Open Maxwells-Demons opened 3 months ago
Hi, you can replace the urls in the metadata by urls pointing to an http service (eg nginx) where you host the images you have downloaded
On Mon, Mar 18, 2024, 12:43 PM Maxwells_Ayakashi @.***> wrote:
Hi, Rom. I have downloaded laion400m and launched KnnService with the following arguments:
    indices_paths = "indices_paths_ViTL14.json"
    clip_model = "ViT-L/14"
    enable_hdf5 = False
    enable_faiss_memory_mapping = True
    columns_to_return = ["url", "image_path", "caption", "NSFW"]
    reorder_metadata_by_ivf_index = False
    enable_mclip_option = True
    use_jit = False
    use_arrow = True
    provide_safety_model = False
    provide_violence_detector = False
    provide_aesthetic_embeddings = True

    clip_resources = load_clip_indices(
        indices_paths=indices_paths,
        clip_options=ClipOptions(
            indice_folder="",
            clip_model=clip_model,
            enable_hdf5=enable_hdf5,
            enable_faiss_memory_mapping=enable_faiss_memory_mapping,
            columns_to_return=columns_to_return,
            reorder_metadata_by_ivf_index=reorder_metadata_by_ivf_index,
            enable_mclip_option=enable_mclip_option,
            use_jit=use_jit,
            use_arrow=use_arrow,
            provide_safety_model=provide_safety_model,
            provide_violence_detector=provide_violence_detector,
            provide_aesthetic_embeddings=provide_aesthetic_embeddings,
        ),
    )
    knnservice = KnnService(clip_resources=clip_resources)
In clip_retrieval/clip_back.py, KnnService.query, line 466:

    results = self.map_to_metadata(
        indices, distances, num_images, clip_resource.metadata_provider, clip_resource.columns_to_return
    )
In results, I only get 'url' and 'caption', like:
[{'url': 'https://s3.us-west-2.amazonaws.com/prod.retreat.guru/images/16212/medium/photo%20%280000000E%29.JPG', 'caption': 'Soul Safari Holistic Retreats'}]
But I noticed that indices is a list like:
[193396883, 169693704, 226852546, 94594796, 10774506, 139003161, 3917167, 217605597, 191966779, 197146260]
As you mentioned in other issues before, the url links are gradually becoming unavailable. I think it would be more reliable to access the images from the files I have downloaded. So my question is: how can I get the image data directly from such indices?
Sorry, I'm not very familiar with this. Could you please explain it in more detail, e.g. "urls pointing to an http service where you host the images"?
BTW, I noticed that in the dataframe of a parquet file, the key image_path exists:
       image_path  caption                                            NSFW      ...  original_height  exif  sha256
    0  000000007   Ben Affleck Could Be Latest Addition To <em>Th...  UNLIKELY  ...  320              {}    6561021576f886c0334b06955cea13e973101f296e0280...
    1  000000015   60 Pcs Table Decorations Supplies Moana Themed...  UNLIKELY  ...  200              {}    2432d4ca862e078d911e9becdd7aa7bd85e5832ec5e44f...
    2  000000001   Silverline Air Framing Nailer 90mm 10 - 12 Gau...  UNLIKELY  ...  225              {}    b453f327a45b2b734772d8b38d12c1a441b0d69ceb458e...
    3  000000049   Mini girls green crochet floral top                UNLIKELY  ...  300              {}    0ba5c4d3842b670ec67a95227121c84944d73436b95fcf...
    4  000000075   HARRY CHAPIN - Soundstage: An Evening With Har...  UNLIKELY  ...  200              {}    1cc2add844cdab60decf867ba4242e88fa95b814e6799b...
    ...
But it does not show up in the retrieved metadata (which has only caption and url) when I start KnnService with the Arrow file. If I start KnnService directly from the parquet files (use_arrow=False), image_path is present in the retrieved metas, but it does not match the indices:
>>> metas[0]['image_path'], indices[0]
('194406653', 193396883)
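One plausible reading of the mismatch above, offered as an assumption rather than a confirmed answer: the values in indices are row positions in the metadata table, while image_path is the key assigned when the images were downloaded. If some downloads failed, keys skip values but rows stay contiguous, so the two numbers drift apart. A toy sketch (metadata_rows and lookup are hypothetical names, not clip-retrieval API):

```python
# Toy stand-in for the metadata table: one row per embedded image.
# The keys skip values, as they would if some downloads had failed
# (an assumption made purely for illustration).
metadata_rows = [
    {"image_path": "000000001", "caption": "first"},
    {"image_path": "000000007", "caption": "second"},
    {"image_path": "000000015", "caption": "third"},
]


def lookup(indices, rows, columns):
    """Treat each index as a row position and return the requested columns."""
    return [{col: rows[i][col] for col in columns} for i in indices]


# Row position 2 holds key "000000015": the row number and the key differ.
hits = lookup([2, 0], metadata_rows, ["image_path"])
print(hits)
```

Under that reading, the index is only useful for fetching the metadata row; the image_path value in that row is what identifies the downloaded file.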
Can you explain from the beginning what you are trying to do? I don't follow.
First, I downloaded laion400m and launched KnnService with use_arrow=True. The retrieved metas only contain 'url' and 'caption'. In the query function, I noticed there is a list called indices.
Second, I launched KnnService with use_arrow=False (i.e. using the metadata parquet files). Now the metas contain 'image_path', but its value differs from the corresponding entry in indices.
So my question is: is it possible to access image data locally from an index number? If so, which is the correct index? Besides, you mentioned:
"replace the urls in the metadata by urls pointing to an http service (eg nginx) where you host the images you have downloaded"
Could you please explain this in more detail?
Thank you so much for taking the time to answer my question. I apologize for any misunderstanding caused.
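As a rough illustration of the "urls pointing to an http service" suggestion, the sketch below rewrites the url field of each result so that it points at a local server. Everything specific here is an assumption: that the images were saved as individual files under `<shard>/<key>.jpg`, that the first five digits of image_path name the shard, and that a server such as nginx serves that folder at http://localhost:8080. local_url and rewrite_urls are hypothetical helpers, not part of clip-retrieval.

```python
from urllib.parse import quote


def local_url(image_path, base_url="http://localhost:8080", shard_digits=5, ext="jpg"):
    """Build a URL for one image key, assuming a <shard>/<key>.<ext> layout."""
    key = str(image_path)
    shard = key[:shard_digits]  # assumed: leading digits of the key are the shard id
    return f"{base_url.rstrip('/')}/{quote(shard)}/{quote(key)}.{ext}"


def rewrite_urls(results, base_url="http://localhost:8080"):
    """Return copies of the result dicts with 'url' pointing at the local server."""
    return [{**r, "url": local_url(r["image_path"], base_url)} for r in results]


results = [{"image_path": "194406653", "url": "https://example.com/a.jpg", "caption": "x"}]
print(rewrite_urls(results))
```

The same rewrite could be applied once to the url column of the metadata files instead of at query time, so that the service returns local links without further changes.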
Where would you like the jpeg bytes to appear after querying the index exactly? Is it in the browser or some other place?
In a Python file, since I'm going to do retrieval-augmented generation.
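For use inside Python, the bytes could be read straight from the downloaded shards without any HTTP service. This is a minimal sketch under stated assumptions: the dataset was saved as tar shards named `<shard>.tar`, each image is stored inside as `<key>.jpg`, and the first five digits of image_path identify the shard. read_image_bytes is a hypothetical helper; only the standard library is used.

```python
import tarfile
from pathlib import Path


def read_image_bytes(image_path, dataset_root, shard_digits=5, ext="jpg"):
    """Return the raw bytes of one image, looked up by its image_path key."""
    key = str(image_path)
    # Assumed layout: <dataset_root>/<shard>.tar containing <key>.<ext>
    shard_file = Path(dataset_root) / f"{key[:shard_digits]}.tar"
    with tarfile.open(shard_file) as tar:
        member = tar.extractfile(f"{key}.{ext}")  # raises KeyError if absent
        if member is None:  # the entry exists but is not a regular file
            raise FileNotFoundError(f"{key}.{ext} is not a regular file in {shard_file}")
        return member.read()
```

A call such as `read_image_bytes(metas[0]["image_path"], "/path/to/laion400m")` would then return bytes that can be passed to an image decoder. Opening a tar file for every lookup is slow, so grouping the retrieved keys by shard before reading may be worthwhile.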