What information is missing there https://github.com/rom1504/clip-retrieval/discussions/100#discussioncomment-2023462 ?
> Thanks for the prompt reply.
> I did end up trying to match the process you mention above, by using the Resizer class to first resize to 256 and then using the CLIP transform to downsize to 224. This reduced the norm difference for the example above down to about 0.1, which is closer, so maybe the difference is just due to some variations in my image decoder library versions.
I tried this myself and can confirm that the difference between the two embeddings is usually reduced, but after running a test with hundreds of embeddings there are still some that aren't similar enough to call it a solution.
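For reference, here is a rough sketch of that two-stage path. It is not the exact img2dataset code: the padding and interpolation details are approximations, which is probably where the remaining mismatch comes from.

```python
# Approximate the pipeline discussed above: square-pad and resize to 256 (roughly what
# img2dataset's default "border" resize does), then apply CLIP's own 224 transform.
import clip
import torch
from PIL import Image, ImageOps

model, preprocess = clip.load("ViT-B/32", device="cpu")

def embed_like_laion(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGB")
    # Pad to a square with black borders and downscale to 256x256, keeping aspect ratio.
    img = ImageOps.pad(img, (256, 256), color=(0, 0, 0))
    # CLIP's preprocess then resizes/crops to 224 and normalizes.
    with torch.no_grad():
        emb = model.encode_image(preprocess(img).unsqueeze(0))
    return emb / emb.norm(dim=-1, keepdim=True)
```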
Regardless, I ended up using your img2dataset library to download most of laion400m myself, and then I am using the clip-retrieval inference script to generate the embeddings myself. It was very helpful to have such code available.
He doesn't mention whether or not using the tool for download and inference produces an exact match. I'll give it a shot and report back with the results if there's interest or if it's unknown.
The answer is simple: if you want to reproduce exactly the embeddings I computed for laion400m, you need to use exactly the same preprocessing.
That includes the default resizing of img2dataset.
However, there is 1% link rot per month, so some images will be gone and some will have changed. If you want to discard changed images, you can use the hashes we provide.
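If it helps, here is a minimal way to use those hashes to drop changed images. The `md5` column name and file paths below are placeholders, so check the actual parquet schema before relying on this.

```python
# Hedged sketch: discard rows whose locally downloaded image no longer matches the hash
# shipped with the metadata (column name "md5" is an assumption, not confirmed).
import hashlib
import pandas as pd

metadata = pd.read_parquet("metadata_0.parquet")  # placeholder path

def matches_reference_hash(image_path: str, expected_md5: str) -> bool:
    """Return True if the local file's md5 equals the hash provided in the metadata."""
    with open(image_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == expected_md5

# Example: keep only rows whose local copy is byte-identical to what was originally indexed.
row = metadata.iloc[0]
if not matches_reference_hash("local_copy_of_row_0.jpg", row["md5"]):
    print("image changed since indexing; discarding row 0")
```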
What is your end goal?
Good to know about the link rot, that makes sense. I'll see what I can do, I'm poking around img2dataset as we speak. The end goal is just to ensure that novel queries are computed correctly, I'm using the known embeddings compared with the embeddings produced by my preprocess/inference functions to error check before I trust that novel embeddings are correct.
The end goal is to pull all embeddings into Milvus and use ANN search to grab similar images for novel queries, pretty much what you've done here except entirely offline and using ANN instead of KNN.
I'll probably end up pulling all the images onto a server somewhere as well to ensure they're always available.
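For what it's worth, here is a rough pymilvus sketch of that offline setup; the collection name, field names, and shard paths are made up for illustration.

```python
# Rough sketch: store LAION embedding shards in Milvus and run ANN search for novel queries.
import numpy as np
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=512),  # ViT-B/32 output size
])
collection = Collection("laion400m_images", schema)

# Insert one embedding shard; ids should map back to rows in the parquet metadata.
embeddings = np.load("img_emb_0.npy").astype(np.float32)
ids = np.arange(len(embeddings), dtype=np.int64)
collection.insert([ids.tolist(), embeddings.tolist()])

collection.create_index("embedding", {
    "index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 1024},
})
collection.load()

# ANN search for a novel query embedding computed with the same CLIP preprocessing.
query = embeddings[:1].tolist()
hits = collection.search(query, "embedding",
                         {"metric_type": "IP", "params": {"nprobe": 16}}, limit=5)
print([h.id for h in hits[0]])
```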
Clip-retrieval and Milvus both use approximate KNN, which I guess is what you call ANN. Faiss is still today the best open-source implementation of KNN.
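For comparison, a tiny Faiss example of the same kind of approximate KNN (the sizes below are arbitrary):

```python
# Approximate nearest-neighbour search with Faiss over L2-normalized CLIP embeddings.
import faiss
import numpy as np

dim = 512                                   # ViT-B/32 embedding size
xb = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(xb)                      # cosine similarity via inner product on unit vectors

index = faiss.index_factory(dim, "IVF1024,Flat", faiss.METRIC_INNER_PRODUCT)
index.train(xb)
index.add(xb)
index.nprobe = 16                           # trade recall for speed

scores, ids = index.search(xb[:5].copy(), k=10)
print(ids[0], scores[0])
```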
Got it. I'm just now starting to explore these vector DBs, so the actual methods are still a bit fuzzy. In any case, I'll close this out: using the Resizer class from img2dataset before CLIP preprocessing produces close-enough results, as mentioned in the previous issue I linked, although still not perfect matches. I appreciate the responses.
This issue is related to an older one: https://github.com/rom1504/clip-retrieval/discussions/100, which didn't seem to be entirely resolved. I was able to replicate the results from that discussion (square-pad the images and use the Resizer class prior to CLIP preprocessing), but it's not a perfect solution.
Anyway, the LAION website provides embeddings and parquets which tie an embedding at an index in the array to its associated metadata. In theory the CLIP output and the LAION embedding for the same image should be exactly the same, but they're not. Here's how to reproduce:
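(The original reproduction script was not captured in this thread; the following is a minimal sketch of such a comparison, with placeholder file paths and an arbitrary row index.)

```python
# Compare a locally computed CLIP ViT-B/32 embedding with the LAION-400M embedding stored
# at the same row index. File names below are examples, not real files.
import clip
import numpy as np
import pandas as pd
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# LAION-400M ships img_emb_*.npy embedding shards plus parquet metadata whose row order
# matches the embedding array.
embeddings = np.load("img_emb_0.npy")
metadata = pd.read_parquet("metadata_0.parquet")

idx = 0  # any row whose image is still reachable / saved locally
image = Image.open("local_copy_of_row_0.jpg")

with torch.no_grad():
    local = model.encode_image(preprocess(image).unsqueeze(0).to(device))
local = local[0].cpu().numpy().astype(np.float32)
stored = embeddings[idx].astype(np.float32)

# Compare after L2-normalizing both vectors.
local /= np.linalg.norm(local)
stored /= np.linalg.norm(stored)
print("norm difference:", np.linalg.norm(local - stored))
print("cosine similarity:", float(local @ stored))
```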
The output shows that the two embeddings are close but clearly not identical.
I've verified that LAION uses the ViT-B/32 backbone as well. I'm wondering what might be causing the discrepancy here. Any ideas? What I'm ultimately looking for is a lightweight addition to pre/post processing that will allow me to ensure that embeddings produced by novel queries are computed correctly.