Closed: mrjackbo closed this issue 1 year ago
@mrjackbo Thank you for pointing that out. We previously ran a simple experiment showing that `_transform_ndarray` did not harm downstream tasks (including retrieval and zero-shot classification), so we concluded that embeddings produced with the same transform operation would be acceptable.
However, regarding your question:
> Are the text embeddings now slightly misaligned to the image embeddings?
I think you are right; we did not consider this use case. We should switch to `_transform_blob`, which could potentially improve text-image retrieval quality.
TODO:
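For concreteness, here is a rough sketch of the PIL-based pipeline from the original CLIP repository, which is the path `_transform_blob` follows. The helper name `pil_preprocess`, the 224-pixel input size, and `example.jpg` are illustrative assumptions, not the actual clip-as-service internals:

```python
# Rough sketch only: mirrors the PIL-based _transform pipeline from the original
# CLIP repository. Resize here operates on a PIL Image, so anti-aliasing is used.
from PIL import Image
from torchvision import transforms


def pil_preprocess(n_px: int = 224) -> transforms.Compose:
    # n_px=224 is an assumption; it should match the model's input resolution.
    return transforms.Compose([
        transforms.Resize(n_px, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(n_px),
        lambda im: im.convert("RGB"),
        transforms.ToTensor(),
        # OpenAI CLIP normalization constants
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711)),
    ])


image = Image.open("example.jpg")   # hypothetical input file
tensor = pil_preprocess()(image)    # (3, 224, 224) tensor, resized with anti-aliasing
```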
Hey, thanks for the great project!
I noticed that you always use `_transform_ndarray` when encoding an image, while `_transform_blob` seems to be more in line with the original code (e.g. here and here). Unfortunately, they give quite different results, as explained in the warning here: `_transform_blob` applies `Resize` to a PIL Image and therefore uses anti-aliasing, while `_transform_ndarray` applies `Resize` to an ndarray and does not use anti-aliasing. If you plot the results, they look quite different. In terms of CLIP embeddings, for my example images I get cosine similarities of around 0.94 (ViT-H14::laion2b-s32b-b79k), which is less than I would have expected; a rough sketch of the comparison is included below.

Am I doing something wrong? Are the models you provide with clip-as-a-service trained with a different preprocessing function than the ones I located in the original repos? Are the text embeddings now slightly misaligned to the image embeddings?
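A minimal sketch of this kind of comparison (assumptions: open_clip is installed, `example.jpg` is a placeholder image path, and the tensor-based pipeline is only an approximation of clip-as-service's `_transform_ndarray`):

```python
# Sketch, not an exact reproduction: compares an anti-aliased PIL resize against a
# non-anti-aliased tensor resize by encoding the same image through both paths.
import numpy as np
import open_clip
import torch
from PIL import Image
from torchvision import transforms

# Assumed to correspond to ViT-H14::laion2b-s32b-b79k in clip-as-service.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
model.eval()

img = Image.open("example.jpg").convert("RGB")  # placeholder image path

# Path 1: Resize applied to a PIL Image (anti-aliased), as _transform_blob does.
x_pil = preprocess(img)

# Path 2: Resize applied to a tensor without anti-aliasing, approximating
# _transform_ndarray (the real implementation may differ in details).
n_px = 224
arr = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float() / 255.0
tensor_preprocess = transforms.Compose([
    transforms.Resize(n_px, interpolation=transforms.InterpolationMode.BICUBIC,
                      antialias=False),
    transforms.CenterCrop(n_px),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
x_tensor = tensor_preprocess(arr)

with torch.no_grad():
    e1 = model.encode_image(x_pil.unsqueeze(0))
    e2 = model.encode_image(x_tensor.unsqueeze(0))

cos = torch.nn.functional.cosine_similarity(e1, e2).item()
print(f"cosine similarity between the two preprocessing paths: {cos:.4f}")
```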