patrickjohncyh / fashion-clip

FashionCLIP is a CLIP-like model fine-tuned for the fashion domain.
MIT License

Fashion-Clip ONNX #27

Closed yaman closed 5 months ago

yaman commented 7 months ago

hi @vinid, thanks again for the amazing work. I am currently working with fashion-clip exported to ONNX via hf-optimum-cli. My use case is creating independent embeddings for text and images with fashion-clip and storing them in Qdrant (a vector DB) for later similarity search. I have four questions:

1- To create image embeddings, I need to implement the image preprocessing in Rust, since there is no Rust implementation of CLIPImageProcessor. Checking the fashion-clip codebase, I traced the preprocessing back to transformers and found CLIPImageProcessor, but I couldn't figure out the exact sequence of operations. Could you give me guidance on the preprocessing steps for the fashion-clip model, to be used together with the exported ONNX model?

I have currently implemented a preprocessor based on the OpenAI CLIP model, but the embeddings produced by my preprocessor + the fashion-clip ONNX model differ heavily from those of the original fashion-clip model.
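For reference, here is a minimal Python sketch of what I believe CLIPImageProcessor does with the default CLIP ViT-B/32 config (resize the shortest edge to 224 with bicubic interpolation, center-crop to 224x224, rescale to [0, 1], normalize with the CLIP mean/std, then move channels first); the config values are assumptions on my side, and this is the sequence I am trying to port to Rust:

```python
from PIL import Image
import numpy as np

# Assumed defaults from the standard CLIP ViT-B/32 preprocessor config.
IMAGE_SIZE = 224
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB")

    # 1. Resize so the *shortest* edge is 224, keeping the aspect ratio (bicubic).
    w, h = img.size
    scale = IMAGE_SIZE / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)

    # 2. Center-crop to 224x224.
    w, h = img.size
    left = (w - IMAGE_SIZE) // 2
    top = (h - IMAGE_SIZE) // 2
    img = img.crop((left, top, left + IMAGE_SIZE, top + IMAGE_SIZE))

    # 3. Rescale to [0, 1] and normalize per channel.
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD

    # 4. HWC -> CHW and add a batch dimension -> (1, 3, 224, 224).
    return arr.transpose(2, 0, 1)[None, :]
```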

2- There is also a slight difference (~1e-3 in absolute terms) between the embeddings generated by a) fashion_clip.encode_image and b) passing the preprocessed text and image through model(**inputs), which confuses me a bit. I would expect a difference of 1e-5 or less. Or is this difference negligible?

This is where I got confused: which of the two embeddings should I use as the reference when testing the ONNX model's accuracy (at atol=1e-5 or 1e-6)?
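To make the comparison concrete, this is roughly how I am measuring the difference (a sketch; the FashionCLIP wrapper usage follows the README as far as I understand it, the dummy text is only there because the joint forward needs both modalities, and I L2-normalize both sides to rule out a pure scale difference):

```python
import numpy as np
import torch
from PIL import Image
from fashion_clip.fashion_clip import FashionCLIP
from transformers import CLIPModel, CLIPProcessor

def l2norm(x):
    return x / np.linalg.norm(x)

image_path = "shirt.jpg"  # hypothetical test image

# a) embedding via the fashion-clip wrapper
fclip = FashionCLIP("fashion-clip")
emb_a = l2norm(np.asarray(fclip.encode_images([image_path], batch_size=1))[0])

# b) embedding via the raw HF model, feeding text and image through model(**inputs)
model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip").eval()
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")
inputs = processor(text=["a photo"], images=Image.open(image_path),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    emb_b = l2norm(model(**inputs).image_embeds[0].numpy())

print(np.abs(emb_a - emb_b).max())
print(np.allclose(emb_a, emb_b, atol=1e-3))  # roughly what I observe today
print(np.allclose(emb_a, emb_b, atol=1e-5))  # what I would have expected
```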

3- For simplicity, I would like to split the text and image embedding functionality into two separate ONNX models, i.e. text_model.onnx and image_model.onnx. I found an OpenAI CLIP model that had been split into text and image parts, but couldn't figure out how it was done internally. I would appreciate it if I could pick your brain on this.
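What I have in mind is something like the following (a sketch I have not fully validated: wrap get_text_features and get_image_features in small nn.Modules and export each with torch.onnx.export; the file names, dummy shapes and opset are my own choices):

```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip").eval()

class TextEncoder(torch.nn.Module):
    def __init__(self, clip):
        super().__init__()
        self.clip = clip
    def forward(self, input_ids, attention_mask):
        return self.clip.get_text_features(input_ids=input_ids,
                                           attention_mask=attention_mask)

class ImageEncoder(torch.nn.Module):
    def __init__(self, clip):
        super().__init__()
        self.clip = clip
    def forward(self, pixel_values):
        return self.clip.get_image_features(pixel_values=pixel_values)

# Dummy inputs; text is padded to CLIP's 77-token context so the same padding
# can be used at inference time, avoiding dynamic sequence-length issues.
dummy_ids = torch.ones(1, 77, dtype=torch.long)
dummy_mask = torch.ones(1, 77, dtype=torch.long)
dummy_pixels = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    TextEncoder(model), (dummy_ids, dummy_mask), "text_model.onnx",
    input_names=["input_ids", "attention_mask"], output_names=["text_embeds"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"},
                  "text_embeds": {0: "batch"}},
    opset_version=14,
)

torch.onnx.export(
    ImageEncoder(model), (dummy_pixels,), "image_model.onnx",
    input_names=["pixel_values"], output_names=["image_embeds"],
    dynamic_axes={"pixel_values": {0: "batch"}, "image_embeds": {0: "batch"}},
    opset_version=14,
)
```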

4- Lastly, this is probably the silliest of my questions:

I have a requirement to handle multiple languages for text embedding generation (multilingual text2text and text2image search). For languages other than English, I therefore want to use clip-vit-b-32-multilingual-v1 for text embeddings and fashion-clip for image embeddings. What would happen if I compute similarity between text embeddings generated by the multilingual model and image embeddings generated by fashion-clip (text2image similarity)? Or do fashion-clip image embeddings have to be paired only with fashion-clip text embeddings?
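My worry is that the multilingual model was trained to match the original OpenAI CLIP ViT-B/32 text space, while fashion-clip's towers have been fine-tuned away from it, so the two spaces may no longer be well aligned. To check this empirically I am thinking of something like the following sketch (the model names are the ones mentioned above, the image path is hypothetical, and both models output 512-dimensional vectors as far as I know):

```python
import numpy as np
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Multilingual text encoder (distilled to mimic OpenAI CLIP ViT-B/32 text embeddings).
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
text_emb = l2norm(text_model.encode(["rotes Kleid", "red dress"]))  # (2, 512)

# Fashion-clip image encoder.
model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip").eval()
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")
inputs = processor(images=Image.open("red_dress.jpg"), return_tensors="pt")
with torch.no_grad():
    img_emb = l2norm(model.get_image_features(**inputs).numpy())  # (1, 512)

# Cosine similarity across the two (possibly misaligned) embedding spaces.
print(text_emb @ img_emb.T)
```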

Thanks in advance! And if I can get the exported ONNX model to behave like the original model, I would like to contribute it to the fashion-clip HF model page, if that is OK with you.

yaman commented 7 months ago

Note: I will follow the order of operations in https://github.com/huggingface/transformers/blob/e5c12c03b711aa2a31b562de3bce92431b4bf662/src/transformers/models/clip/image_processing_clip.py and will report the results as soon as I have them.

yaman commented 7 months ago

The result is the same; I followed the exact order of operations in preprocess from image_processing_clip.py.
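To narrow down whether the remaining discrepancy comes from my preprocessing or from the export itself, I am now feeding the reference processor output into the exported model and comparing against the original PyTorch model (a sketch; image_model.onnx and its pixel_values input name come from my own export attempt sketched under question 3, so they may need adjusting):

```python
import numpy as np
import onnxruntime as ort
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")
model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip").eval()

# Reference preprocessing, so any difference below is due to the model export alone.
pixel_values = processor(images=Image.open("shirt.jpg"),  # hypothetical test image
                         return_tensors="pt")["pixel_values"]

# Reference embedding from the original PyTorch model.
with torch.no_grad():
    ref = model.get_image_features(pixel_values=pixel_values).numpy()

# Embedding from the exported image model.
session = ort.InferenceSession("image_model.onnx", providers=["CPUExecutionProvider"])
onnx_out = session.run(None, {"pixel_values": pixel_values.numpy()})[0]

print(np.abs(ref - onnx_out).max())
print(np.allclose(ref, onnx_out, atol=1e-5))
```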