Closed karndeepsingh closed 1 year ago
Hello!
1. Since we have a Hugging Face model you can probably use https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text
2. I don't see any problem with that!
Thanks @vinid for answering.
Is it the same repo you used for training FashionCLIP? Also, what would be the best way to prepare the dataset for training? Any reference to that (the repo or Hugging Face docs) would be helpful.
Thanks @vinid
No, we used another script to fine-tune FashionCLIP, but since you want to use HF you should probably refer to the link I shared with you.
There's a README at that link that also goes through data preparation!
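As a sketch of what that data preparation can look like: the example's `run_clip.py` accepts a JSON-lines `--train_file`; the column names below (`image_path`, `caption`) are the script's defaults at the time of writing, so double-check them against the README:

```python
# Minimal sketch (not the authors' pipeline) of preparing a JSON-lines
# training file for the HF contrastive-image-text example (run_clip.py).
# Assumed column names: "image_path" and "caption" (the script defaults).
import json

def write_jsonl(pairs, path):
    """pairs: iterable of (image_path, caption) tuples."""
    with open(path, "w", encoding="utf-8") as f:
        for image_path, caption in pairs:
            f.write(json.dumps({"image_path": image_path,
                                "caption": caption}) + "\n")

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

The resulting file is then passed to the example script via `--train_file train.json --image_column image_path --caption_column caption`.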
Thanks @vinid for the recommendation. If you could share the other script that was used for training FashionCLIP, that would also be a great help, as I want to look into it for reference.
Thanks again!
It's a slightly edited version of this one https://github.com/openai/CLIP/issues/83
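For context, the core of that recipe is the symmetric contrastive loss: cross-entropy over the image-to-text logits, plus the same over the text-to-image logits, averaged. A NumPy sketch of that loss (the linked code uses PyTorch; the scale constant stands in for the learned logit temperature):

```python
# Sketch of CLIP's symmetric contrastive loss: matched image/text pairs
# sit on the diagonal of the cosine-similarity logit matrix, and we take
# cross-entropy in both directions.
import numpy as np

def clip_loss(image_emb, text_emb, scale=100.0):
    # L2-normalise, then scaled cosine-similarity logits
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = scale * image_emb @ text_emb.T          # shape (N, N)
    labels = np.arange(len(logits))                  # matches on the diagonal

    def ce(l):                                       # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))
```

Aligned pairs should score a much lower loss than mismatched ones, which is what the fine-tuning drives down.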
@karndeepsingh , FYI I tried the model as-is with a few queries in Spanish and the results were actually OK! Maybe some clothing vocabulary words are close enough in English and Spanish that the encoder manages to bring them to the correct region of the embedding space.
That's true! Did you also try to fine-tune the model on Spanish text?
Thanks @vinid. I just have one more question, please help me understand: I have images of products with textual information like "Category of Product", "Product Title", "Attributes of Product", for example:
How should I use the above text information to build a proper, meaningful caption for my <image, text> pairs? Any suggestion would be great, as this is the most important step in the process. Please help me understand with the above example.
Also, how can we use this model for zero-shot classification? Did you use any predefined prompt in the backend of the Hugging Face Spaces demo for classifying an image against a given set of labels?
Thanks
Hello!
It's hard to say given a single example. You probably need to combine the product title and the attributes in some way. The category might not be useful (it already appears in the product title).
The model is the same as CLIP; there's an example in the Colab notebook that also shows the prompts!
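For reference, zero-shot classification reduces to a softmax over scaled cosine similarities between the image embedding and one prompt embedding per label. A minimal sketch with placeholder vectors (in practice the embeddings would come from the model's image and text encoders, e.g. `get_image_features` / `get_text_features` in `transformers`):

```python
# Sketch of CLIP-style zero-shot classification. The embeddings here are
# placeholders; only the prompt templating and softmax step are shown.
import numpy as np

def prompts(labels, template="a photo of a {}"):
    """Wrap each class label in a CLIP-style prompt."""
    return [template.format(label) for label in labels]

def zero_shot_probs(image_emb, label_embs, scale=100.0):
    """Softmax over scaled cosine similarities; one probability per label."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = scale * label_embs @ image_emb
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

The predicted class is simply the argmax of the returned probabilities.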
Can you post the actual training code to the fashion-clip repo?
Hi!
Not sure we still have it, but I'll check. However, it's the same code you see in the link above; we didn't make major changes to that pipeline.
Is there something in particular you'd like to see?
Yes, we are trying to do some fine-tuning starting from your model.
The problem is that we cannot apply the preprocessing of CLIPProcessor … but actually we are not very sure about what we are doing, so having the training code available would help so much.
We don't have training code for the HF weights.
The model was trained with the OpenAI CLIP code you see above and then we exported the weights to the HF format.
You unfortunately cannot use the current weights with OpenAI's code. If you want to fine-tune it, you need to use the HF scripts (they have one for contrastive training).
Yes, in fact we were trying to load the checkpoint file as shown in that post, but on HF the weights are stored as a ".bin" file and not as ".pt".
So where can we find those scripts? (We have never used HF before.)
This is a good starting point: https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text
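On the ".bin" vs ".pt" point: an OpenAI-style `.pt` checkpoint is a state dict you load yourself, whereas HF weights live inside a repo/directory and are loaded through `from_pretrained`, never by pointing `torch.load` at the `.bin` file. A small sketch of picking the right path (the `loader_for` helper is illustrative, not part of any library; the model-loading lines assume `transformers` is installed):

```python
# Sketch: choose the loading API based on the checkpoint format.
# OpenAI-format .pt files go through torch.load / clip.load;
# HF-format checkpoints (pytorch_model.bin inside a repo) go through
# transformers' from_pretrained.

def loader_for(path: str) -> str:
    """Return which loading route fits a given checkpoint path/id."""
    if path.endswith(".pt"):                      # OpenAI-style state dict
        return "torch.load + clip.load"
    return "transformers.from_pretrained"         # HF repo id or local dir

if __name__ == "__main__":
    from transformers import CLIPModel, CLIPProcessor
    # The public HF checkpoint discussed in this thread:
    model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
    processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")
```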
Hi @vinid , thanks for this amazing work! I'm trying to reproduce the FashionCLIP results using the open_clip implementation. I have gathered good insights from your Nature article, but the following two things aren't clear:
- What batch size was used for fine-tuning? It has been suggested to keep the batch size large enough so the model can better distinguish relevant text/images, but I'm not sure what the optimal choice would be.
- What image resize value was used? Is it 224 × 224, and do you suggest increasing this value to recover some information loss?
It would be very helpful if you could share your parameters for these.
Hi!
I think the last version (the one on HF) was trained with a 1024 batch size.
Unless you tweak things a bit, I'm not sure you can change the image size. We used 224px × 224px. For our use cases it seemed fine!
Okay. Thanks a lot !
Hi @vinid , in the Nature article the cost was calculated based on an AWS p3.2xlarge machine. However, when I try to run a 1024 batch size on the same machine I get an OOM error. A 256 batch size consumes ~12G of memory, I guess, so I'm not sure I can fit a 1024 batch size on a single machine. Did you use multiple GPUs or a bigger machine?
Hi!
You are right! The article describes how we trained FashionCLIP 1.0; FashionCLIP 2.0 was trained on a larger machine.
Hope this helps!
Okay, I will try on a larger machine. Thanks @vinid for your prompt response.
@patrickjohncyh can give more details on how it was trained
Hey @abhishek0093! @vinid is right --- we started using a larger machine after the Nature article to achieve a greater batch size. To do this, we had to use multiple GPUs, so you might want to try a p3.8xlarge, for example. You can refer to this for multi-GPU implementation details.
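A back-of-envelope check of why one p3.2xlarge is not enough, using the numbers from this thread (~12G at batch 256, 16 GiB per V100) and a rough assumption of linear memory scaling:

```python
# Rough sizing sketch: if batch 256 needs ~12 GiB, batch 1024 needs
# ~48 GiB, which exceeds the single 16 GiB V100 on a p3.2xlarge --
# hence the multi-GPU machine. Linear scaling is an approximation.

def gpus_needed(global_batch, gib_per_256=12.0, gib_per_gpu=16.0):
    """Estimate the minimum number of GPUs for a given global batch size."""
    need = gib_per_256 * (global_batch / 256)
    return int(-(-need // gib_per_gpu))      # ceiling division
```

Note that gradient accumulation is not a drop-in substitute here: the contrastive loss only sees negatives within a single forward pass, so accumulating small batches does not enlarge the effective contrastive batch the way multiple GPUs with a gathered loss do.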
Thanks @vinid @patrickjohncyh ! Also, I wanted to confirm that in moving from FashionCLIP 1.0 to 2.0, you only changed the initial model and batch size, which resulted in better performance, and the other parameters remained the same. Specifically, did the training dataset size of 700k stay the same, or was it also increased in the second version?
Yea! The dataset was the very same dataset!
Okay. Thanks again for the help.
@abhishek0093 -- dataset remained the same, but we fine-tuned off laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint.
@patrickjohncyh @vinid I have a few questions related to the optimization, and your help would be amazing. I have around 10M image-text pairs of fashion data (from my workplace) and the objective is to fine-tune CLIP to use internally to improve our search. I created captions by randomizing templates and product information, so I have captions like:
['A photo depicting a gray jersey long sleeve shirt made for women with visual details including: long-sleeves, crew neck.',
'A picture of a brown hoodie designed for ladies with long-sleeves as visual attributes.',
'A photo depicting a brown jacket for women made with front zipper as visual highlights.',
'Some pair of blue jeans',
'A photo of a pair of blue jeans',
'This image showcases a blue quilted jacket designed for ladies with the following visual highlights: hood, front zipper.']
my captions are bounded to colors, article type and fabrics/visual details (if available)
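That template-randomized caption generation could be sketched like this (templates and field names here are illustrative stand-ins, not the actual pipeline):

```python
# Sketch of template-randomized caption generation from structured
# product fields. TEMPLATES and the field names are made up for
# illustration; real data would carry more attributes.
import random

TEMPLATES = [
    "A photo depicting a {color} {article} made for {audience} "
    "with visual details including: {details}.",
    "A picture of a {color} {article} designed for {audience} "
    "with {details} as visual attributes.",
    "A photo of a {color} {article}",
]

def make_caption(product, rng=random):
    """Fill a random template; fall back to the short one if fields are missing."""
    template = rng.choice(TEMPLATES)
    try:
        return template.format(**product)
    except KeyError:
        return "A photo of a {color} {article}".format(**product)
```

One side effect visible in the examples above is that sparsely-attributed products collapse onto near-identical short captions, which is exactly the duplicate-caption problem raised next.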
my images look like this:
did you have this problem at all (maybe the image captions you used were all unique)?
do you think a smaller batch size could help mitigate my problem? (some intuition: if batch A has a pair of jeans, it doesn't matter that it is compared against thousands of images, because batch B will probably have another pair of jeans) ❓
I have the brand too, but I am not sure whether including it would improve the captions, because you often cannot easily tell from an image which brand it is, so maybe it would make the optimization more difficult; imagine two white t-shirts from a high-end and a low-end brand, they may look exactly the same 😕
I decided to use LoRA because fine-tuning the entire laion/CLIP-ViT-B-32-laion2B-s34B-b79K model gives me very bad results (the average cosine similarity between image and caption, i.e. the diagonal of the similarity matrix, doesn't go above 0.3). Do you maybe have tips for a stable fine-tuning? What were the best settings for you for weight decay and learning rate? Maybe you can share the optimizer hyperparameters too?
thank you for your time
Hi @miguelalba96,
Something unique to our dataset was the relatively rich representation of the objects, meaning we had fine-grained attributes for each product. Thus it was less likely we encountered scenarios such as "blue jeans" vs. "a pair of blue jeans". In this context it is actually desirable for close products to be in the same batch, so that we can learn fine-grained attributes. One approach you could consider is a custom sampler that avoids putting products with too-similar textual descriptions into the same batch. This might overcome some of the problems you describe -- you would have to play around with the threshold. You can use the base CLIP text embeddings here. The problem with a small batch size is that it makes the contrastive loss ineffective. In general a larger batch size is better, but note that a larger batch size will also lead to earlier overfitting.
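The sampler suggestion could be sketched as below (a toy greedy version; the threshold is a knob, and the embeddings would come from the base CLIP text encoder rather than the random stand-ins in the test):

```python
# Sketch of a similarity-aware batch builder: greedily add items,
# skipping any whose caption embedding is too close (cosine similarity
# above `threshold`) to something already in the batch.
import numpy as np

def build_batch(embs, batch_size, threshold=0.9):
    """Return indices of a batch with no two captions more similar than threshold."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    chosen = []
    for i in range(len(embs)):
        if len(chosen) == batch_size:
            break
        if all(embs[i] @ embs[j] < threshold for j in chosen):
            chosen.append(i)
    return chosen
```

A production version would wrap this in a PyTorch `Sampler` and shuffle the candidate order per epoch, but the dedup logic is the same.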
I think having the brand might be useful. Certain brands have prominent logos/words which will enable the model to learn the association with the brand in the text. For cases where the brand is not prominent in the image, the model may learn to ignore the brand in the text, and so I don't think adding the brand will be too big a problem.
I suspect the hyperparameters will differ between the full model and LoRA. I can't comment much as I have not used LoRA very much.
Hi, I have been following your work for a long time and I am amazed to see the latest developments in the CLIP domain. However, I am also working on a similar project in retail search and would like to apply this work to my dataset, which has a mix of products like electronics, fashion, etc.; also, the product descriptions are in Spanish.
Please help me with above questions.
thanks