patrickjohncyh / fashion-clip

FashionCLIP is a CLIP-like model fine-tuned for the fashion domain.
MIT License
292 stars 34 forks

How to train the FashionCLIP model on Spanish text with images? #6

Closed karndeepsingh closed 1 year ago

karndeepsingh commented 1 year ago

Hi, I have been following your work for a long time and I am amazed to see the latest developments in the CLIP domain. I am working on a similar project for retail search and would like to apply this work to my dataset, which has a mix of products (electronics, fashion, etc.); the product descriptions are in Spanish.

  1. How can I change the text encoder to Spanish using the HF CLIP API or the FashionCLIP API to custom-train the model?
  2. Is it advisable to train a single CLIP model for the whole mix of products from different categories?

Please help me with the above questions.

Thanks

vinid commented 1 year ago

Hello!

1. Since we have a Hugging Face model, you can probably use https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text

2. I don't see any problem with that!
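
As a sketch of how a Spanish text encoder could be paired with the vision side of a CLIP checkpoint for that script: the model names below are illustrative (dccuchile/bert-base-spanish-wwm-cased is one Spanish BERT on the Hub), and whether the fashion-clip checkpoint can be plugged in as the vision model this way is an assumption worth verifying against the example's README.

```python
# A sketch of initializing a dual encoder for the HF contrastive-image-text
# example: vision tower from a CLIP checkpoint, text tower from a Spanish BERT.
# Model names are illustrative; verify they work with your transformers version.
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "patrickjohncyh/fashion-clip",            # vision encoder (CLIP checkpoint)
    "dccuchile/bert-base-spanish-wwm-cased",  # Spanish text encoder
)
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
image_processor = AutoImageProcessor.from_pretrained("patrickjohncyh/fashion-clip")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

# Save locally so the contrastive training script can start from this checkpoint.
model.save_pretrained("clip-spanish-base")
processor.save_pretrained("clip-spanish-base")
```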

karndeepsingh commented 1 year ago

Thanks @vinid for answering.

karndeepsingh commented 1 year ago

Is this the same repo you used for training FashionCLIP? Also, what would be the best way to prepare the dataset for training with that repo or Hugging Face? Any reference would be helpful.

Thanks @vinid

vinid commented 1 year ago

No, we used another script to fine-tune FashionCLIP, but since you want to use HF you should probably refer to the link I shared with you.

There's a README at that link that also goes through data preparation!
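
For the data side, a rough sketch of one possible layout (column names are an assumption; the example's run_clip.py README documents the exact format and column flags, which may vary across transformers versions):

```python
# Write <image, caption> pairs to a JSON-lines file that the HF example script
# can read as a custom train file; paths and captions below are placeholders.
import json

records = [
    {"image_path": "images/0001.jpg", "caption": "vestido blanco de mujer con corsé"},
    {"image_path": "images/0002.jpg", "caption": "auriculares inalámbricos negros"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
```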

karndeepsingh commented 1 year ago

Thanks @vinid for the recommendation. If you could share the other script that was used for training FashionCLIP, that would also be a great help, as I'd like to look at it for reference.

Thanks again!

vinid commented 1 year ago

It's a slightly edited version of this one https://github.com/openai/CLIP/issues/83
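
For readers landing here, a condensed sketch of the fine-tuning pattern discussed in openai/CLIP#83 (not the exact FashionCLIP script): symmetric cross-entropy over the image-to-text and text-to-image logits. Hyperparameter values are illustrative.

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.float()  # train in fp32 for simplicity; the issue thread also covers mixed precision

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6, weight_decay=0.2)

def train_step(images: torch.Tensor, texts: torch.Tensor) -> float:
    """images: preprocessed image batch, texts: output of clip.tokenize(...)."""
    optimizer.zero_grad()
    logits_per_image, logits_per_text = model(images.to(device), texts.to(device))
    # The i-th image matches the i-th caption, so the targets are the diagonal.
    labels = torch.arange(images.size(0), device=device)
    loss = (loss_img(logits_per_image, labels) + loss_txt(logits_per_text, labels)) / 2
    loss.backward()
    optimizer.step()
    return loss.item()
```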

saskiabosma commented 1 year ago

@karndeepsingh, FYI I tried the model as-is with a few queries in Spanish and the results were actually OK! Maybe some clothing vocabulary words are close enough in English and Spanish that the encoder manages to map them to the correct region of the embedding space.

karndeepsingh commented 1 year ago

That's true! Did you also try fine-tuning the model on Spanish text?

karndeepsingh commented 1 year ago

Thanks @vinid. I just have one more question; please help me understand. I have product images with textual information such as "Category of Product", "Product Title", and "Attributes of Product", for example:

  1. Category of Product : "Dress"
  2. Product Title : "Women Classic White Self-Design Corset Dress"
  3. Attributes of Product : "Brand: Berrylush, Pattern: Corset, Color: White, style: self-design"

How should I use the above text information to build a proper, meaningful caption for my <image, text> pairs? Any suggestion would be great, as this is the most important step in the process. Please help me understand using the above example.

Also, how can we use this model for zero-shot classification? Did you use any predefined prompt in the backend of the Hugging Face Spaces demo for classifying an image against a given set of labels?

Thanks

vinid commented 1 year ago

Hello!

It's hard to say given a single example. You probably need to combine the product title and the attributes in some way. The category might not be useful (it appears in the product title).
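
For instance, a hypothetical way to combine the title with the visually grounded attributes; the field names and template are illustrative, not the recipe used for FashionCLIP.

```python
def build_caption(title: str, attributes: dict) -> str:
    # Keep attributes that are visible in the image (drop brand, etc.).
    visual_keys = {"pattern", "color", "style"}
    visual = {k: v for k, v in attributes.items() if k.lower() in visual_keys}
    attr_text = ", ".join(f"{v.lower()} {k.lower()}" for k, v in visual.items())
    return f"{title.lower()}, {attr_text}" if attr_text else title.lower()

print(build_caption(
    "Women Classic White Self-Design Corset Dress",
    {"Brand": "Berrylush", "Pattern": "Corset", "Color": "White", "style": "self-design"},
))
# women classic white self-design corset dress, corset pattern, white color, self-design style
```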

The model is the same as CLIP; there's an example in the Colab notebook that also shows the prompts!
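
For the zero-shot part, one way to do it with the HF checkpoint (the prompts here are illustrative; the exact ones used in the demo are in the Colab notebook):

```python
# Zero-shot classification over a candidate label set with the HF pipeline.
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="patrickjohncyh/fashion-clip")

labels = ["a photo of a dress", "a photo of a shirt", "a photo of shoes"]
result = classifier("product.jpg", candidate_labels=labels)  # local path or URL
print(result)  # list of {"label": ..., "score": ...} sorted by score
```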

Sladerix commented 5 months ago

> It's a slightly edited version of this one openai/CLIP#83

Can you post the actual training code to the fashion-clip repo?

vinid commented 5 months ago

Hi!

Not sure we still have it, but I'll check. However, it's the same code you see in the link above; we didn't make major changes to that pipeline.

Is there something in particular you'd like to see?

Sladerix commented 5 months ago

Yes, we are trying to do some fine-tuning starting from your model.

The problem is that we cannot apply the CLIPProcessor preprocessing … but actually we are not very sure about what we are doing, so having the training code available would help a lot.

vinid commented 5 months ago

We don't have training code for the HF weights.

The model was trained with the OpenAI CLIP code you see above, and then we exported the weights to the HF format.

Unfortunately, you cannot use the current weights with OpenAI's code. If you want to fine-tune it, you need to use the HF scripts (they have one for contrastive training).

Sladerix commented 5 months ago

Yes, in fact we were trying to load the checkpoint file as shown in that post, but on HF the weights are stored as a ".bin" file and not as a ".pt" file.

So where can we find those scripts? (We have never used HF before.)

vinid commented 5 months ago

This is a good starting point: https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text
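
If it helps, the ".bin" weights on the Hub are loaded through from_pretrained rather than torch.load, and CLIPProcessor takes the place of OpenAI's preprocess transform. A minimal sketch:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

image = Image.new("RGB", (400, 600))  # placeholder; use your product image here
# The processor applies the resize/normalization the model expects.
inputs = processor(text=["a white corset dress"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
```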

abhishek0093 commented 4 months ago

Hi @vinid, thanks for this amazing work! I'm trying to reproduce the FashionCLIP results using the open_clip implementation. I have gathered good insights from your Nature article, but the following two things aren't clear:

  • What batch size was used for fine-tuning? It has been suggested to keep the batch size large enough for the model to better distinguish relevant text/images, but I'm not sure what the optimal choice would be.
  • What image resize value was used? Is it 224 x 224, and do you suggest increasing this value to reduce information loss?

It would be very helpful if you could share the parameters you used for these.

vinid commented 4 months ago

Hi!

I think the last version (the one on HF) was trained with a 1024 batch size.

Unless you tweak stuff a bit I'm not sure you can change the image size. We used 224px x 224px. For our use cases it seemed fine!

abhishek0093 commented 4 months ago

Okay. Thanks a lot !

abhishek0093 commented 4 months ago

Hi @vinid, in the Nature article the cost is calculated based on an AWS p3.2xlarge machine. However, when I try to run a 1024 batch size on the same machine I get an OOM error. For a 256 batch size I believe it consumes ~12 GB of memory, so I'm not sure a 1024 batch size can fit on a single machine. Did you use multiple GPUs or a bigger machine?

vinid commented 4 months ago

Hi!

You are right! The article describes how we trained FashionCLIP 1.0; FashionCLIP 2.0 was trained on a larger machine.

Hope this helps!

abhishek0093 commented 4 months ago

Okay, I will try on a larger machine. Thanks @vinid for your prompt response.

vinid commented 4 months ago

@patrickjohncyh can give more details on how it was trained

patrickjohncyh commented 4 months ago

Hey @abhishek0093! @vinid is right --- we started using a larger machine after the Nature article to achieve a larger batch size. To do this, we had to use multiple GPUs, so you might want to try a p3.8xlarge, for example. You can refer to this for multi-GPU implementation details.
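
For what it's worth, a rough sketch (an assumption, not the exact FashionCLIP 2.0 setup) of how a multi-GPU run is usually wired up with PyTorch DDP, e.g. when launched via `torchrun --nproc_per_node=4` on a p3.8xlarge:

```python
# Wrap the CLIP model in DistributedDataParallel so the effective batch is
# split across several GPUs. Expects the environment variables set by
# `torchrun --nproc_per_node=<num_gpus>`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module) -> DDP:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Each process holds one GPU; gradients are all-reduced across processes.
    return DDP(model.to(local_rank), device_ids=[local_rank])
```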

abhishek0093 commented 4 months ago

Thanks @vinid @patrickjohncyh! Also, I wanted to confirm that in moving from FashionCLIP 1.0 to 2.0 you only changed the initial model and the batch size, which resulted in better performance, and the other parameters remained the same. Specifically, did the training dataset size of 700k stay the same, or was it also increased in the second version?

vinid commented 4 months ago

Yea! Dataset was the very same dataset!

abhishek0093 commented 4 months ago

Okay. Thanks again for the help.

patrickjohncyh commented 4 months ago

@abhishek0093 -- the dataset remained the same, but we fine-tuned off the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint.

miguelalba96 commented 3 months ago

@patrickjohncyh @vinid I have a few questions related to the optimization, and your help would be amazing. I have around 10M image-text pairs of fashion data (from my workplace) and the objective is to fine-tune CLIP to use it internally to improve our search. I created captions by randomizing templates and product information, so I have captions like:

['A photo depicting a gray jersey long sleeve shirt made for women with visual details including: long-sleeves, crew neck.',
 'A picture of a brown hoodie designed for ladies with long-sleeves as visual attributes.',
 'A photo depicting a brown jacket for women made with front zipper as visual highlights.',
 'Some pair of blue jeans',
 'A photo of a pair of blue jeans',
 'This image showcases a blue quilted jacket designed for ladies with the following visual highlights: hood, front zipper.']

My captions are limited to colors, article type, and fabrics/visual details (if available).

My images look like this: (product image attached)

  1. Do you have an idea of the effect of the batch size during fine-tuning compared to the large one used in the pre-training phase? How did you deal with class/caption collisions and data that is very similar when training? I.e., in a single training batch of >1024 you may have items with the same or a similar caption. The model attempts to maximize cosine similarity on the diagonal (matching image-caption pairs) while simultaneously pushing away negative pairs. However, some negative pairs contain captions similar to the positive ones, leading to the model trying to achieve both maximization and minimization within the same batch (see the "blue jeans" captions above).

Did you have this problem at all (maybe the image-captions you used were all unique)?

Do you think a smaller batch size could help mitigate my problem? (Some intuition: if in batch A I have a pair of jeans, it doesn't matter that it is compared against thousands of images, because in batch B I will probably have another pair of jeans.) ❓

  2. I also have the brand, but I'm not sure whether including it would improve the captions, because sometimes you cannot easily tell from an image which brand it is, so it might make the optimization more difficult. Imagine two white t-shirts from a high-end and a low-end brand; they may look exactly the same 😕

  3. I decided to use LoRA because fine-tuning the entire laion/CLIP-ViT-B-32-laion2B-s34B-b79K model is giving me very bad results (the average cosine similarity between image and caption, i.e. the diagonal of the similarity matrix, doesn't go above 0.3). Do you have any tips for stable fine-tuning? What worked best for you for weight decay and learning rate? Maybe you could share the optimizer hyperparameters too?

thank you for your time

patrickjohncyh commented 3 months ago

Hi @miguelalba96,

  1. Something unique to our dataset was its relatively rich representation of the objects, meaning we had fine-grained attributes for each product. It was therefore less likely that we would encounter scenarios such as "blue jeans" vs. "a pair of blue jeans". In this context, it is actually desirable for close products to be in the same batch so that the model can learn fine-grained attributes. One approach you could consider is a custom sampler that avoids sampling products with too-similar textual descriptions into the same batch -- you would have to play around with the threshold, and you can use the base CLIP text embeddings here (see the sketch below this list). The problem with a small batch size is that it makes the contrastive loss ineffective. In general a larger batch size is better, but note that a larger batch size will also lead to earlier overfitting.

  2. I think having the brand might be useful. Certain brands have prominent logos/words which will enable the model to learn the association with the brand in the text. For cases where the brand is not prominent in the image, the model may learn to ignore the brand in the text, so I don't think adding the brand will be too big a problem.

  3. I suspect the hyperparameters will differ between full-model fine-tuning and LoRA. I can't comment much as I have not used LoRA very much.
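
A hypothetical sketch of the custom-sampler idea from point 1, under the assumption that caption embeddings from the base CLIP text encoder are pre-computed and L2-normalized; the greedy strategy and the threshold value are illustrative, not a tested recipe:

```python
# Greedily pack batches while rejecting items whose caption embedding is too
# similar (cosine > threshold) to one already in the current batch.
import torch

def build_batches(text_embeds: torch.Tensor, batch_size: int, threshold: float = 0.9):
    """text_embeds: (N, D) L2-normalized caption embeddings."""
    order = torch.randperm(text_embeds.size(0)).tolist()
    batches, current, current_embeds, leftovers = [], [], [], []
    for idx in order:
        emb = text_embeds[idx]
        if current_embeds and (torch.stack(current_embeds) @ emb).max() > threshold:
            leftovers.append(idx)  # too close to an item already in this batch
            continue
        current.append(idx)
        current_embeds.append(emb)
        if len(current) == batch_size:
            batches.append(current)
            current, current_embeds = [], []
    # Leftover and partial-batch indices can be re-packed in another pass or dropped.
    return batches, current + leftovers
```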