Open weiaicunzai opened 1 week ago

Thank you for your excellent work. My dataset includes both English and Chinese samples, and I am wondering how I can calculate the similarity scores between image-text pairs for both languages simultaneously.
Hi @weiaicunzai , thanks for your attention to Data-Juicer~

We use CLIP as the default model to compute the embeddings of image-text pairs, which works well on English corpora but not on Chinese text (ref https://github.com/openai/CLIP/issues/7). For Chinese text, models like Chinese-CLIP might perform better.

So one possible approach is to split the dataset into English and Chinese subsets with our dedicated dataset_split_by_language tool, and then deploy a different model in the image_text_similarity_filter OP for each subset, as sketched below.
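A minimal sketch of the two configs this suggestion implies, assuming the subsets have already been written out by dataset_split_by_language. The file paths and min_score value are placeholders, and OFA-Sys/chinese-clip-vit-base-patch16 is one Chinese-CLIP checkpoint on HuggingFace; please double-check the hf_clip parameter and model compatibility against the OP docs for your Data-Juicer version:

```yaml
# config_en.yaml: filter the English subset with the default CLIP model
dataset_path: './outputs/split/en.jsonl'    # hypothetical path to the English subset
export_path: './outputs/filtered/en.jsonl'
process:
  - image_text_similarity_filter:
      hf_clip: 'openai/clip-vit-base-patch32'   # default English CLIP
      min_score: 0.2                            # keep pairs scoring above this threshold

# config_zh.yaml: filter the Chinese subset with Chinese-CLIP instead
dataset_path: './outputs/split/zh.jsonl'    # hypothetical path to the Chinese subset
export_path: './outputs/filtered/zh.jsonl'
process:
  - image_text_similarity_filter:
      hf_clip: 'OFA-Sys/chinese-clip-vit-base-patch16'  # a Chinese-CLIP checkpoint
      min_score: 0.2
```

Each config can then be run separately (e.g. `python tools/process_data.py --config config_en.yaml`), and the two filtered subsets merged back together afterwards.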