modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
2.94k stars 176 forks

How to calculate the image_text_similarity scores for both Chinese and English? #473

Open weiaicunzai opened 1 week ago

weiaicunzai commented 1 week ago

Thank you for your excellent work.

Regarding my dataset, which includes both English and Chinese samples, I am wondering how I can simultaneously calculate the similarity scores between image and text pairs for both languages.

HYLcool commented 2 days ago

Hi @weiaicunzai, thanks for your interest in Data-Juicer~

We use CLIP as the default model to compute the embeddings of image-text pairs. It works well on English corpora but not on Chinese text (ref https://github.com/openai/CLIP/issues/7). For Chinese text, models like Chinese-CLIP might perform better.

So one possible approach is to split the dataset into English and Chinese subsets with our dedicated dataset_split_by_language tool, and then run the image_text_similarity_filter OP on each subset with a different model.
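
As a rough illustration of the split-then-route idea, here is a minimal Python sketch. It does not call Data-Juicer itself: `detect_lang` is a hypothetical CJK-character heuristic standing in for the dataset_split_by_language tool, the two checkpoint names are assumed Hugging Face model IDs, and the `hf_clip` and `min_score` keys are assumptions about the OP's parameters that should be checked against the OP's documentation. Whether the OP can load a Chinese-CLIP checkpoint directly also needs to be verified.

```python
import re

# Hypothetical heuristic (not the dataset_split_by_language tool):
# treat a text as Chinese if it contains any CJK unified ideograph.
_CJK = re.compile(r"[\u4e00-\u9fff]")

# Assumed Hugging Face checkpoint names, one per language subset.
MODEL_BY_LANG = {
    "en": "openai/clip-vit-base-patch32",
    "zh": "OFA-Sys/chinese-clip-vit-base-patch16",
}


def detect_lang(text):
    """Return 'zh' if the text contains a CJK ideograph, else 'en'."""
    return "zh" if _CJK.search(text) else "en"


def split_by_language(samples):
    """Partition samples (dicts with a 'text' field) into per-language lists."""
    subsets = {lang: [] for lang in MODEL_BY_LANG}
    for sample in samples:
        subsets[detect_lang(sample["text"])].append(sample)
    return subsets


def build_op_configs(min_score=0.1):
    """Build one OP config per language subset.

    The 'hf_clip' and 'min_score' keys are assumed parameter names.
    """
    return {
        lang: {"image_text_similarity_filter": {"hf_clip": model,
                                                "min_score": min_score}}
        for lang, model in MODEL_BY_LANG.items()
    }


samples = [
    {"text": "a cat sitting on a sofa", "images": ["cat.jpg"]},
    {"text": "一只猫坐在沙发上", "images": ["cat.jpg"]},
]
subsets = split_by_language(samples)
configs = build_op_configs()
for lang, subset in subsets.items():
    model = configs[lang]["image_text_similarity_filter"]["hf_clip"]
    print(lang, len(subset), model)
```

Each subset would then be processed in its own run with the matching config, and the filtered outputs merged afterwards. Note that scores from two different models are not directly comparable, so the threshold probably needs tuning per language.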