How to calculate the image_text_similarity scores for both Chinese and English?

modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Apache License 2.0

Hi @weiaicunzai, thanks for your interest in Data-Juicer~

We use CLIP as the default model to compute the embeddings of image-text pairs. It works well on English text but not on Chinese text (ref https://github.com/openai/CLIP/issues/7). For Chinese text, models like Chinese-CLIP might perform better.

So one possible approach is to split the dataset into English and Chinese subsets with our dedicated dataset_split_by_language tool, and then run the image_text_similarity_filter OP on each subset with a different model.
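
To illustrate the idea, here is a minimal, self-contained Python sketch of the routing step. It is not Data-Juicer's own code: `guess_lang` is a crude hypothetical stand-in for what dataset_split_by_language does, `split_by_language` and `MODEL_FOR_LANG` are names made up for this example, and the two checkpoint names are only example choices for a CLIP and a Chinese-CLIP model. The sample layout (`text` and `images` keys) is also an assumption.

```python
from collections import defaultdict

# Hypothetical mapping from language tag to an example checkpoint name.
# Swap in whichever CLIP / Chinese-CLIP checkpoints you actually use.
MODEL_FOR_LANG = {
    "en": "openai/clip-vit-base-patch32",
    "zh": "OFA-Sys/chinese-clip-vit-base-patch16",
}


def guess_lang(text: str) -> str:
    """Crude stand-in for real language identification:
    returns 'zh' if the text contains any CJK unified ideograph, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


def split_by_language(samples):
    """Group samples into per-language subsets keyed by language tag."""
    subsets = defaultdict(list)
    for sample in samples:
        subsets[guess_lang(sample["text"])].append(sample)
    return dict(subsets)


# Assumed sample layout, for illustration only.
samples = [
    {"text": "a cat sitting on a sofa", "images": ["cat.jpg"]},
    {"text": "一只猫坐在沙发上", "images": ["cat.jpg"]},
]

subsets = split_by_language(samples)
for lang, subset in subsets.items():
    # Each subset would then go through image_text_similarity_filter
    # configured with the model chosen for that language.
    print(lang, MODEL_FOR_LANG[lang], len(subset))
```

In practice each subset would be processed in its own run, with the similarity filter pointed at the model for that language, and the filtered outputs merged afterwards if a single dataset is needed.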

How to calculate the image_text_similarity scores for both Chinese and English? #473