modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
2.63k stars 166 forks

Add image_pair_similarity_filter #393

Closed Qirui-jiao closed 4 weeks ago

Qirui-jiao commented 1 month ago

Calculate the cosine similarity of CLIP image features for paired images, and filter the samples based on this value. Hyperparameters: