opendatalab / CLIP-Parrot-Bias

ECCV 2024: Parrot Captions Teach CLIP to Spot Text
https://linyq17.github.io/CLIP-Parrot-Bias/
Apache License 2.0

Parrot Captions Teach CLIP to Spot Text

[Paper] [Website] [Dataset (OpenDataLab)] [Dataset (Hugging Face)] [Demo]

(Overview figure)

TL;DR

  1. Captions in LAION-2B have a significant bias towards describing visual text content embedded in the images.
  2. Released CLIP models have a strong text-spotting bias in almost every style of web image, so CLIP-filtered datasets are inherently biased toward visual-text-dominant data.
  3. CLIP models easily learn text-spotting capacity from parrot captions while failing to connect the vision-language semantics, just like a text-spotting parrot.
  4. We provide an alternative solution by releasing a less-biased 100M subset filtered from LAION-2B together with pre-trained CLIP models.

News and Updates

Kmeans Model from LAION-400M

We trained the K-means model on CLIP ViT-B-32 features of the LAION-400M dataset using faiss. We first used PCA to reduce the feature dimension. The training and inference code is in kmeans.py.

| PCA weights | Kmeans centroids |
|-------------|------------------|
| Download    | Download         |
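
For reference, here is a minimal sketch of this pipeline with faiss (PCA to reduce the CLIP ViT-B-32 features, then K-means). The feature dimension, PCA output dimension, cluster count, and function names are illustrative assumptions; kmeans.py contains the released implementation.

```python
import faiss
import numpy as np

def train_pca_kmeans(features: np.ndarray, pca_dim: int = 256, n_clusters: int = 4096):
    """Train PCA + K-means on CLIP ViT-B-32 features (shape: N x 512).

    pca_dim and n_clusters are illustrative defaults, not the repo's settings.
    """
    features = np.ascontiguousarray(features, dtype=np.float32)
    # 1) Reduce the feature dimension with PCA.
    pca = faiss.PCAMatrix(features.shape[1], pca_dim)
    pca.train(features)
    reduced = pca.apply_py(features)
    # 2) Cluster the reduced features with faiss K-means.
    kmeans = faiss.Kmeans(pca_dim, n_clusters, niter=20, verbose=True)
    kmeans.train(reduced)
    return pca, kmeans

def assign_clusters(pca, kmeans, features: np.ndarray) -> np.ndarray:
    """Inference: map CLIP features to the id of their nearest centroid."""
    features = np.ascontiguousarray(features, dtype=np.float32)
    reduced = pca.apply_py(features)
    _, assignments = kmeans.index.search(reduced, 1)
    return assignments.ravel()
```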

Generating Synthetic Images from N-gram Vocabulary

We release the generation pipeline for the synthetic images (sys_benchmark.py and Arial.ttf) together with the N-gram vocabularies built from the dataset.

| LAION-2B Caption 1-gram | LAION-2B Caption 2-gram | LAION-2B Co-Emb Text 1-gram |
|-------------------------|-------------------------|-----------------------------|
| Download                | Download                | Download                    |
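
A minimal sketch of the rendering step, assuming a plain background and a single centered word drawn with Arial.ttf; the actual canvas size, font size, and layout used by sys_benchmark.py may differ.

```python
from PIL import Image, ImageDraw, ImageFont

def render_ngram(text: str, size: int = 224, font_path: str = "Arial.ttf") -> Image.Image:
    """Render a single N-gram as a synthetic text image."""
    img = Image.new("RGB", (size, size), color=(255, 255, 255))
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 32)
    # Center the text on the canvas.
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
    w, h = right - left, bottom - top
    draw.text(((size - w) / 2 - left, (size - h) / 2 - top), text, font=font, fill=(0, 0, 0))
    return img

# Example: render one 1-gram from the vocabulary and save it.
# render_ngram("parrot").save("parrot.png")
```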

A Less Text-biased LAION-100M Subset and CLIP Model

Data Curation Pipeline

Training Details

Our training code is based on OpenCLIP.

Note that the OCR model is not perfect, so images in our filtered subset may still contain some text content. Therefore, we also benchmark our trained model on the synthetic image benchmark.

| 100M subset | ViT-B Models |
|-------------|--------------|
| Download    | Download     |
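
A minimal sketch for loading one of the released checkpoints with OpenCLIP; the architecture name (ViT-B-32) and the checkpoint path below are placeholders.

```python
import open_clip

# Placeholder architecture name and checkpoint path; adjust to the
# downloaded release.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="path/to/released_checkpoint.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()
```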
| 1-gram Synthetic Benchmark | Ours (100M) | CLIP (WIT-400M) | OpenCLIP (LAION-2B) | DC medium 128M (DC) | DC large 1.28B (DC) |
|----------------------------|-------------|-----------------|---------------------|---------------------|---------------------|
| Sync. Score (mean) $\downarrow$ | 0.163 | 0.317 | 0.368 | 0.268 | 0.338 |
| Sync. Score (std) | 0.0659 | 0.0305 | 0.0427 | 0.0247 | 0.0341 |
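
As a rough illustration (not the exact Sync. Score definition, which is given in the paper), text-spotting bias on the synthetic benchmark can be probed by embedding a rendered N-gram image and its text with CLIP and measuring their cosine similarity; lower similarity indicates less text-spotting bias. This sketch reuses the hypothetical render_ngram helper and the OpenCLIP model loading from the sketches above.

```python
import torch

@torch.no_grad()
def synthetic_similarity(model, preprocess, tokenizer, word: str) -> float:
    """Cosine similarity between a rendered N-gram image and its text."""
    image = preprocess(render_ngram(word)).unsqueeze(0)
    tokens = tokenizer([word])
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()
```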
| DataComp benchmark | Ours (100M) | CLIP (WIT-400M) | OpenCLIP (LAION-2B) | DC medium 128M (DC) | DC large 1.28B (DC) |
|--------------------|-------------|-----------------|---------------------|---------------------|---------------------|
| ImageNet | 0.526 | 0.633 | 0.666 | 0.176 | 0.459 |
| ImageNet dist. shifts | 0.404 | 0.485 | 0.522 | 0.152 | 0.378 |
| VTAB | 0.481 | 0.526 | 0.565 | 0.259 | 0.426 |
| Retrieval | 0.421 | 0.501 | 0.560 | 0.219 | 0.419 |
| Average | 0.443 | 0.525 | 0.565 | 0.258 | 0.437 |

Acknowledgement

Thanks to these great works:

Reference

@article{lin2023parrot,
  title={Parrot Captions Teach CLIP to Spot Text},
  author={Yiqi Lin and Conghui He and Alex Jinpeng Wang and Bin Wang and Weijia Li and Mike Zheng Shou},
  journal={arXiv preprint arXiv:2312.14232},
  year={2023}
}

@article{he2024opendatalab,
  title={Opendatalab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}

License

Apache 2.0 License