opendatalab / CLIP-Parrot-Bias

[ECCV 2024] Parrot Captions Teach CLIP to Spot Text
https://linyq17.github.io/CLIP-Parrot-Bias/
Apache License 2.0

Why so many faces? #2

Open · YifanXu74 opened this issue 8 months ago

YifanXu74 commented 8 months ago

Hi, nice work!

I noticed that the sample images shown on HuggingFace mostly consist of human faces. Does the actual distribution of the dataset look like this? If so, does this introduce significant bias?

linyq17 commented 8 months ago

Thanks for your interest in our work. Our dataset is organized by K-means clustering labels, so the samples shown on HuggingFace likely all come from one cluster, probably one centered on human-related concepts. The dataset has also been balanced across the K-means labels during sampling.

YifanXu74 commented 8 months ago

Got it. So does this mean that the parquet files of the released dataset are ordered by K-means clustering label, without any shuffling? If so, this may bias training with WebDataset, since WebDataset only shuffles within a limited buffer rather than across the full dataset.
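
For example (a minimal sketch, assuming the webdataset package; the shard pattern is a placeholder), .shuffle() only mixes samples within an in-memory buffer:

import webdataset as wds

# shuffle(1000) only mixes samples inside a 1000-sample buffer;
# if each tar shard holds a single K-means cluster, most nearby
# samples will still come from the same cluster.
dataset = (
    wds.WebDataset("dataset-{00000..00999}.tar")  # placeholder shard pattern
    .shuffle(1000)
    .decode("pil")
    .to_tuple("jpg", "txt")
)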

linyq17 commented 8 months ago

Yes, the released dataset is ordered by the clustering labels, so if WebDataset is used for training, the data needs to be shuffled before packing. Our released model was trained on the shuffled dataset. Thanks for the reminder; we will add a note to the README.

YifanXu74 commented 8 months ago

That's great. Currently, when downloading the parquet dataset with img2dataset, the images in one tar file probably share similar labels. It would be very useful if a script could be provided to shuffle the downloaded data.
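
For reference, I download roughly as follows (a sketch using img2dataset's Python API; the column names "URL" and "TEXT" are my guesses and should be adjusted to the actual parquet schema):

from img2dataset import download

# Download the images listed in the parquet files into WebDataset tars.
# url_col / caption_col are assumptions about the released schema.
download(
    url_list="parquet_dir",
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="webdataset",
    output_folder="dataset",
    processes_count=16,
    thread_count=64,
)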

linyq17 commented 8 months ago

Sure, it can be shuffled with pandas. Here is an example snippet:

import os
import concurrent.futures

import pandas as pd

def read_shuffle_multiple_parquet_files(parquet_dir):
    # Collect every parquet shard in the directory.
    files = [
        os.path.join(parquet_dir, f)
        for f in os.listdir(parquet_dir)
        if f.endswith('.parquet.snappy')
    ]
    # Read all shards in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as executor:
        dfs = list(executor.map(pd.read_parquet, files))
    # Concatenate into one table, shuffle all rows, and write it back out.
    df = pd.concat(dfs, ignore_index=True)
    df = df.sample(frac=1).reset_index(drop=True)
    df.to_parquet('out.parquet.snappy', compression='snappy')
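
Note that this loads the whole dataset into memory and writes a single file. If one file is too large for the downstream tooling, the shuffled frame can be split back into shards (a sketch; the shard count of 64 is arbitrary):

import os

import numpy as np

def write_shards(df, out_dir, num_shards=64):
    # np.array_split also works on DataFrames and returns a list of
    # row-wise chunks of (nearly) equal size.
    os.makedirs(out_dir, exist_ok=True)
    for i, shard in enumerate(np.array_split(df, num_shards)):
        shard.to_parquet(
            os.path.join(out_dir, f'part-{i:05d}.parquet.snappy'),
            compression='snappy',
        )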