Open YifanXu74 opened 8 months ago
Thanks for your interest in our work. Our dataset is organized by K-means cluster labels, so the samples previewed on HuggingFace look alike because they come from a cluster that is probably about a human-related concept. The dataset was also balanced across the K-means labels during sampling.
Got it. So the parquet files of the released dataset are ordered by K-means cluster label, without any shuffling? If so, this may bias training with WebDataset, since WebDataset does not fully shuffle the data.
Yes, the released dataset is ordered by cluster label. If WebDataset is used for training, the data needs to be shuffled before it is packed into shards. Our released model was trained on the shuffled dataset. Thanks for the reminder. We will add more tips to the README.
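To illustrate why a bounded shuffle buffer does not fix label-sorted data, here is a minimal, self-contained simulation. It is a simplified stand-in for a streaming shuffle buffer, not WebDataset's actual implementation, and the toy data (10 clusters of 1000 samples), the buffer size, and the batch size are all assumptions chosen for illustration.

```python
import random


def buffer_shuffle(stream, buffer_size, rng):
    """Approximately shuffle `stream` using a bounded in-memory buffer."""
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buf[idx]
        buf[idx] = item
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buf)
    yield from buf


def mean_distinct_labels(labels, batch_size):
    """Average number of distinct labels per consecutive batch."""
    counts = [len(set(labels[i:i + batch_size]))
              for i in range(0, len(labels), batch_size)]
    return sum(counts) / len(counts)


# Hypothetical toy data: cluster labels stored in sorted order.
sorted_labels = [c for c in range(10) for _ in range(1000)]

rng = random.Random(0)
buffered = list(buffer_shuffle(sorted_labels, buffer_size=100, rng=rng))

fully_shuffled = sorted_labels[:]
rng.shuffle(fully_shuffled)

mean_buffered = mean_distinct_labels(buffered, batch_size=64)
mean_full = mean_distinct_labels(fully_shuffled, batch_size=64)
print("distinct labels per batch, buffer shuffle:", mean_buffered)
print("distinct labels per batch, full shuffle:  ", mean_full)
```

With a buffer much smaller than a cluster, each batch still draws from only one or two neighbouring clusters, whereas a global shuffle mixes all of them. This is why shuffling the parquet rows before packing matters.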
That's great. Currently, when the parquet dataset is downloaded with img2dataset, the images in a given tar file probably share similar labels. It would be very useful to have a script for shuffling the downloaded data.
Sure, it can be shuffled using pandas. Here is an example snippet for shuffling:
import os
import concurrent.futures
import pandas as pd

def read_shuffle_multiple_parquet_files(parquet_dir):
    # Collect every snappy-compressed parquet file in the directory.
    files = [os.path.join(parquet_dir, f)
             for f in os.listdir(parquet_dir)
             if f.endswith('.parquet.snappy')]
    # Read the files in parallel threads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as executor:
        dfs = list(executor.map(pd.read_parquet, files))
    # Concatenate everything, then shuffle all rows globally.
    df = pd.concat(dfs, ignore_index=True)
    df = df.sample(frac=1).reset_index(drop=True)
    # Write the shuffled table back out as a single parquet file.
    df.to_parquet('out.parquet.snappy', compression='snappy')
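If a single output file is inconvenient, for example because a downstream tool processes input files one at a time, the shuffled table can also be split into several files. The following is a minimal sketch of that variation, not part of the released tooling. The helper names `shuffle_and_split` and `write_shards`, the `part-XXXXX` file naming, and the toy `url`/`cluster` columns are hypothetical.

```python
import os
import pandas as pd


def shuffle_and_split(df, num_shards, seed=0):
    """Globally shuffle `df`, then deal the rows round-robin into shards."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    return [shuffled.iloc[i::num_shards].reset_index(drop=True)
            for i in range(num_shards)]


def write_shards(shards, out_dir):
    """Write each shard as a snappy-compressed parquet file (needs pyarrow)."""
    os.makedirs(out_dir, exist_ok=True)
    for i, shard in enumerate(shards):
        path = os.path.join(out_dir, f"part-{i:05d}.parquet.snappy")
        shard.to_parquet(path, compression="snappy")


# Hypothetical toy table, ordered by cluster label like the released files.
toy = pd.DataFrame({
    "url": [f"http://example.com/{i}.jpg" for i in range(120)],
    "cluster": [i // 40 for i in range(120)],
})
shards = shuffle_and_split(toy, num_shards=4)
for i, shard in enumerate(shards):
    print(f"shard {i}: {len(shard)} rows, {shard['cluster'].nunique()} clusters")
```

Calling `write_shards(shards, "shuffled_parquet")` would then write the files to disk. Fixing `seed` makes the shuffle reproducible, and every shard ends up with a mix of clusters because the split happens after the global shuffle.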
Hi, nice work!
I noticed that the example samples shown on HuggingFace mostly consist of human faces. Is the actual distribution of the dataset like this? If so, does it result in significant bias?