openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

public datasets for evaluation #45

Open meigaoms opened 3 years ago

meigaoms commented 3 years ago

Hi there, I'm trying to set up the public evaluation datasets listed in Table 9, but I got different train/test sizes for some of them:

  1. Facial Emotion Recognition 2013
    The dataset I found on Kaggle has a train set of 28,709, a validation (public test) set of 3,589 (train + val 32,298 in total), and a test (private test) set of 3,589.
  2. STL-10
    TensorFlow's stl10 has a training set of 5,000 images and a test set of 8,000.
  3. EuroSAT
    TensorFlow's eurosat only has a training split of 27,000 images.
  4. RESISC45
    The site TensorFlow refers to only has a training split, which is 31,500 images.
  5. GTSRB
    The archive I found has two training sets (GTSRB_Final_Training_Images.zip and GTSRB-Training_fixed.zip), but both differ in size from Table 9.
This is what Table 9 shows:

| Dataset | Classes | Train size | Test size | Evaluation metric |
| --- | --- | --- | --- | --- |
| Facial Emotion Recognition 2013 | 8 | 32,140 | 3,574 | accuracy |
| STL-10 | 10 | 1,000 | 8,000 | accuracy |
| EuroSAT | 10 | 10,000 | 5,000 | accuracy |
| RESISC45 | 45 | 3,150 | 25,200 | accuracy |
| GTSRB | 43 | 26,640 | 12,630 | accuracy |

It would be greatly appreciated if you could point me to the source of the data splits shown in Table 9.

pj-ms commented 2 years ago

Same question here. It would be great if you could share which sources you used for these public evaluation datasets. Thanks.

For example, here is the table of FER-2013 statistics from another paper, which is consistent with the Kaggle page but different from the stats reported in the CLIP paper:

[image: FER-2013 dataset statistics from the referenced paper]

In the paper the CLIP paper cites for FER-2013, it says: "The resulting dataset contains 35887 images, with 4953 “Anger” images, 547 “Disgust” images, 5121 “Fear” images, 8989 “Happiness” images, 6077 “Sadness” images, 4002 “Surprise” images, and 6198 “Neutral” images." This is consistent with the numbers on the Kaggle page but different from the numbers reported in the CLIP paper.

jongwook commented 2 years ago

Hi, thanks for pointing out some of the details we glossed over or omitted; upon investigating, we found the following:

  1. Facial Emotion Recognition 2013: We found an error in the table-generation script that reported smaller numbers than it was supposed to; you can use the official numbers. We had a similar issue with the UCF-101 dataset.
  2. STL-10: We reported the average over the 10 pre-defined training folds provided by the official source (see the sketch after this list).
  3. EuroSAT: We realize the paper is missing critical reproducibility info on EuroSAT; given the lack of official splits, and to make the dataset class-balanced, we randomly sampled 500 train/validation/test images for each class. Below is the code for deterministically sampling those images.
root = f"{DATA_ROOT}/eurosat/2750"
seed = 42
random.seed(seed)
train_paths, valid_paths, test_paths = [], [], []
for folder in [os.path.basename(folder) for folder in sorted(glob.glob(os.path.join(root, "*")))]:
    keep_paths = random.sample(glob.glob(os.path.join(root, folder, "*")), 1500)
    keep_paths = [os.path.relpath(path, root) for path in keep_paths]
    train_paths.extend(keep_paths[:500])
    valid_paths.extend(keep_paths[500:1000])
    test_paths.extend(keep_paths[1000:])

We could have used a better setup, such as mean-per-class accuracy using all available data, and we would encourage future studies to do so. Note, however, that the comparisons in the paper used this same subset across all models, so their relative scores can still be considered “fair”.

  4. RESISC-45: Similar to EuroSAT, we used a custom split given the lack of an official one:
```python
import glob
import os
import random

# assumes DATA_ROOT points at the RESISC45 images and `split` is one of 'train', 'valid', 'test'
root = f"{DATA_ROOT}/resisc45"
seed = 42
paths = sorted(glob.glob(os.path.join(root, "*.jpg")))
random.seed(seed)
random.shuffle(paths)
if split == 'train':
    paths = paths[:len(paths) // 10]                         # first 10% (3,150 images)
elif split == 'valid':
    paths = paths[len(paths) // 10:(len(paths) // 10) * 2]   # next 10%
elif split == 'test':
    paths = paths[(len(paths) // 10) * 2:]                   # remaining 80%
else:
    raise NotImplementedError
```
  5. GTSRB: As @pj-ms found in #156, it turns out that we used an inconsistent train/test split.
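
For the STL-10 protocol mentioned in item 2, here is a minimal sketch of averaging accuracy over the 10 pre-defined folds, assuming torchvision's `STL10` dataset and a hypothetical `fit_and_score` linear-probe helper (the actual evaluation code used for the paper may differ):

```python
import numpy as np
from torchvision.datasets import STL10

from my_probe import fit_and_score  # hypothetical helper: trains a linear probe, returns test accuracy

test_set = STL10(root="data", split="test", download=True)
accuracies = []
for fold in range(10):
    # each pre-defined fold contains 1,000 labeled training images
    train_set = STL10(root="data", split="train", folds=fold, download=True)
    accuracies.append(fit_and_score(train_set, test_set))

print(f"STL-10 accuracy (mean over 10 folds): {np.mean(accuracies):.4f}")
```
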
pj-ms commented 2 years ago

Hi Jong, thanks so much for all the information! It is super helpful.

I have some questions about two more datasets and would really appreciate it if you could help. Thanks in advance.

Birdsnap: the official site only provides the image URLs. When using the download script that comes with the official dataset, I ended up with "NEW_OK:40318, ALREADY_OK:0, DOWNLOAD_FAILED:5030, SAVE_FAILED:0, MD5_FAILED:4481, MYSTERY_FAILED:0.". Have you folks experienced similar problems?

CLEVR (Counts): the CLIP paper says "2,500 random samples of the CLEVR dataset (Johnson et al., 2017)", while the official data site says "A training set of 70,000 images and 699,989 questions, A validation set of 15,000 images and 149,991 questions, A test set of 15,000 images and 14,988 questions". The original dataset appears to be a VQA dataset, but judging from the prompts and the wording in the paper ("counting objects in synthetic scenes (CLEVRCounts)"), it seems it was transformed into a counting-classification dataset. Could you share a bit about how this was done, and do you happen to still have the sampling script? (A rough sketch of what I imagine is below.) Thanks

[image]
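
For reference, here is a minimal sketch of how I imagine the counting labels could be derived, assuming they come from the number of objects in each scene's annotation in the CLEVR scenes JSON (just my guess at the transformation, not necessarily what was done for the paper):

```python
import json
import random

# assumes the CLEVR v1.0 archive has been extracted locally (path is illustrative)
with open("CLEVR_v1.0/scenes/CLEVR_val_scenes.json") as f:
    scenes = json.load(f)["scenes"]

random.seed(0)  # arbitrary seed, just to make the example deterministic
sampled = random.sample(scenes, 2500)

# label each image with its object count; CLEVR scenes contain between 3 and 10 objects
examples = [(scene["image_filename"], len(scene["objects"])) for scene in sampled]
print(examples[:3])
```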

Thanks!

meigaoms commented 2 years ago

@jongwook Thank you so much for sharing these details. I have two more detailed questions.

For EuroSAT and RESISC45, is seed=42 always used for the deterministic sampling in your experiments? Table 9 lists the RESISC45 train size as 3,150, but with your code I get a train set of 3,150 images and a validation set of 3,150 images. Are they supposed to be combined into a total train set of 6,300?

We would like to make our benchmarking setup comparable to CLIP if possible. However, when I run the sampled EuroSAT with a few models (ResNet50, ResNet101, efficientnet_b0, and CLIP ViT-B/32), the scores I get are all about 2.3~4.6% lower than their equivalents in Table 10. The RESISC45 scores with the same models are close to those in Table 10, but about 5% higher for ViT_base_patch16_224.
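
In case it helps with debugging, here is a quick sanity check I have been using to compare splits across machines, assuming `train_paths`, `valid_paths`, and `test_paths` come from the EuroSAT sampling snippet above; if the sampling is fully deterministic, the fingerprints should match everywhere:

```python
import hashlib

def split_fingerprint(paths):
    """Hash the sorted relative paths so splits can be compared across machines."""
    joined = "\n".join(sorted(paths)).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()[:12]

# train_paths / valid_paths / test_paths come from the EuroSAT sampling code above
for name, paths in [("train", train_paths), ("valid", valid_paths), ("test", test_paths)]:
    print(name, len(paths), split_fingerprint(paths))
```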