Open meigaoms opened 3 years ago
Hi Jong, thanks so much for all the information! It is super helpful.
I have some questions about two more datasets and would really appreciate it if you could help. Thanks in advance.
Birdsnap: the official site only provides the image URLs. When I used the script from the official dataset release to download the images, I ended up with “NEW_OK:40318, ALREADY_OK:0, DOWNLOAD_FAILED:5030, SAVE_FAILED:0, MD5_FAILED:4481, MYSTERY_FAILED:0.” Have you folks experienced similar problems?
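For reference, this is the kind of retry-and-verify wrapper I've been putting around the downloads; the function name and the idea of passing in a `fetch` callable are my own, not part of the official Birdsnap script:

```python
import hashlib

def fetch_with_md5(fetch, url, expected_md5, retries=3):
    """Call fetch(url) up to `retries` times until the payload's MD5 matches."""
    for _ in range(retries):
        try:
            data = fetch(url)
        except OSError:
            continue  # network error; retry
        if hashlib.md5(data).hexdigest() == expected_md5:
            return data  # verified download
    return None  # would count as DOWNLOAD_FAILED or MD5_FAILED
```

In practice I pass `lambda u: urllib.request.urlopen(u, timeout=30).read()` as `fetch`, but even with retries a few thousand URLs are simply dead by now.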
CLEVR (Counts): the CLIP paper says “2,500 random samples of the CLEVR dataset (Johnson et al., 2017)”, while the official data site describes “a training set of 70,000 images and 699,989 questions, a validation set of 15,000 images and 149,991 questions, a test set of 15,000 images and 14,988 questions”. The original dataset is a VQA dataset, but judging from the prompts and the wording in the paper (“counting objects in synthetic scenes (CLEVRCounts)”), it seems to have been transformed into a counting classification dataset. Could you share a bit of information about how this was done, and do you happen to still have the sampling script? Thanks
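To clarify what I mean by the transformation, here is how I imagined it could work, using the scene annotations shipped with the official CLEVR release; the function name, seed, and sampling scheme below are my guesses, not CLIP's actual procedure:

```python
import json
import random

def clevr_count_labels(scenes_json_path, n_samples=2500, seed=0):
    # Each entry in the official scene annotations lists the objects in
    # that image, so the object count can serve as a classification label.
    with open(scenes_json_path) as f:
        scenes = json.load(f)["scenes"]
    random.seed(seed)
    sampled = random.sample(scenes, min(n_samples, len(scenes)))
    return [(s["image_filename"], len(s["objects"])) for s in sampled]
```

Is this roughly what was done, or was the class set (e.g. count buckets) defined differently?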
Thanks!
Same question here. It would be great if you could share the sources of the public datasets you used for evaluation. Thanks.
For example, the table of FER-2013 stats in another paper is consistent with the Kaggle page, but different from the stats reported in the CLIP paper.
The paper the CLIP paper cites for FER-2013 says: “The resulting dataset contains 35887 images, with 4953 “Anger” images, 547 “Disgust” images, 5121 “Fear” images, 8989 “Happiness” images, 6077 “Sadness” images, 4002 “Surprise” images, and 6198 “Neutral” images.” This matches the numbers on the Kaggle page but not the numbers reported in the CLIP paper.
Hi, thanks for pointing out some details we were cursory about or missed; upon investigating, we found that:
- Facial Emotion Recognition 2013: We noticed an error in the table-generation script, which reported smaller numbers than it should have; you can use the official numbers. We had a similar issue with the UCF-101 dataset.
- STL-10: We reported the average over the 10 pre-defined folds as provided by the official source.
- EuroSAT: We realize the paper is missing critical reproducibility info on EuroSAT; given the lack of official splits, and to make a class-balanced dataset, we randomly sampled 500 train/validation/test images for each class. Below is the code for deterministically sampling those images:
```python
import glob
import os
import random

# assumes DATA_ROOT points at the directory containing the dataset folders
root = f"{DATA_ROOT}/eurosat/2750"
seed = 42
random.seed(seed)
train_paths, valid_paths, test_paths = [], [], []
# iterate over the class folders in sorted order for determinism
for folder in [os.path.basename(folder) for folder in sorted(glob.glob(os.path.join(root, "*")))]:
    # sample 1,500 images per class: 500 train / 500 valid / 500 test
    keep_paths = random.sample(glob.glob(os.path.join(root, folder, "*")), 1500)
    keep_paths = [os.path.relpath(path, root) for path in keep_paths]
    train_paths.extend(keep_paths[:500])
    valid_paths.extend(keep_paths[500:1000])
    test_paths.extend(keep_paths[1000:])
```
We could have used a better setup, such as mean-per-class accuracy using all available data, and we'd encourage future studies to do so; note, however, that the comparisons in the paper used this same subset across all models, so their relative scores can still be considered “fair”.
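For clarity, by “mean-per-class” we mean averaging the per-class accuracies so that every class is weighted equally regardless of how many test images it has; a minimal sketch:

```python
from collections import defaultdict

def mean_per_class_accuracy(y_true, y_pred):
    # Accumulate per-class hit counts, then average the per-class
    # accuracies so each class contributes equally to the final score.
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

This is equivalent to balanced accuracy and removes the need for a class-balanced subsample.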
- RESISC-45: Similar to EuroSAT, we used a custom split, given the lack of an official one:
```python
import glob
import os
import random

# assumes DATA_ROOT points at the directory containing the dataset folders
root = f"{DATA_ROOT}/resisc45"
seed = 42
paths = sorted(glob.glob(os.path.join(root, "*.jpg")))
random.seed(seed)
random.shuffle(paths)
# 10% train, 10% valid, 80% test
if split == 'train':
    paths = paths[:len(paths) // 10]
elif split == 'valid':
    paths = paths[len(paths) // 10:(len(paths) // 10) * 2]
elif split == 'test':
    paths = paths[(len(paths) // 10) * 2:]
else:
    raise NotImplementedError
```
- GTSRB: As @pj-ms found in GTSRB dataset issue #156, it turns out that we used an inconsistent train/test split.
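One more note on the RESISC-45 snippet: since the images are globbed from a single flat directory, the class label has to come from the filename. Assuming files are named `<class>_<index>.jpg` (e.g. `baseball_diamond_042.jpg`; this layout is my description of a typical flattened copy, not something verified against any particular download), a label parser would look like:

```python
import os

def label_from_filename(path):
    # "baseball_diamond_042.jpg" -> "baseball_diamond"; rsplit on the last
    # underscore preserves underscores inside multi-word class names.
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.rsplit("_", 1)[0]
```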
@jongwook Thank you so much for sharing these details. I have two more detailed questions:
For the EuroSAT and RESISC45 datasets, is seed=42 always used in your experiments for deterministic sampling?
From Table 9, the train size of RESISC45 is 3,150. With your code I got a train set of 3,150 images and a validation set of 3,150 images. Are they supposed to be added together, for a total train size of 6,300?
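For context, my size check assumed the full RESISC45 release (45 classes × 700 images = 31,500), which with the tenth-based slicing in your snippet gives:

```python
total = 45 * 700           # full RESISC45: 45 classes x 700 images each
tenth = total // 10        # size of one tenth of the shuffled list
train, valid = tenth, tenth
test = total - 2 * tenth   # everything after the first two tenths
print(train, valid, test)  # 3150 3150 25200
```

So train alone matches the 3,150 in Table 9; I just want to confirm that validation is not folded into it.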
We would like to make our benchmarking settings comparable to CLIP's if possible. However, when I run the sampled EuroSAT with a few models (such as ResNet50, ResNet101, efficientnet_b0, and CLIP ViT-B/32), the scores I get are all about 2.3–4.6% lower than their equivalents in Table 10. The RESISC45 scores with the same models are close to those in Table 10, but about 5% higher for ViT_base_patch16_224.
Hi there, I'm trying to set up the public datasets for evaluation listed in Table 9, but I got different train/test sizes for some of them:
For example, the FER-2013 dataset I found on Kaggle has a train set of 28,709, a validation (public test) set of 3,589, and a test (private test) set of 3,589 (train + val: 32,298 in total).
It would be greatly appreciated if you could point me to the source of the data splits shown in Table 9.