openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

public datasets for evaluation #45

Open meigaoms opened 3 years ago

meigaoms commented 3 years ago

Hi there, I'm trying to set up the public evaluation datasets listed in Table 9, but I got different train/test sizes for some of them:

  1. Facial Emotion Recognition 2013
    The dataset I found on Kaggle has a train set of 28,709, a validation (public test) set of 3,589 (train + val 32,298 in total), and a test (private test) set of 3,589.
  2. STL-10
    TensorFlow's stl10 has a training set of 5,000 images and a test set of 8,000.
  3. EuroSAT
    TensorFlow's eurosat only has a training split of 27,000 images.
  4. RESISC45
    The site TensorFlow refers to only has a training split, which is 31,500 images.
  5. GTSRB
    The archive I found has two training sets (GTSRB_Final_Training_Images.zip and GTSRB-Training_fixed.zip), but both differ in size from Table 9.
This is what Table 9 shows:

| Dataset | Classes | Train size | Test size | Evaluation metric |
| --- | --- | --- | --- | --- |
| Facial Emotion Recognition 2013 | 8 | 32,140 | 3,574 | accuracy |
| STL-10 | 10 | 1,000 | 8,000 | accuracy |
| EuroSAT | 10 | 10,000 | 5,000 | accuracy |
| RESISC45 | 45 | 3,150 | 25,200 | accuracy |
| GTSRB | 43 | 26,640 | 12,630 | accuracy |

It would be greatly appreciated if you could point me to the source of the data splits shown in Table 9.

pj-ms commented 2 years ago

Same question here. It would be great if you could share which sources you used for these public evaluation datasets. Thanks.

For example, here is the table of FER-2013 statistics from another paper, which is consistent with the Kaggle page but different from the stats reported in the CLIP paper:

[image: FER-2013 dataset statistics from the referenced paper]

In the paper the CLIP paper cites for FER-2013, it says: "The resulting dataset contains 35887 images, with 4953 “Anger” images, 547 “Disgust” images, 5121 “Fear” images, 8989 “Happiness” images, 6077 “Sadness” images, 4002 “Surprise” images, and 6198 “Neutral” images." This is consistent with the numbers on the Kaggle page but different from the numbers reported in the CLIP paper.

jongwook commented 2 years ago

Hi, thanks for pointing out some of the details we glossed over or omitted; upon investigating, we found the following:

  1. Facial Emotion Recognition 2013: We found an error in the table-generation script that reported smaller numbers than it was supposed to; you can use the official numbers. We had a similar issue with the UCF-101 dataset.
  2. STL-10: We reported the average over the 10 pre-defined training folds provided by the official source (see the sketch after this list).
  3. EuroSAT: We realize the paper is missing critical reproducibility info on EuroSAT; given the lack of official splits, and to make the dataset class-balanced, we randomly sampled 500 train/validation/test images for each class. Below is the code for deterministically sampling those images.
root = f"{DATA_ROOT}/eurosat/2750"
seed = 42
random.seed(seed)
train_paths, valid_paths, test_paths = [], [], []
for folder in [os.path.basename(folder) for folder in sorted(glob.glob(os.path.join(root, "*")))]:
    keep_paths = random.sample(glob.glob(os.path.join(root, folder, "*")), 1500)
    keep_paths = [os.path.relpath(path, root) for path in keep_paths]
    train_paths.extend(keep_paths[:500])
    valid_paths.extend(keep_paths[500:1000])
    test_paths.extend(keep_paths[1000:])

We could have used a better setup, such as mean-per-class accuracy using all available data, and we would encourage future studies to do so. Note, however, that the comparisons in the paper used this same subset across all models, so their relative scores can still be considered “fair”.

  4. RESISC-45: Similar to EuroSAT, we used a custom split given the lack of an official one:
```python
import glob
import os
import random

# assumes DATA_ROOT points at the RESISC45 images and `split` is one of 'train', 'valid', 'test'
root = f"{DATA_ROOT}/resisc45"
seed = 42
paths = sorted(glob.glob(os.path.join(root, "*.jpg")))
random.seed(seed)
random.shuffle(paths)
if split == 'train':
    paths = paths[:len(paths) // 10]                         # first 10% (3,150 images)
elif split == 'valid':
    paths = paths[len(paths) // 10:(len(paths) // 10) * 2]   # next 10%
elif split == 'test':
    paths = paths[(len(paths) // 10) * 2:]                   # remaining 80%
else:
    raise NotImplementedError
```
  5. GTSRB: As @pj-ms found in #156, it turns out that we used an inconsistent train/test split.
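
For the STL-10 protocol mentioned in item 2, here is a minimal sketch of averaging accuracy over the 10 pre-defined folds, assuming torchvision's `STL10` dataset and a hypothetical `fit_and_score` linear-probe helper (the actual evaluation code used for the paper may differ):

```python
import numpy as np
from torchvision.datasets import STL10

from my_probe import fit_and_score  # hypothetical helper: trains a linear probe, returns test accuracy

test_set = STL10(root="data", split="test", download=True)
accuracies = []
for fold in range(10):
    # each pre-defined fold contains 1,000 labeled training images
    train_set = STL10(root="data", split="train", folds=fold, download=True)
    accuracies.append(fit_and_score(train_set, test_set))

print(f"STL-10 accuracy (mean over 10 folds): {np.mean(accuracies):.4f}")
```
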
pj-ms commented 2 years ago

Hi Jong, thanks so much for all the information! It is super helpful.

I have some questions about two more datasets and would really appreciate it if you could help. Thanks in advance.

Birdsnap: the official site only provides the image URLs. When using the download script that comes with the official dataset, I ended up with "NEW_OK:40318, ALREADY_OK:0, DOWNLOAD_FAILED:5030, SAVE_FAILED:0, MD5_FAILED:4481, MYSTERY_FAILED:0.". Have you folks experienced similar problems?

CLEVR (Counts): the CLIP paper says "2,500 random samples of the CLEVR dataset (Johnson et al., 2017)", while the official data site says "A training set of 70,000 images and 699,989 questions, A validation set of 15,000 images and 149,991 questions, A test set of 15,000 images and 14,988 questions". The original dataset appears to be a VQA dataset, but judging from the prompts and the wording in the paper ("counting objects in synthetic scenes (CLEVRCounts)"), it seems it was transformed into a counting-classification dataset. Could you share a bit about how this was done, and do you happen to still have the sampling script? (A rough sketch of what I imagine is below.) Thanks

[image]
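
For reference, here is a minimal sketch of how I imagine the counting labels could be derived, assuming they come from the number of objects in each scene's annotation in the CLEVR scenes JSON (just my guess at the transformation, not necessarily what was done for the paper):

```python
import json
import random

# assumes the CLEVR v1.0 archive has been extracted locally (path is illustrative)
with open("CLEVR_v1.0/scenes/CLEVR_val_scenes.json") as f:
    scenes = json.load(f)["scenes"]

random.seed(0)  # arbitrary seed, just to make the example deterministic
sampled = random.sample(scenes, 2500)

# label each image with its object count; CLEVR scenes contain between 3 and 10 objects
examples = [(scene["image_filename"], len(scene["objects"])) for scene in sampled]
print(examples[:3])
```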

Thanks!

meigaoms commented 2 years ago

@jongwook Thank you so much for sharing these details. I have two more detailed questions.

For EuroSAT and RESISC45, is seed=42 always used for the deterministic sampling in your experiments? Table 9 lists the RESISC45 train size as 3,150, but with your code I get a train set of 3,150 images and a validation set of 3,150 images. Are they supposed to be combined into a total train set of 6,300?

We would like to make our benchmarking setup comparable to CLIP if possible. However, when I run the sampled EuroSAT with a few models (ResNet50, ResNet101, efficientnet_b0, and CLIP ViT-B/32), the scores I get are all about 2.3~4.6% lower than their equivalents in Table 10. The RESISC45 scores with the same models are close to those in Table 10, but about 5% higher for ViT_base_patch16_224.
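
In case it helps with debugging, here is a quick sanity check I have been using to compare splits across machines, assuming `train_paths`, `valid_paths`, and `test_paths` come from the EuroSAT sampling snippet above; if the sampling is fully deterministic, the fingerprints should match everywhere:

```python
import hashlib

def split_fingerprint(paths):
    """Hash the sorted relative paths so splits can be compared across machines."""
    joined = "\n".join(sorted(paths)).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()[:12]

# train_paths / valid_paths / test_paths come from the EuroSAT sampling code above
for name, paths in [("train", train_paths), ("valid", valid_paths), ("test", test_paths)]:
    print(name, len(paths), split_fingerprint(paths))
```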