We've been making good progress so far.
However, I've found that manually validating dataset quality is tedious and hard to manage. This issue is to create a uniform validation procedure for our datasets.
Criteria (more criteria are welcome!)
criterion 0: info.json exists and is valid (contains task, label, id_col, eval_metric)
criterion 1: no overlaps among train/dev/test
criterion 2: train/dev/test share the same feature space; split sizes are 80% / 5% / 15%
criterion 3: the train and test sets each contain all labels (i.e. no zero-shot labels)
criterion 4: every referenced image path exists
criterion x: no data leakage (i.e. no single feature that directly yields 100% accuracy); this has not been implemented in the validate_dataset script yet
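The criteria above (minus criterion x) could be collected into a single check function. The sketch below is illustrative only, not the actual dutils/validate_dataset.py: the function name `validate_splits`, the pandas-based interface, the optional `image_path` column, and the 2-point tolerance on split sizes are all assumptions.

```python
import os

import pandas as pd

REQUIRED_INFO_KEYS = {"task", "label", "id_col", "eval_metric"}


def validate_splits(info, train, dev, test):
    """Return a list of human-readable failures; an empty list means all checks pass."""
    errors = []

    # criterion 0: info.json contains the required keys
    missing = REQUIRED_INFO_KEYS - info.keys()
    if missing:
        errors.append(f"info.json is missing keys: {sorted(missing)}")
        return errors  # the remaining checks need these keys

    id_col, label = info["id_col"], info["label"]
    splits = {"train": train, "dev": dev, "test": test}

    # criterion 1: no id overlaps among train/dev/test
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = set(splits[a][id_col]) & set(splits[b][id_col])
            if overlap:
                errors.append(f"{len(overlap)} ids appear in both {a} and {b}")

    # criterion 2: same feature space, and split sizes near 80% / 5% / 15%
    if not (list(train.columns) == list(dev.columns) == list(test.columns)):
        errors.append("train/dev/test do not share the same columns")
    total = sum(len(df) for df in splits.values())
    for name, target in [("train", 0.80), ("dev", 0.05), ("test", 0.15)]:
        frac = len(splits[name]) / total
        if abs(frac - target) > 0.02:  # 2-point tolerance, an assumption
            errors.append(f"{name} split is {frac:.0%}, expected ~{target:.0%}")

    # criterion 3: train and test each contain every label (no zero-shot)
    all_labels = set().union(*(set(df[label]) for df in splits.values()))
    for name in ("train", "test"):
        absent = all_labels - set(splits[name][label])
        if absent:
            errors.append(f"{name} is missing labels: {sorted(absent)}")

    # criterion 4: referenced image paths exist (assumes an "image_path"
    # column; the check is skipped when the column is absent)
    for name, df in splits.items():
        if "image_path" in df.columns:
            broken = [p for p in df["image_path"] if not os.path.exists(p)]
            if broken:
                errors.append(f"{name} references {len(broken)} missing image files")

    return errors
```

Returning a list of failures rather than raising on the first one lets the script report every problem with a dataset in a single run.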
Script for validation
dutils/validate_dataset.py
Usage
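There are no usage docs yet; a plausible invocation might look like the following. The positional dataset-directory argument is an assumption — check the script's help output for its real interface.

```shell
# Hypothetical: point the script at a directory holding info.json
# plus the train/dev/test splits.
python dutils/validate_dataset.py path/to/dataset
```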