We've been making good progress so far.
However, I've found that manually validating dataset quality is tedious and hard to manage. This issue is to create a uniform validation procedure for our datasets.
Criteria (more criteria are welcome!)
criterion 0: info.json exists and is valid (contains task, label, id_col, eval_metric)
criterion 1: no overlaps among train/dev/test
criterion 2: train/dev/test share the same feature space; split sizes are 80% / 5% / 15%
criterion 3: the train and test sets each contain all labels (i.e. no zero-shot labels)
criterion 4: every referenced image path exists
criterion x: no data leakage (i.e. no single feature that directly yields 100% accuracy); this has not been implemented in the validate_dataset script yet
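The criteria above (minus criterion x) could be collected into a single check function. The sketch below is illustrative only, not the actual dutils/validate_dataset.py: the function name `validate_splits`, the pandas-based interface, the optional `image_path` column, and the 2-point tolerance on split sizes are all assumptions.

```python
import os

import pandas as pd

REQUIRED_INFO_KEYS = {"task", "label", "id_col", "eval_metric"}


def validate_splits(info, train, dev, test):
    """Return a list of human-readable failures; an empty list means all checks pass."""
    errors = []

    # criterion 0: info.json contains the required keys
    missing = REQUIRED_INFO_KEYS - info.keys()
    if missing:
        errors.append(f"info.json is missing keys: {sorted(missing)}")
        return errors  # the remaining checks need these keys

    id_col, label = info["id_col"], info["label"]
    splits = {"train": train, "dev": dev, "test": test}

    # criterion 1: no id overlaps among train/dev/test
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = set(splits[a][id_col]) & set(splits[b][id_col])
            if overlap:
                errors.append(f"{len(overlap)} ids appear in both {a} and {b}")

    # criterion 2: same feature space, and split sizes near 80% / 5% / 15%
    if not (list(train.columns) == list(dev.columns) == list(test.columns)):
        errors.append("train/dev/test do not share the same columns")
    total = sum(len(df) for df in splits.values())
    for name, target in [("train", 0.80), ("dev", 0.05), ("test", 0.15)]:
        frac = len(splits[name]) / total
        if abs(frac - target) > 0.02:  # 2-point tolerance, an assumption
            errors.append(f"{name} split is {frac:.0%}, expected ~{target:.0%}")

    # criterion 3: train and test each contain every label (no zero-shot)
    all_labels = set().union(*(set(df[label]) for df in splits.values()))
    for name in ("train", "test"):
        absent = all_labels - set(splits[name][label])
        if absent:
            errors.append(f"{name} is missing labels: {sorted(absent)}")

    # criterion 4: referenced image paths exist (assumes an "image_path"
    # column; the check is skipped when the column is absent)
    for name, df in splits.items():
        if "image_path" in df.columns:
            broken = [p for p in df["image_path"] if not os.path.exists(p)]
            if broken:
                errors.append(f"{name} references {len(broken)} missing image files")

    return errors
```

Returning a list of failures rather than raising on the first one lets the script report every problem with a dataset in a single run.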
Script for validation
dutils/validate_dataset.py
Usage
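There are no usage docs yet; a plausible invocation might look like the following. The positional dataset-directory argument is an assumption — check the script's help output for its real interface.

```shell
# Hypothetical: point the script at a directory holding info.json
# plus the train/dev/test splits.
python dutils/validate_dataset.py path/to/dataset
```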