tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Preferred way of handling missing test data #235

Closed jackd closed 5 years ago

jackd commented 5 years ago

What is the preferred way of implementing DatasetBuilders for datasets which have missing fields in a held-out test set?

What I've tried so far

  1. A simple option is to save labels with some meaningless values. This takes up unnecessary disk space (and, I imagine, reading time), though in most cases this is probably negligible.

  2. From what I can tell, different configs/splits on the same builder must share the same info, which means values cannot be left out based on the config/split. EDIT: apparently different configs can vary the info.

  3. Different builders could be implemented, but that is, in my opinion: (a) counter-intuitive from a user's perspective — split is supposed to handle this, so that is where users will look; and (b) problematic because training/test data are often packaged together, and separate builders (as far as I'm aware) cannot share the same downloaded files.

It would be nice if... there were a clear policy on how to handle such situations.

rsepassi commented 5 years ago

So far we've been trying to be explicit about missing values in a feature-specific way. For example, ClassLabel should have an additional class indicating "missing". For text, it can simply be the empty string "". etc. What's the case you're considering?
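The explicit-missing convention described above can be sketched without any TFDS dependency: treat "missing" as a real class with its own integer id, so every example — including held-out test examples — has a valid, encodable label. The helper names below are hypothetical, purely for illustration:

```python
# Sketch of the "explicit missing value" convention: "missing" is a
# real class in the vocabulary, not a magic out-of-range number.
MISSING = "missing"

def build_vocab(names):
    """Map class names to integer ids, reserving the last id for MISSING."""
    return {name: i for i, name in enumerate(list(names) + [MISSING])}

def encode_label(vocab, name):
    """Encode a label; an absent label maps to the explicit MISSING class."""
    return vocab[name if name is not None else MISSING]

def encode_text(text):
    """For text features, a missing value is simply the empty string."""
    return text if text is not None else ""
```

For example, `build_vocab(["cat", "dog"])` reserves id 2 for missing labels, so `encode_label(vocab, None)` returns 2 rather than failing or storing junk.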

jackd commented 5 years ago

Human pose estimation: MPII Human Pose (PR) has multiple person annotations per image including center, scale, joint annotations... In the held-out test set joint annotations are not given (but center/scale are).

Currently putting in all zeros (maybe they should be -1s...).

rsepassi commented 5 years ago

I think -1 for missing positive numerical values seems sensible.
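With -1 as the sentinel for missing positive coordinates, downstream code can mask out missing annotations rather than treating them as real joint positions. A minimal sketch of that masking, with a hypothetical helper and no TFDS dependency:

```python
# Sentinel value marking a missing joint annotation (valid coordinates
# are assumed to be non-negative, as in pixel-space pose datasets).
SENTINEL = -1.0

def mean_joint_error(pred, target):
    """Mean L1 error over joints, skipping sentinel-valued targets.

    pred, target: lists of (x, y) joint coordinates. A target of
    (SENTINEL, SENTINEL) marks an annotation absent from the split.
    """
    errors = []
    for (px, py), (tx, ty) in zip(pred, target):
        if tx == SENTINEL and ty == SENTINEL:
            continue  # annotation missing, e.g. held-out test set
        errors.append(abs(px - tx) + abs(py - ty))
    return sum(errors) / len(errors) if errors else 0.0
```

If every joint in an example is missing, the helper returns 0.0, so fully unannotated test examples contribute nothing to the metric.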