Automate splitting datasets into training, test and validation sets for ML in build

quiltdata / quilt

Quilt is a data mesh for connecting people with actionable data

https://quiltdata.com

Apache License 2.0


Automate splitting datasets into training, test and validation sets for ML in build #467

Closed kevinemoore closed 4 years ago

kevinemoore commented 6 years ago

For data packages used in machine learning, it would be useful for Quilt build to support splitting inputs into fixed sets for model training, testing, and validation. For structured data, the three sets (training, test, and validation) could be children of a common parent, so that the entire dataset remains available (by calling _data on the parent). Thanks to @rhiever for the suggestion.
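To make the proposal concrete, here is a minimal sketch in plain Python of what a fixed three-way split could look like. It is not Quilt's build implementation: `split_dataset`, `parent_data`, the 70/15/15 fractions, and the seed are all hypothetical names and values chosen for illustration. The fixed seed is what makes the sets "fixed", meaning that rebuilding yields the same split.

```python
import random

def split_dataset(rows, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle rows once with a fixed seed and cut them into three fixed sets.

    Hypothetical helper for illustration; not part of Quilt.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same split
    n = len(shuffled)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return {
        "training": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        # the test set takes whatever remains, so no row is dropped
        "test": shuffled[n_train + n_val:],
    }

def parent_data(splits):
    """Concatenate the child sets to recover the entire dataset.

    Stands in for reading the whole dataset from the common parent node.
    """
    return [row for name in ("training", "validation", "test")
            for row in splits[name]]

splits = split_dataset(range(100))
```

Because every row lands in exactly one child, reading from the parent is just the union of the children, and nothing needs to be stored twice.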

rhiever commented 6 years ago

Great! It could also be useful to specify k-fold cross-validation splits.

This diagram shows a fairly standard training-validation-test split:

and this diagram shows k-fold cross-validation splits:

If k-fold cross-validation is used, typically the "training" and "validation" splits from the first diagram are combined into just the "training" split (leaving only the "training" and "test" splits), and the "training" split is then divided into the k-fold cross-validation splits.
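The k-fold arrangement described above could be sketched as follows. This is an illustrative example using only the standard library, not an existing Quilt feature; `train_test_kfold`, `cv_rounds`, and the default `test_fraction`, `k`, and `seed` values are assumptions made for the example.

```python
import random

def train_test_kfold(rows, test_fraction=0.2, k=5, seed=0):
    """Hold out a test set, then divide the remaining training rows into k folds.

    Hypothetical helper for illustration; not part of Quilt.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    test, training = shuffled[:n_test], shuffled[n_test:]
    # Fold i takes every k-th training row starting at index i,
    # so fold sizes differ by at most one row.
    folds = [training[i::k] for i in range(k)]
    return training, test, folds

def cv_rounds(folds):
    """Yield (fit_rows, held_out_rows) pairs, one per fold."""
    for i, held_out in enumerate(folds):
        fit = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield fit, held_out

training, test, folds = train_test_kfold(range(50))
rounds = list(cv_rounds(folds))
```

Each round fits on k-1 folds and validates on the remaining one; the test set is never touched during cross-validation.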

akarve commented 4 years ago

Quilt 3, combined with the PyTorch or TF APIs, now allows direct interaction with the file system and provides primitives for arbitrary file organization.