iterative / ldb-resources


`Workflow`: Missing guidance for splitting datasets #14

Open daavoo opened 2 years ago

daavoo commented 2 years ago

In many practical machine learning workflows, splitting a dataset into subsets is a common operation.

For example, in the Data-Centric AI Competition, two different splits (train, validation) are expected to be submitted. Different strategies for generating those splits might be tried, and I would expect LDB to support these iterations.

I didn't find any guidance on how to perform these splitting iterations as part of the LDB workflow. Does the recommended workflow depend on https://github.com/iterative/ldb/issues/88 ?

daavoo commented 2 years ago

The way I currently implement this is by using `--pipe` and a Python "constant" as the source of truth.
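For anyone wanting to try the same approach, here is a rough sketch of what such a filter could look like. It assumes the script receives a JSON array of `[hash, data path, annotation path]` triples on stdin and prints a JSON array of the hashes to keep; that I/O contract is my assumption and should be checked against the LDB docs. `SPLITS`, `select`, and `main` are hypothetical names, not part of LDB.

```python
import json
import sys
from pathlib import Path

# Hypothetical "constant" acting as the source of truth:
# file stem -> split name. Objects not listed here are dropped.
SPLITS = {
    "img_001": "train",
    "img_002": "validation",
}


def select(items, split, splits=SPLITS):
    """Return the hashes of the items whose data file stem maps to `split`.

    `items` is assumed to be a list of [hash, data_path, annotation_path].
    """
    kept = []
    for obj_hash, data_path, _annotation_path in items:
        if splits.get(Path(data_path).stem) == split:
            kept.append(obj_hash)
    return kept


def main(split="train", inp=sys.stdin, out=sys.stdout):
    """Read the assumed JSON input, write the selected hashes as JSON."""
    items = json.load(inp)
    json.dump(select(items, split), out)
```

Calling `main(sys.argv[1])` at the bottom of the file would turn this into a command-line filter that takes the split name as its only argument. The downside is that the split lives in code rather than next to the data, which is part of why I moved away from it.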

Edit: I ended up handling this via an annotation field named `split`.
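A minimal sketch of how the `split` field could be written into the annotations before indexing, assuming one JSON annotation file per data object with a JSON object at the top level. The hash-based assignment, the 20% validation fraction, and the function names are illustrative choices of mine, not anything LDB prescribes.

```python
import hashlib
import json
from pathlib import Path


def assign_split(key, val_fraction=0.2):
    """Map a key to "train" or "validation" deterministically.

    The key is hashed to a number in [0, 1), so the same key always
    lands in the same split, regardless of what other files exist.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < val_fraction else "train"


def tag_annotations(annotation_dir, val_fraction=0.2):
    """Add a `split` field to every *.json file in `annotation_dir`.

    Files are rewritten in place. Returns the number of files per split.
    """
    counts = {"train": 0, "validation": 0}
    for path in sorted(Path(annotation_dir).glob("*.json")):
        annotation = json.loads(path.read_text())
        split = assign_split(path.stem, val_fraction)
        annotation["split"] = split
        path.write_text(json.dumps(annotation, indent=2))
        counts[split] += 1
    return counts
```

Since the split is stored in the annotation itself, trying a different strategy only means rewriting the field and re-indexing, and each subset can then be selected by filtering on `split`.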

volkfox commented 2 years ago

The current split workflow involves the use of queries (to separate by JSON field), `--limit` (by count), or `--tag` (by tags).

We might want to give explicit examples. Another idea is to address multi-stage splits with recipes.