vocalpy / vak

A neural network framework for researchers studying acoustic communication
https://vak.readthedocs.io
BSD 3-Clause "New" or "Revised" License

ENH: Make it possible to specify different splits for datasets #749

Open NickleDave opened 5 months ago

NickleDave commented 5 months ago

Related to #748

We should make it possible to specify different splits for the same dataset. This avoids the need to re-"prep" a dataset every time we want a different split; a dataset will just be a set of files in a folder, without sub-directories for "train"/"val"/"test", and the splits will be declared in a separate file in that directory.

Dataset/datapipe classes should accept a splits_path argument that defaults to None. If splits_path is None, the datapipe class looks in a default location for a single splits file, and raises a FileNotFoundError if it's not found.
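
A minimal sketch of the lookup described above, just to make the behavior concrete. The names `resolve_splits_path` and `DEFAULT_SPLITS_FILENAME`, and the choice of `splits.json` as the default location, are assumptions for illustration and not part of vak. Raising `FileNotFoundError` for an explicitly passed path that doesn't exist is also an assumption; the proposal only specifies the error for the default location.

```python
import pathlib

# Hypothetical default file name; the issue does not say what the default location is.
DEFAULT_SPLITS_FILENAME = "splits.json"


def resolve_splits_path(dataset_path, splits_path=None):
    """Return the splits file to use for the dataset at ``dataset_path``.

    If ``splits_path`` is None, fall back to a default location
    inside the dataset directory.
    """
    dataset_path = pathlib.Path(dataset_path)
    if splits_path is None:
        # No path given: look in the (assumed) default location.
        splits_path = dataset_path / DEFAULT_SPLITS_FILENAME
    else:
        splits_path = pathlib.Path(splits_path)

    if not splits_path.is_file():
        raise FileNotFoundError(f"splits file not found: {splits_path}")
    return splits_path
```

A datapipe class could call something like this in its constructor, so that `splits_path=None` keeps working for datasets that only ever have one split file.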

The splits_path will be distinct from what we now call dataset_csv_path. It will point to a JSON file, basically metadata, that declares not only what we now call dataset_csv_path but also any other paths needed for a split. In the case of a frame classification dataset, this includes the vectors of sample IDs and indices within each sample.
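
To make the idea of a splits file concrete, here is one possible shape for it, with a small loader. Everything about the layout is an assumption: the key names (`inputs_targets_paths_csv`, `splits`, `sample_ids`, `inds_in_sample`), the file names, and the helper `load_split_paths` are hypothetical, and paths are assumed to be relative to the directory that contains the splits file.

```python
import json
import pathlib

# Hypothetical contents of a splits file for a frame classification dataset.
EXAMPLE_SPLITS = {
    "inputs_targets_paths_csv": "inputs_targets_paths.csv",
    "splits": {
        "train": {
            "sample_ids": "train_sample_ids.npy",
            "inds_in_sample": "train_inds_in_sample.npy",
        },
        "val": {
            "sample_ids": "val_sample_ids.npy",
            "inds_in_sample": "val_inds_in_sample.npy",
        },
    },
}


def load_split_paths(splits_path, split):
    """Return a dict of absolute-ized paths needed for one split.

    Paths in the JSON file are resolved relative to the dataset
    directory, i.e. the parent directory of the splits file.
    """
    splits_path = pathlib.Path(splits_path)
    with splits_path.open() as fp:
        metadata = json.load(fp)

    if split not in metadata["splits"]:
        raise KeyError(f"split '{split}' not declared in {splits_path}")

    dataset_dir = splits_path.parent
    # The csv is shared by all splits; the vectors are specific to one split.
    paths = {
        "inputs_targets_paths_csv": dataset_dir / metadata["inputs_targets_paths_csv"]
    }
    for name, relative_path in metadata["splits"][split].items():
        paths[name] = dataset_dir / relative_path
    return paths
```

With a layout like this, a second splits file in the same directory could declare a different train/val/test division over the same files, without re-running prep.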

Probably we should rename dataset_csv_path to something like inputs_targets_paths_csv for clarity.

So we'll need to: