We should make it possible to specify different splits for the same dataset.
This avoids the need to re-"prep" a dataset every time we want a different split. A dataset will just be a set of files in a folder, without sub-directories for "train"/"val"/"test", and the splits will be declared in a separate file in that directory.
Dataset/datapipe classes should accept a `splits_path` argument that defaults to `None`.
If `splits_path` is `None`, the datapipe class looks in a default location for a single splits file, and raises a `FileNotFoundError` if it doesn't find one.
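A minimal sketch of that lookup, to make the proposed behavior concrete. The function name `resolve_splits_path` and the default filename `splits.json` are hypothetical; nothing in this issue fixes them yet. The sketch also assumes that an explicitly passed path that does not exist should raise the same error.

```python
from __future__ import annotations

import pathlib

# Hypothetical default filename; the actual default location is still to be decided.
DEFAULT_SPLITS_FILENAME = "splits.json"


def resolve_splits_path(
    dataset_path: str | pathlib.Path,
    splits_path: str | pathlib.Path | None = None,
) -> pathlib.Path:
    """Return the splits file to use for the dataset at ``dataset_path``.

    If ``splits_path`` is None, fall back to the default location
    inside the dataset directory.
    """
    dataset_path = pathlib.Path(dataset_path)
    if splits_path is None:
        splits_path = dataset_path / DEFAULT_SPLITS_FILENAME
    else:
        splits_path = pathlib.Path(splits_path)

    # Assumption: a missing file is an error whether the path was
    # passed explicitly or came from the default location.
    if not splits_path.is_file():
        raise FileNotFoundError(f"splits file not found: {splits_path}")
    return splits_path
```

A dataset class could then call `resolve_splits_path(dataset_path, splits_path)` in its constructor, so that every class handles the `None` default the same way.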
The `splits_path` will be distinct from what we now call `dataset_csv_path`. It will be a JSON file, basically metadata, that declares not only what we now call `dataset_csv_path` but also any other paths needed for a split. In the case of a frame classification dataset, this includes the vectors of sample IDs and of indices within each sample.
We should probably rename `dataset_csv_path` to something like `inputs_targets_paths_csv` for clarity.
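To illustrate the metadata file described above, here is a rough sketch of what a splits file for a frame classification dataset might contain, and how a dataset class might read it. All key names and filenames are hypothetical (`inputs_targets_paths_csv` follows the rename suggested above, and the other two keys stand in for the sample ID and index vectors). The sketch assumes that every value is a path relative to the dataset directory.

```python
import json
import pathlib

# Hypothetical contents of a splits file; key names and filenames are
# placeholders, not a settled schema.
EXAMPLE_SPLITS_METADATA = {
    "inputs_targets_paths_csv": "inputs-targets-paths.csv",
    "sample_ids_path": "sample-ids.npy",
    "inds_in_sample_path": "inds-in-sample.npy",
}


def load_split_paths(splits_path):
    """Load a splits JSON file and resolve its paths.

    Assumes every value in the file is a path relative to the
    directory that contains the splits file.
    """
    splits_path = pathlib.Path(splits_path)
    with splits_path.open("r") as fp:
        metadata = json.load(fp)
    dataset_dir = splits_path.parent
    return {key: dataset_dir / value for key, value in metadata.items()}
```

With this layout, two different splits of the same dataset are just two JSON files that point at different CSV and vector files in the same folder, so no files need to be copied or moved.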
So we'll need to:
- [ ] add `splits_path` to dataset classes
- [ ] modify `prep.frame_classification` so that it does not make split sub-directories
Related to #748