ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Write out data split information as a separate file, i.e. splits.csv, separate from preprocessed data. #2375

Open justinxzhao opened 2 years ago

justinxzhao commented 2 years ago

At the moment, we don’t write the raw data splits, i.e. (row #, split #) pairs, to a separate file.

This would be useful when the preprocessed data is too large to write to disk, yet a user still wants to inspect offline which rows of their dataset were used in which data subsets of their modeling run.

One potential location for such metadata would be in the existing training_set_metadata.json file, or perhaps a separate splits.csv file.
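For illustration, here is a minimal sketch of what a standalone splits.csv with (row #, split #) pairs could look like. It is not Ludwig's implementation: `assign_splits`, `write_splits_csv`, and `read_splits_csv` are hypothetical helpers, the column names `row` and `split` are assumptions, and the random assignment only stands in for whatever splitter the run actually uses. The split ids follow Ludwig's usual convention of 0 = train, 1 = validation, 2 = test.

```python
import csv
import random

# Split ids, following Ludwig's usual convention.
SPLIT_NAMES = {0: "train", 1: "validation", 2: "test"}


def assign_splits(num_rows, probabilities=(0.7, 0.1, 0.2), seed=42):
    """Hypothetical helper: randomly assign each row index a split id.

    Stands in for the real splitter; only the output shape matters here.
    """
    rng = random.Random(seed)
    return rng.choices([0, 1, 2], weights=probabilities, k=num_rows)


def write_splits_csv(splits, path):
    """Hypothetical helper: write one (row, split) pair per line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["row", "split"])  # assumed column names
        for row_index, split_id in enumerate(splits):
            writer.writerow([row_index, split_id])


def read_splits_csv(path):
    """Read the file back as a {row index: split id} mapping."""
    with open(path, newline="") as f:
        return {int(r["row"]): int(r["split"]) for r in csv.DictReader(f)}


if __name__ == "__main__":
    splits = assign_splits(num_rows=1000)
    write_splits_csv(splits, "splits.csv")
    # Offline inspection: how many rows landed in each subset?
    for split_id, name in SPLIT_NAMES.items():
        print(name, splits.count(split_id))
```

A file like this stays small even when the preprocessed data does not, since it stores two integers per row rather than the encoded features.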

tgaddair commented 2 years ago

We do actually write this information when skip_saved_processed_inputs=False here. Note that this only applies when we are using a dataset from a file, as opposed to a dataframe. So perhaps it could be extended to support the latter.

justinxzhao commented 2 years ago

@tgaddair Ah, thanks for the catch! We should be sure to include this in our documentation.