Open justinxzhao opened 2 years ago
We do actually write this information when skip_saved_processed_inputs=False. Note that this only applies when the dataset is loaded from a file rather than passed as a dataframe, so it could perhaps be extended to support the latter.
@tgaddair Ah, thanks for the catch! We should be sure to include this in our documentation.
At the moment, we don’t write the raw data split assignments, i.e. (row #, split #) pairs, to a separate file.
This would be useful when the preprocessed data is too large to write to disk, yet a user still wants to inspect offline which rows of their dataset ended up in which data subset of their modeling run.
One potential location for such metadata would be the existing training_set_metadata.json file, or perhaps a separate splits.csv file.
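To make the proposal concrete, here is a minimal sketch of what a separate splits.csv could contain. It is not existing library code: the function names (assign_splits, write_splits_csv, read_splits_csv), the column names, the 0/1/2 split codes, and the random assignment are all assumptions for illustration only.

```python
import csv
import random

# Assumed split codes for this sketch: 0 = train, 1 = validation, 2 = test.
SPLIT_NAMES = {0: "train", 1: "validation", 2: "test"}


def assign_splits(num_rows, probabilities=(0.7, 0.1, 0.2), seed=42):
    """Return one split code per raw row, drawn with the given probabilities."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    return rng.choices(list(SPLIT_NAMES), weights=probabilities, k=num_rows)


def write_splits_csv(splits, path="splits.csv"):
    """Write (row, split) pairs so the assignment can be inspected offline."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["row", "split"])
        writer.writerows(enumerate(splits))  # row index, split code


def read_splits_csv(path="splits.csv"):
    """Load the file back into a {row: split} mapping."""
    with open(path, newline="") as f:
        return {int(r["row"]): int(r["split"]) for r in csv.DictReader(f)}
```

Calling write_splits_csv(assign_splits(1000)) would produce a two-column file with one line per raw row. That stays small even when the preprocessed data itself is too large to write to disk, and it works the same way whether the rows came from a file or a dataframe.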