Spurred by the need to either document mask_ckpt.txt or hide it from the user, this adds settings to DPP for controlling the generation and use of the training/testing/validation masks it stores.
A new flag (force_split_partition) and a corresponding setter (force_split_shuffle) have been added to control whether a new mask is always generated instead of loading a previous one. The flag defaults to the previous behavior of loading the split from an existing mask.
split_raw_data was changed to accommodate the flag. It gained a new parameter that takes the force_split_partition flag and uses it to decide whether to try to read a pre-existing mask. The code for reading and generating masks, meanwhile, was factored out into get_split_mask so that separate tasks (making a mask vs. using it to split data) live in separate functions.
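As a rough illustration of the separation described above, here is a minimal standalone sketch of the load-or-regenerate logic. It is not the code in this PR: the signatures, the integer label values, the one-label-per-line file format, the seed parameter, and the fallback to regeneration when the stored mask length does not match are all assumptions made for the example.

```python
import os
import random

# Hypothetical label values; the labels DPP actually stores may differ.
TRAIN, TEST, VALIDATION = 0, 1, 2


def get_split_mask(num_samples, test_split, validation_split, mask_path,
                   force_split_partition=False, seed=None):
    """Return a list of per-sample labels, reusing a stored mask if allowed."""
    # Reuse the stored mask unless the caller forces a fresh partition.
    if not force_split_partition and os.path.exists(mask_path):
        with open(mask_path) as f:
            mask = [int(line) for line in f if line.strip()]
        # Assumption: a mask of the wrong length is stale, so regenerate it.
        if len(mask) == num_samples:
            return mask

    num_test = int(num_samples * test_split)
    num_val = int(num_samples * validation_split)
    num_train = num_samples - num_test - num_val
    mask = [TEST] * num_test + [VALIDATION] * num_val + [TRAIN] * num_train
    random.Random(seed).shuffle(mask)

    # Store the mask so that later runs can repeat the same split.
    with open(mask_path, "w") as f:
        f.write("\n".join(str(label) for label in mask))
    return mask


def split_raw_data(data, mask):
    """Partition data into (train, test, validation) according to the mask."""
    train = [x for x, label in zip(data, mask) if label == TRAIN]
    test = [x for x, label in zip(data, mask) if label == TEST]
    val = [x for x, label in zip(data, mask) if label == VALIDATION]
    return train, test, val
```

In this sketch, a second call with the same mask_path returns the stored split unchanged, while passing force_split_partition=True discards it and writes a new one.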
Alongside these changes, tests were added not just for force_split_shuffle and get_split_mask, but also for the split-defining functions set_test_split and set_validation_split. Some documentation was also added to the leaf counting tutorial to briefly explain how DPP stores and reuses dataset splits for repeatable training.
The test suite, including the additions, passes, and training works for all of the current problem types with the changes to dataset splitting.