paris-saclay-cds / ramp-workflow

Toolkit for building predictive workflows on top of pydata (pandas, scikit-learn, pytorch, keras, etc.).
https://paris-saclay-cds.github.io/ramp-docs/
BSD 3-Clause "New" or "Revised" License

Adding --data-label for easy testing on multiple data sets #244

Closed kegl closed 3 years ago

kegl commented 4 years ago

During testing, especially when using RAMP to manage experiments for a paper, we often want to run the exact same submissions on several data sets. This is currently possible with the --ramp-data-dir switch, but it is awkward: the data sets need to be in data/<data_label>/data (instead of data/<data_label>). More importantly, it is currently impossible to also save all the training outputs (--save-output) for each data set: ramp-test overwrites the submissions/<submission>/training_output folder each time it is executed, whatever --ramp-data-dir is.

To solve this issue, we should introduce an optional --data-label switch so that ramp-test creates a separate subfolder submissions/<submission>/training_output/<data_label> for each label. The data label will also be passed to problem.get_train_data and problem.get_test_data, so the user can flexibly choose which data to load depending on the label.
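As a rough illustration of the proposed folder layout, the output path could be derived as below. This is only a sketch: the helper name `training_output_path` and its arguments are hypothetical and not part of ramp-workflow, and the folder name `training_output` is assumed from the paths quoted above.

```python
import os


def training_output_path(submission_path, data_label=None):
    """Return the folder where training outputs would be saved.

    Hypothetical helper (not a ramp-workflow function).
    Without a label, the existing location is kept unchanged.
    With a label, a per-label subfolder is appended.
    """
    # Folder name assumed from the paths discussed in this issue.
    base = os.path.join(submission_path, "training_output")
    if data_label is None:
        return base
    return os.path.join(base, data_label)
```

With this layout, running the same submission with two different labels would write to two sibling subfolders instead of overwriting a single one.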

To keep RAMP backward compatible, the behavior should not change when --data-label is not specified. This means that the testing script should handle two different signatures for problem.get_train_data and problem.get_test_data, depending on whether --data-label is specified.
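One possible way to handle the two signatures in the testing script is sketched below. The helper `call_data_getter` and the two example getters are hypothetical; the only assumption about existing problems is that their getters accept a `path` keyword argument, and the `data_label` keyword is the addition proposed here, not an existing API.

```python
def call_data_getter(getter, ramp_data_dir=".", data_label=None):
    """Call problem.get_train_data / get_test_data with the right signature.

    Hypothetical helper: when no label is given, the getter is called
    exactly as before, so existing problem.py files keep working.
    """
    if data_label is None:
        return getter(path=ramp_data_dir)
    # Only problems that opt in need to accept the extra keyword.
    return getter(path=ramp_data_dir, data_label=data_label)


# Example getter with the old signature (stand-in for an existing problem.py).
def legacy_get_train_data(path="."):
    return ("legacy", path)


# Example getter with the proposed label-aware signature.
def labeled_get_train_data(path=".", data_label=None):
    return ("labeled", path, data_label)
```

A problem written before the change is never passed `data_label`, so it only fails if the user explicitly requests a label that the problem does not support.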