Enhance cross validation

manuelgitgomes commented 3 months ago

Enhance cross validation by using scikit-learn functions.

manuelgitgomes commented 2 months ago

Hello @miguelriemoliveira and @Kazadhum.

I have now added cross validation support in batch execution on the branch dev/cross-validation.

The user now defines type of cross validation and its parameters in data.yaml: https://github.com/lardemua/atom/blob/b61df4d9019f76c921ac605441ca162c20c77e50/atom_batch_execution/experiments/rrbot_example/data.yml#L23-L26

Right now, fold creation is supported with StratifiedKFold, KFold, LeaveOneOut, and StratifiedShuffleSplit from scikit-learn. The classes used for stratification are the combination of sensors and patterns detected, as written in: https://github.com/lardemua/atom/blob/b61df4d9019f76c921ac605441ca162c20c77e50/atom_batch_execution/scripts/batch_execution#L30-L54

This then creates a new auto_rendered.yaml, with division of each run in folds, using -csf to define the collections used by the fold:

https://github.com/lardemua/atom/blob/b61df4d9019f76c921ac605441ca162c20c77e50/atom_batch_execution/experiments/rrbot_example/auto_rendered.yaml#L27-L44

process_results is also adapted to run with these folds!

I have done a test with rrbot and everything seemed nice, can you test on your machines?

miguelriemoliveira commented 2 months ago

Hi @manuelgitgomes ,

looks great. Thanks.

@Kazadhum I do not have a lot of time right now. Can you test it please?

One question: if we want the run the old way, is it possible or not?

Kazadhum commented 2 months ago

Hi@manuelgitgomes and @miguelriemoliveira! I'll test it as soon as I can. I'll try to tell you something today.

Kazadhum commented 2 months ago

Hi @manuelgitgomes and @miguelriemoliveira ! I just tested with stratified k-folding and it seems to be working correctly at first glance, as well as process_results.

BTW, is it still the case that it doesn't make sense to have the collection column in the processed results, since it also averages the collection number?

I can run more thorough tests next week, if necessary

manuelgitgomes commented 2 months ago

One question: if we want the run the old way, is it possible or not?

Right now, you can't, but I can change that easily enough (I think). On it.

BTW, is it still the case that it doesn't make sense to have the collection column in the processed results, since it also averages the collection number?

It doesn't, I believe. I can delete it.

Kazadhum commented 2 months ago

BTW, is it still the case that it doesn't make sense to have the collection column in the processed results, since it also averages the collection number?

It doesn't, I believe. I can delete it.

Now that I mention it, I think it doesn't make much sense to have any lines at all besides the "Average" line, since all other values are "meaningless" (as they belong to different collections). Do you agree?

manuelgitgomes commented 2 months ago

Now that I mention it, I think it doesn't make much sense to have any lines at all besides the "Average" line, since all other values are "meaningless" (as they belong to different collections). Do you agree?

In the processed results? Sure, makes sense. Can also be removed, but I don't see any wrong in leaving it there, right?

manuelgitgomes commented 2 months ago

Changed batch execution and process results to allow for an empty cross validation definition in data.yml. This guarantees backwards compatibility, which was tested with old_template.yml.j2 Also removed the "Collection #" column from the processed results.

miguelriemoliveira commented 2 months ago

Looks great!

manuelgitgomes commented 2 months ago

This has been merged to main, closing.

lardemua / atom

Enhance cross validation #899