We'd love to use your excellent library for training and explaining prediction models on our EHR data.
However, we have multiple rows/prediction times for each patient and therefore need to stratify the CV by patient-id to avoid information leakage between CV folds.
In `evaluate_estimator` (`autoprognosis/utils/tester.py`), it seems that you are currently using `StratifiedKFold` to create folds when training models. `sklearn.model_selection.StratifiedGroupKFold` should do the trick, but it requires adding a group parameter to the user-facing API.
Is this something you'd consider adding?
I'd be happy to contribute a PR if desired.