vanderschaarlab / autoprognosis

A system for automating the design of predictive modeling pipelines tailored for clinical prognosis.
https://www.autoprognosis.vanderschaar-lab.com/
Apache License 2.0

Option for stratified CV by id #33

Closed HLasse closed 1 year ago

HLasse commented 1 year ago

We'd love to use your excellent library for training and explaining prediction models on our EHR data.

However, we have multiple rows (prediction times) for each patient, so we need to split the CV folds by patient ID, keeping all rows for a given patient in the same fold, to avoid information leakage between folds.

In evaluate_estimator (autoprognosis/utils/tester.py), it seems that you currently use StratifiedKFold to create folds when training models. sklearn.model_selection.StratifiedGroupKFold should do the trick, but it requires adding a group parameter to the user-facing API.

Is this something you'd consider adding?

I'd be happy to contribute a PR if desired.

bcebere commented 1 year ago

Hello @HLasse! Thank you for the feedback!

Yes, please open a PR for this issue, if possible. If not, I will look into it within the next few days or week.

Thanks!