Interpretability module: sparse linear models via LASSO

luigibonati commented 7 months ago

Description

Add sparse linear models optimized via LASSO as tools for interpreting the CVs and/or the resulting states, as done here: https://pubs.acs.org/doi/abs/10.1021/acs.jctc.2c00393.

I started from the notebook that @pietronvll and I wrote. We implemented both the classification case (as done in stateinterpreter) and the regression case. A few changes:

I extended the functions to also work in the multi-class case.
I changed the scoring function to balanced_accuracy_score instead of the standard accuracy score, so that the results remain meaningful when the datasets are imbalanced.
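To illustrate why the balanced score matters on imbalanced data, here is a minimal sketch on made-up labels (not data from this PR). It compares scikit-learn's accuracy_score and balanced_accuracy_score for a trivial predictor that always returns the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced labels (made-up): 90 samples of class 0, 10 of class 1.
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Standard accuracy is the fraction of correct predictions.
acc = accuracy_score(y_true, y_pred)

# Balanced accuracy is the mean of the per-class recalls.
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy = {acc:.2f}, balanced accuracy = {bal_acc:.2f}")
# prints: accuracy = 0.90, balanced accuracy = 0.50
```

The standard score rewards ignoring the minority class, while the balanced score (recall 1.0 for class 0, recall 0.0 for class 1) exposes it.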

The regression and classification functions have (almost) the same signature: both return the optimized estimator together with the list of features with non-zero coefficients and the coefficients themselves. I also wrote separate functions to plot the results (coefficient paths, score and number of features).
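The return convention described above can be sketched for the regression case as follows. This is only an illustration built on sklearn.linear_model.LassoCV: the helper name sparse_regression_sketch, its arguments and the synthetic data are hypothetical, and the functions actually added in this PR may differ.

```python
import numpy as np
from sklearn.linear_model import LassoCV


def sparse_regression_sketch(X, y, feature_names, cv=5, seed=0):
    """Hypothetical helper (not the PR's function): fit LassoCV and
    return the estimator plus the (name, coefficient) pairs of the
    features whose coefficient is non-zero."""
    estimator = LassoCV(cv=cv, random_state=seed).fit(X, y)
    nonzero = np.flatnonzero(estimator.coef_)
    selected = [(feature_names[i], float(estimator.coef_[i])) for i in nonzero]
    return estimator, selected


# Synthetic data (made-up): only x0 and x3 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=200)
names = [f"x{i}" for i in range(10)]

estimator, selected = sparse_regression_sketch(X, y, names)
print("selected alpha:", estimator.alpha_)
for name, coef in selected:
    print(f"{name}: {coef:+.3f}")
```

The cross-validated regularization strength is available as estimator.alpha_, and the list of selected features is what one would inspect to interpret the fitted variable.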

Todos

Notable points that this PR has either accomplished or will accomplish.

[x] Function: lasso_classification (based on sklearn.linear_model.LogisticRegressionCV)
[x] Function: lasso_regression (based on sklearn.linear_model.LassoCV)
[x] Plotting functions
[x] Docstrings
[x] Regtests
[x] Raise error when importing module if scikit-learn is not installed
[x] Add documentation pages
[x] Add scikit-learn dependency to GA
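
For the classification case, a multi-class sketch along the lines of the checklist above could look like the following. It assumes a scikit-learn version in which LogisticRegressionCV accepts penalty="l1" with solver="saga"; the helper name sparse_classification_sketch and the synthetic data are hypothetical and do not reproduce lasso_classification from this PR.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import balanced_accuracy_score


def sparse_classification_sketch(X, y, feature_names, cv=5, seed=0):
    """Hypothetical helper (not the PR's function): L1-penalized logistic
    regression with the regularization strength chosen by cross-validation
    on the balanced accuracy."""
    estimator = LogisticRegressionCV(
        Cs=10,
        cv=cv,
        penalty="l1",
        solver="saga",  # a solver that supports the L1 penalty
        scoring="balanced_accuracy",
        max_iter=5000,
        random_state=seed,
    ).fit(X, y)
    # With three or more classes, coef_ has one row per class.
    # (In the binary case coef_ has a single row, so this loop would
    # only report the first class label.)
    selected = {}
    for label, coefs in zip(estimator.classes_, estimator.coef_):
        idx = np.flatnonzero(coefs)
        selected[label] = [(feature_names[i], float(coefs[i])) for i in idx]
    return estimator, selected


# Synthetic 3-class data (made-up): classes differ only in x0 and x1.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
y = np.repeat([0, 1, 2], 100)
X = rng.normal(scale=0.5, size=(300, 6))
X[:, :2] += centers[y]
names = [f"x{i}" for i in range(6)]

estimator, selected = sparse_classification_sketch(X, y, names)
score = balanced_accuracy_score(y, estimator.predict(X))
print(f"training balanced accuracy: {score:.2f}")
for label, feats in selected.items():
    print(label, feats)
```

Each class gets its own list of non-zero features, which is the per-state information one would read off when interpreting the resulting states.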

Tutorials

Work in progress

[x] Tutorial: LASSO functions
[x] Tutorial: Stateinterpreter with DeepTICA CVs (porting https://github.com/luigibonati/md-stateinterpreter/blob/main/tutorials/2_hierarchical_classification.ipynb)

Questions

[x] This requires scikit-learn as an additional dependency, which I would keep optional
[x] For now I have put these functions inside utils.lasso. However, since the sensitivity analysis already lives in utils.explain, we might move all of these functions into a new module called explain?
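
One common way to keep a dependency optional while still raising a clear error when the module that needs it is imported is sketched below. The helper require_optional is hypothetical and uses only the standard library; it is not necessarily how this PR implements the check.

```python
import importlib.util


def require_optional(package, pip_name=None):
    """Hypothetical helper: raise an informative ImportError if an
    optional top-level package is not installed."""
    if importlib.util.find_spec(package) is None:
        install_name = pip_name or package
        raise ImportError(
            f"This module requires the optional dependency '{install_name}'. "
            f"Install it with: pip install {install_name}"
        )


# A standard-library package is always found, so this does not raise.
require_optional("json")

# A missing package produces the explanatory error.
try:
    require_optional("surely_missing_package_xyz")
except ImportError as err:
    message = str(err)
    print(message)

# At the top of a module that needs scikit-learn, the call would be:
# require_optional("sklearn", pip_name="scikit-learn")
```

Placing the check at import time makes the failure appear immediately, with installation instructions, rather than later inside a function call.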

Status

[x] Ready to go

codecov[bot] commented 7 months ago

Codecov Report

Attention: Patch coverage is 91.90751% with 28 lines in your changes missing coverage. Please review.

Project coverage is 92.50%. Comparing base (3f9adeb) to head (71ad599).

luigibonati commented 2 months ago

I have put everything into a new explain submodule, which contains the sensitivity analysis and the sparse models.

Will merge it soon.

luigibonati / mlcolvar