Refactor CPI - Githubissues

mind-inria / hidimstat

HiDimStat: High-dimensional statistical inference tool for Python

https://mind-inria.github.io/hidimstat

BSD 2-Clause "Simplified" License

4 stars 4 forks source link

Refactor CPI #14

Closed jpaillard closed 1 month ago

jpaillard commented 2 months ago

I suggest a refactoring of the CPI functionality. It is inspired by the current implementation of permutation importance in scikit-learn. It aims at:

Sperate the model fit / predict / selection from the variable importance part (inspection)
Allow to use multiple variable importance method for a same fitted model (facilitates benchmarking)
Expose the fitting of covariate estimation. This should make clearer which features are used for the measure of importance and which data split are used to fit / predict the covariate estimator.
Leave the parallelization out of the VIM method (I think it should be done at the cross-fitting level)

I also updated the example using the diabetes dataset.

Linked to issues #12 and #13

codecov[bot] commented 2 months ago

Files with missing lines	Patch %	Lines
hidimstat/cpi.py	97.50%	2 Missing :warning:
hidimstat/loco.py	98.36%	1 Missing :warning:

```diff @@ Coverage Diff @@ ## main #14 +/- ## =========================================== - Coverage 91.79% 76.35% -15.45% =========================================== Files 44 46 +2 Lines 2926 2398 -528 =========================================== - Hits 2686 1831 -855 - Misses 240 567 +327 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jpaillard commented 2 months ago

I tried to address the different comments:

Improve the docstrings
Add tests
Rename and improve the example (adding a consistent implementation of LOCO and Permutation Importance to also evidence how the proposed implementation saves time by allowing to reuse the main predictor for various importance methods)

jpaillard commented 1 month ago

Also linked to #17 After discussing with @AngelReyero we agreed that the API suggested in the current PR would be more convenient for the reasons mentioned above but also to facilitate the implementation of the recent contribution of Angel's work. Separating the .predict and .score for example allows to modify CPI for averaging either at the prediction or loss level with minimal code changes.

jpaillard commented 1 month ago

The coverage decrease is related to #18, we are no longer testing Dnn and ModifiedRF which were part of the previous integrated in the previous BBI. Not sure if I should address it in this PR or a subsequent one Otherwise, this PR is ready for review @bthirion

bthirion commented 1 month ago

Leave me a bit of time, it is a big one...

bthirion commented 1 month ago

@achamma723 your opinion is welcome.

jpaillard commented 1 month ago

I tried to address all your comments @bthirion For me it is ready to merge.

achamma723 commented 1 month ago

Hello @bthirion and @jpaillard, sorry I weren't able to see all the comments in the past week. I'll try this weekend to have a look if you didn't decide to merge yet

achamma723 commented 1 month ago

Hello @jpaillard and @bthirion, I highlighted minor comments when passing, overall I think it is a great stable step for the future benchmarks. As for the cpi, now it is limited to the idea of reconstructing the variable or group of interest by the mean of the residuals, thus maybe the idea of the sampling (that existed via the Modified RF) is also interesting to push in the next steps?

jpaillard commented 1 month ago

Thx for the comments @achamma723. I made the modifications you suggested. Indeed it could be worth adding the sampling from nodes of the RF. Maybe we could tackle this in a following PR ?

jpaillard commented 1 month ago

Ready to merge on my side @bthirion

jpaillard commented 1 month ago

I added the additional tests and docstring @bthirion

mind-inria / hidimstat

Refactor CPI #14

Codecov Report