mind-inria / hidimstat

HiDimStat: High-dimensional statistical inference tool for Python
https://mind-inria.github.io/hidimstat
BSD 2-Clause "Simplified" License
4 stars 4 forks source link

Refactor CPI #14

Closed jpaillard closed 1 month ago

jpaillard commented 2 months ago

I suggest a refactoring of the CPI functionality. It is inspired by the current implementation of permutation importance in scikit-learn. It aims at:

I also updated the example using the diabetes dataset.

Linked to issues #12 and #13

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 99.06250% with 3 lines in your changes missing coverage. Please review.

Project coverage is 76.35%. Comparing base (b42572e) to head (3ffa1b5). Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
hidimstat/cpi.py 97.50% 2 Missing :warning:
hidimstat/loco.py 98.36% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #14 +/- ## =========================================== - Coverage 91.79% 76.35% -15.45% =========================================== Files 44 46 +2 Lines 2926 2398 -528 =========================================== - Hits 2686 1831 -855 - Misses 240 567 +327 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jpaillard commented 2 months ago

I tried to address the different comments:

jpaillard commented 1 month ago

Also linked to #17 After discussing with @AngelReyero we agreed that the API suggested in the current PR would be more convenient for the reasons mentioned above but also to facilitate the implementation of the recent contribution of Angel's work. Separating the .predict and .score for example allows to modify CPI for averaging either at the prediction or loss level with minimal code changes.

jpaillard commented 1 month ago

The coverage decrease is related to #18, we are no longer testing Dnn and ModifiedRF which were part of the previous integrated in the previous BBI. Not sure if I should address it in this PR or a subsequent one Otherwise, this PR is ready for review @bthirion

bthirion commented 1 month ago

Leave me a bit of time, it is a big one...

bthirion commented 1 month ago

@achamma723 your opinion is welcome.

jpaillard commented 1 month ago

I tried to address all your comments @bthirion For me it is ready to merge.

achamma723 commented 1 month ago

Hello @bthirion and @jpaillard, sorry I weren't able to see all the comments in the past week. I'll try this weekend to have a look if you didn't decide to merge yet

achamma723 commented 1 month ago

Hello @jpaillard and @bthirion, I highlighted minor comments when passing, overall I think it is a great stable step for the future benchmarks. As for the cpi, now it is limited to the idea of reconstructing the variable or group of interest by the mean of the residuals, thus maybe the idea of the sampling (that existed via the Modified RF) is also interesting to push in the next steps?

jpaillard commented 1 month ago

Thx for the comments @achamma723. I made the modifications you suggested. Indeed it could be worth adding the sampling from nodes of the RF. Maybe we could tackle this in a following PR ?

jpaillard commented 1 month ago

Ready to merge on my side @bthirion

jpaillard commented 1 month ago

I added the additional tests and docstring @bthirion