predict-idlab / powershap

A power-full Shapley feature selection method.
Other
196 stars 19 forks source link

Reduce memory usage by removing data copies #38

Closed lorenjan closed 1 year ago

lorenjan commented 1 year ago

These modifications make it possible to run PowerShap on much larger datasets without running out of memory. This is especially useful in places such as Kubernetes clusters which may terminate your process if you go over a memory limit.

lorenjan commented 1 year ago

Hello! Looking forward to your reviews and thoughts about these changes. Hopefully it's interesting to see what I had to do to be able to run on larger data sets. (Millions of rows, thousands of columns...)

There were some data structure copies that were being made in a couple of places which were taking a lot of extra memory, and I removed them, but maybe this is not safe to do for some reason? Please let me know if I did anything wrong and I will happily improve the code if I can.

codecov-commenter commented 1 year ago

Codecov Report

Merging #38 (c517bf6) into main (6b54fb7) will increase coverage by 0.08%. The diff coverage is 100.00%.

:exclamation: Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main      #38      +/-   ##
==========================================
+ Coverage   83.87%   83.95%   +0.08%     
==========================================
  Files           6        6              
  Lines         372      374       +2     
==========================================
+ Hits          312      314       +2     
  Misses         60       60              
Impacted Files Coverage Δ
powershap/powershap.py 68.67% <100.00%> (ø)
powershap/shap_wrappers/shap_explainer.py 96.83% <100.00%> (+0.04%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

jvdd commented 1 year ago

If you add some more comments about the reasoning behind this https://github.com/predict-idlab/powershap/pull/38#discussion_r1189452588, this PR is going in :)

jvdd commented 1 year ago

This is a major contribution to powershap @lorenjan

I enjoyed discussing the code changes with you & I really appreciate the effort you put into this :handshake: