scikit-learn-contrib / boruta_py

Python implementations of the Boruta all-relevant feature selection method.
BSD 3-Clause "New" or "Revised" License

catboost-plot-pimp-shapimp #77

Closed. ThomasBury closed this 3 years ago.

ThomasBury commented 4 years ago

Modifications are:

Thanks, KR

ThomasBury commented 4 years ago

To summarize, this PR solves/enhances:

brunofacca commented 4 years ago

Hi @danielhomola. First, let me thank you for this great library.

The changes in this PR would be very useful to me. Do you plan to merge?

PS: here is a writeup about the limitations of Gini as a feature importance metric.
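For anyone curious, the gist of that limitation can be reproduced with plain scikit-learn: impurity-based (Gini) importances are computed on the training data and tend to inflate noise features, whereas permutation importance evaluated on held-out data does not. A minimal sketch (the dataset and noise column are synthetic, purely for illustration):

```python
# Sketch: Gini vs permutation importance on a pure-noise column.
# Requires scikit-learn >= 0.22 for sklearn.inspection.permutation_importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
rng = np.random.RandomState(0)
X = np.hstack([X, rng.rand(X.shape[0], 1)])   # append a random noise column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Gini importance often assigns non-trivial weight to the noise column...
print("Gini importance (noise column):", rf.feature_importances_[-1])
# ...while permutation importance on held-out data stays near zero.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("Permutation importance (noise column):", perm.importances_mean[-1])
```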

ThomasBury commented 4 years ago

@brunofacca While they decide whether or not to merge it, you can have a look at the Boruta_shap package. It implements almost all the features of my PR (I have also discussed with its author fixing some issues and a possible merge with Boruta_py, which could be beneficial to avoid confusion and to be under the scikit-contrib flag). I hope it helps.
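For reference, a minimal usage sketch of Boruta_shap; I'm quoting its interface from memory, so check the package's README for the exact current API:

```python
# Minimal Boruta_shap sketch (API from memory; see the package README).
import pandas as pd
from sklearn.datasets import load_breast_cancer
from BorutaShap import BorutaShap

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# importance_measure='shap' swaps Gini for SHAP values, one of the
# enhancements this PR proposed for boruta_py.
selector = BorutaShap(importance_measure='shap', classification=True)
selector.fit(X=X, y=y, n_trials=50, random_state=0)

# Keep only the accepted features.
X_selected = selector.Subset()
```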

brunofacca commented 4 years ago

Thank you @ThomasBury, for this PR and for the tip. I'm actually testing the Boruta_shap package and it looks great except that it still has low test coverage and a smaller number of contributors. Of course, that is likely to improve as the project matures.

brunofacca commented 4 years ago

Did the maintainers consider merging these changes? Is there a decision? Thank you.

ThomasBury commented 3 years ago

> Did the maintainers consider merging these changes? Is there a decision? Thank you.

@brunofacca if you're still interested, I packaged 3 all-relevant FS methods here: https://github.com/ThomasBury/arfs. It's still incubating, but there are some unit tests and the documentation is there (I guess there are too many dependencies to be compliant with the scikit-learn-contrib requirements). I'm supervising a master's thesis whose goal is to study the properties of those methods, so the package is likely to evolve over time.

danielhomola commented 3 years ago

That's a nice package @ThomasBury, wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to the Unix tooling philosophy. I'll close this PR now if you don't mind.

ThomasBury commented 3 years ago

> That's a nice package @ThomasBury, wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to the Unix tooling philosophy. I'll close this PR now if you don't mind.

OK, thanks for the kind words. Would you be interested in a PR in pure sklearn (only sklearn estimators, with native and permutation importance), if compliant with the scikit-contrib requirements? That would make the package more than Boruta: a sklearn-flavoured all-relevant FS package, instead of the opposite strategy of having different packages and a wrapper on top of them. Or perhaps just a PR with permutation importance (slower but more accurate, and still relevant for small and mid-size data sets)? If not, no worries (I prefer to have your opinion before starting anything ^^). A rough sketch of what I mean is below.
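To make the suggestion concrete, here is a hypothetical sketch of one Boruta-style iteration using sklearn-only permutation importance. It is only the core idea, not the boruta_py implementation: the real algorithm repeats this over many iterations and applies a statistical test to the accumulated hit counts.

```python
# Hypothetical sketch: one Boruta-style iteration with sklearn-only
# permutation importance (not the boruta_py implementation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def boruta_hit_mask(X, y, random_state=0):
    """One iteration: a real feature scores a 'hit' when its permutation
    importance beats the best shadow (shuffled) feature."""
    rng = np.random.RandomState(random_state)
    X_shadow = X.copy()
    for j in range(X_shadow.shape[1]):   # shuffle each column independently
        rng.shuffle(X_shadow[:, j])
    X_aug = np.hstack([X, X_shadow])

    rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    rf.fit(X_aug, y)
    imp = permutation_importance(rf, X_aug, y, n_repeats=5,
                                 random_state=random_state).importances_mean

    n = X.shape[1]
    return imp[:n] > imp[n:].max()       # hit mask for the real features

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
print(boruta_hit_mask(X, y))             # boolean mask, one entry per feature
```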

brunofacca commented 3 years ago

Thank you, @ThomasBury! That's a very nice library and it's likely to fill what I consider to be a gap in the current ML tooling: I've tried dozens of feature selection strategies (including some considered state of the art) and none of them were effective on a high-dimensional dataset where even the most relevant features are quite noisy. Your strategy of running the XGBoost classifier on the entire data set ten times (in BoostARoota), for example, is a nice step towards a more effective feature selection strategy for data with a low signal-to-noise ratio.

It would be very interesting to eventually see a comparison of classification performance with the subsets of features selected by each of those three strategies. I will also give them a try in the near future.

danielhomola commented 3 years ago

Hi Thomas, really sorry, totally forgot about your question. I'd be happy to review a PR with permutation importance if:

  • we add it as a difference from the original paper in the readme
  • it's optional and easy to switch off
  • we at least have a notebook on some toy data (iris or whatever) with the original and your version, showing the difference.

ThomasBury commented 3 years ago

> Hi Thomas, really sorry, totally forgot about your question. I'd be happy to review a PR with permutation importance if:
>
>   • we add it as a difference from the original paper in the readme
>   • it's optional and easy to switch off
>   • we at least have a notebook on some toy data (iris or whatever) with the original and your version, showing the difference.

Hi @danielhomola,

I submitted a PR:

With this, you have pretty much the same version as in the ARFS package, but without any hard dependencies.

KR, Thomas