catboost-plot-pimp-shapimp

ThomasBury commented 4 years ago

Modifications are:

catboost checks (doesn't allow to change random seed after fitting), cases for catboost estimator
plot: boxplot of the var. imp history (confirmed, tentative and rejected are colour-coded)
categorical feature for boosting, plot (mimicking more or less the R one), OHE is known to lead to deep and unstable trees. LightGBM and CatBoost have native methods to handle cat. pred, this requires to declare which columns are cat (lightgbm has an auto mode but the columns should be integer encoded and dtype set to category)
permutation importance and SHAP importance, (the impurity based var.imp being biased towards large card. and numerical predictors and computed on the train set)
testing: modif_tests.py for classification and regression testing (including a random cat. pred with large cardinality)

thanks KR

ThomasBury commented 4 years ago

To summarize, this PR solves/enhances:

The categorical features (they are detected, encoded. The tree-based models are working better with integer encoding rather than with OHE, which leads to deep and unstable trees). If Catboost is used, then the cat.pred (if any) are set up
Work with Catboost sklearn API
Allow using sample_weight, for applications like Poisson regression or any requiring weights
3 different feature importances: native, SHAP and permutation. Native being the least consistent (because of the imp. biased towards numerical and large cardinality categorical) but the fastest of the 3.
Using lightGBM as default speed up by an order of magnitude the running time
Visualization like in the R package

brunofacca commented 4 years ago

Hi @danielhomola. First, let me thank you for this great library.

The changes in this PR would be very useful to me. Do you plan to merge?

PS: here is a writeup about the limitations of gini as a feature importance metric.

ThomasBury commented 4 years ago

@brunofacca Meanwhile they consider to merge it or not, you can have a look at the Boruta_shap package. It implements almost all the features of my PR (I also discussed with its author to fix some issues and the possible merge with Boruta_py, which could be beneficial to avoid any confusion and be under the scikit contrib flag). I hope it helps.

brunofacca commented 4 years ago

Thank you @ThomasBury, for this PR and for the tip. I'm actually testing the Boruta_shap package and it looks great except that it still has low test coverage and a smaller number of contributors. Of course, that is likely to improve as the project matures.

brunofacca commented 4 years ago

Did the maintainers consider merging these changes? Is there a decision? Thank you.

ThomasBury commented 3 years ago

Did the maintainers consider merging these changes? Is there a decision? Thank you.

@brunofacca if you're still interested, I packaged 3 All Relevant FS methods here: https://github.com/ThomasBury/arfs still incubating but there are some (unit)tests and the doc is there (I guess there are too many dependencies to be compliant with the scikit-learn contrib requirements). I'm supervising a master thesis, the goal being to study the properties of those methods (so the package is likely to evolve over time).

danielhomola commented 3 years ago

That's a nice package @ThomasBury, wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to unix tooling philosophy.. I'll close this PR now if you don't mind.

ThomasBury commented 3 years ago

That's a nice package @ThomasBury, wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to unix tooling philosophy.. I'll close this PR now if you don't mind.

Ok thanks for the kind words. Would you be interested in a PR in pure sklearn (so only sklearn estimators, native and permutation importance)? That would mean the package to be more than boruta, a sklearn-flavoured all relevant FS package, instead of the opposite strategy of having different packages and a wrapper on top of them, if compliant with sk-contrib requirements? Or perhaps just the PR with the permutation importance (slower but more accurate and still relevant for small/mid-size data sets)? If not, no worries (I prefer having your opinion before starting anything ^^)

brunofacca commented 3 years ago

Thank you, @ThomasBury! That's a very nice library and it's is likely to fill what I consider to be a gap in the current ML tooling: I've tried dozens of feature selection strategies (including those that are considered "state of the art") and none of them were effective for a high-dimensional dataset where even the most relevant features are quite noisy. Your strategy of running XGBoost classifier on the entire data set ten times (on BoostARoota), for example, is a nice step towards a more effective feature selection strategy for data with a low signal-to-noise ratio.

Would be very interesting to eventually see a comparison between classification performance with subsets of features selected by each of those 3 strategies. I will also give them a try in the near future.

danielhomola commented 3 years ago

Hi Thomas, really sorry, totally forgot about your question. I'd be happy to review PR with permutation importance if

we add it as a difference from the original paper in the readme
if it's optional and easy to switch off
if we at least have a notebook on some toy data (iris or whatever) with the original and your version and show the difference..

ThomasBury commented 3 years ago

Hi Thomas, really sorry, totally forgot about your question. I'd be happy to review PR with permutation importance if

we add it as a difference from the original paper in the readme

if it's optional and easy to switch off

if we at least have a notebook on some toy data (iris or whatever) with the original and your version and show the difference..

Hi @danielhomola,

I submitted a PR:

optional pimp via an argument
If shap is installed (soft dependency, try-catch import) shap feature importance can be used. Not required for installing boruta
no hard dependencies added, all optional and compatibility check using try-catch on import + error message
chart like the original, only if matplotlib is installed (try-catch import, mpl is not mandatory for installing boruta)
add sample_weight support
categorical feature support (based on the lightGBM recommendation)
Notebook with examples, with genuine predictors and artificial ones to check if the right ones are accepted/rejected
Message added to the Readme.

With this, you pretty much have the same version than in the ARFS package but without any hard dependencies.

KR Thomas

scikit-learn-contrib / boruta_py

catboost-plot-pimp-shapimp #77