serengil / chefboost

A Lightweight Decision Tree Framework supporting regular algorithms: ID3, C4.5, CART, CHAID and Regression Trees; some advanced techniques: Gradient Boosting, Random Forest and Adaboost w/categorical features support for Python
https://www.youtube.com/watch?v=Z93qE5eb6eg&list=PLsS_1RYmYQQHp_xZObt76dpacY543GrJD&index=3
MIT License

negative feature importance #57

Closed · dcguim closed this issue 4 months ago

dcguim commented 5 months ago

After running the C4.5 algorithm I get negative values for feature importance, which I find confusing, since feature importance is essentially calculated as the total amount of entropy from the parent node that the current split can explain, i.e. "organize" into more homogeneous child nodes. If I am getting a negative feature importance, it means the split is essentially creating entropy (less homogeneity), which makes no sense. On top of that, these features with negative importances are being used in rules.py. Additionally, there is evidence in the literature that "Every time a node is split on variable, the combined impurity for the two descendent nodes is less than the parent node." (Unbiased Measurement of Feature Importance in Tree-Based Methods: Split Improvement). I sketched a quick check of that claim below. Any chance you could explain this behavior?

DISCLAIMER: I can't share the data, as it is private.
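To illustrate what I mean, here is a quick, self-contained check (plain pandas, not the chefboost internals) that the size-weighted entropy of the child nodes after a split never exceeds the parent entropy, so the per-split gain, and any importance derived from it, should be non-negative:

```python
import math
import pandas as pd

def entropy(labels):
    # Shannon entropy of a label column
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(df, feature, target):
    # parent entropy minus the size-weighted entropy of the child nodes
    # created by splitting on `feature`; this quantity is always >= 0
    parent = entropy(df[target])
    children = sum(
        (len(part) / len(df)) * entropy(part[target])
        for _, part in df.groupby(feature)
    )
    return parent - children

# toy data: splitting can only decrease (or preserve) the weighted entropy
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Overcast"],
    "Decision": ["No", "No", "Yes", "No", "Yes", "Yes"],
})
print(information_gain(df, "Outlook", "Decision"))  # ≈ 0.667, never negative
```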

serengil commented 5 months ago

Would you please share your dataset so that I can reproduce the same feature importance values?
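In the meantime, a rough reproduction on synthetic data would look something like the sketch below (the column names and values are made up, and the chef.fit / chef.feature_importance calls assume the usual interface, which may differ slightly between chefboost versions):

```python
import pandas as pd
from chefboost import Chefboost as chef

# synthetic stand-in for the private dataset; chefboost expects the
# target column to be named "Decision" by default
df = pd.DataFrame({
    "Feature1": ["a", "a", "b", "b", "c", "c", "a", "b"],
    "Feature2": [1, 2, 1, 2, 1, 2, 2, 1],
    "Decision": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

config = {"algorithm": "C4.5"}
model = chef.fit(df, config)

# importances are derived from the rules written to outputs/rules/rules.py
# during training; negative rows here would reproduce the reported issue
fi = chef.feature_importance("outputs/rules/rules.py")
print(fi)
```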

serengil commented 4 months ago

Closed due to inactivity