Open jblindsay opened 2 years ago
Hi there, does this issue still need a contribution, and may I work on it? @jblindsay @Mec-iS
In sklearn, Random Forest feature importance is computed per tree by the compute_feature_importances function (along with a few helper functions), which is implemented in Cython:
cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef float64_t normalizer = 0.

    cdef cnp.float64_t[:] importances = np.zeros(self.n_features)

    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]

                importances[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    for i in range(self.n_features):
        importances[i] /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            for i in range(self.n_features):
                importances[i] /= normalizer

    return np.asarray(importances)
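For anyone who finds the pointer arithmetic hard to follow, here is a rough pure-Python sketch of the same computation on a flattened tree. The Node dataclass, the LEAF sentinel, and the toy tree are made up for illustration; they are not sklearn's or smartcore's actual data structures.

```python
from dataclasses import dataclass
from typing import List

LEAF = -1  # hypothetical sentinel for "no child", standing in for _TREE_LEAF


@dataclass
class Node:
    """One node of a flattened tree (illustrative layout, not sklearn's struct)."""
    left_child: int                 # index of the left child, or LEAF
    right_child: int                # index of the right child, or LEAF
    feature: int                    # feature used for the split (unused for leaves)
    impurity: float                 # impurity of the samples reaching this node
    weighted_n_node_samples: float  # weighted number of samples reaching this node


def compute_feature_importances(nodes: List[Node], n_features: int,
                                normalize: bool = True) -> List[float]:
    importances = [0.0] * n_features

    # Each internal node credits its split feature with the weighted
    # impurity decrease achieved by the split.
    for node in nodes:
        if node.left_child == LEAF:
            continue
        left = nodes[node.left_child]
        right = nodes[node.right_child]
        importances[node.feature] += (
            node.weighted_n_node_samples * node.impurity
            - left.weighted_n_node_samples * left.impurity
            - right.weighted_n_node_samples * right.impurity
        )

    # Scale by the weighted sample count at the root (node 0).
    root_weight = nodes[0].weighted_n_node_samples
    importances = [value / root_weight for value in importances]

    if normalize:
        total = sum(importances)
        if total > 0.0:  # avoid dividing by zero, e.g. when the root is pure
            importances = [value / total for value in importances]
    return importances


# Toy tree: 10 samples (5 of class A, 5 of class B), Gini impurity.
toy_tree = [
    Node(1, 2, 0, 0.5, 10.0),           # root, splits on feature 0
    Node(LEAF, LEAF, -1, 0.0, 4.0),     # pure leaf: 4 x A
    Node(3, 4, 1, 10.0 / 36.0, 6.0),    # 1 x A and 5 x B, splits on feature 1
    Node(LEAF, LEAF, -1, 0.0, 1.0),     # pure leaf: 1 x A
    Node(LEAF, LEAF, -1, 0.0, 5.0),     # pure leaf: 5 x B
]
toy_importances = compute_feature_importances(toy_tree, n_features=2)
```

On this toy tree, feature 0 ends up with about two thirds of the total importance and feature 1 with about one third, because the root split removes twice as much weighted impurity as the second split.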
The node impurity itself is calculated by the function cdef float64_t node_impurity(self) noexcept nogil, which has implementations for MSE, MAE, Gini, Poisson and cross-entropy.
I am going to check how we can implement similar functions in smartcore.
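As a rough illustration of what such impurity functions compute, here is an unweighted pure-Python sketch of three of the criteria named above (Gini, cross-entropy, MSE). The function names and signatures are made up for this example and are not sklearn's or smartcore's API; sklearn's versions also handle sample weights and multiple outputs, which this sketch skips. The base-2 logarithm for entropy is a choice made here.

```python
import math
from typing import Sequence


def gini(class_counts: Sequence[float]) -> float:
    """Gini impurity: 1 - sum_k p_k^2, from per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def entropy(class_counts: Sequence[float]) -> float:
    """Cross-entropy impurity: -sum_k p_k * log2(p_k), skipping empty classes."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0
    )


def mse(targets: Sequence[float]) -> float:
    """MSE impurity for regression: variance of the targets in the node."""
    n = len(targets)
    if n == 0:
        return 0.0
    mean = sum(targets) / n
    return sum((y - mean) ** 2 for y in targets) / n


balanced_gini = gini([5, 5])        # two equally likely classes
balanced_entropy = entropy([5, 5])  # one bit of uncertainty
pure_gini = gini([4, 0])            # a pure node has zero impurity
```

Each function takes only the data that reaches a single node, so an impurity value can be stored per node while the tree is being built and reused later for the importance calculation.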
The method for calculating feature importance is described in this article: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
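For quick reference, the impurity-based importance can be written out as follows. The notation is my own, matching the sklearn code quoted earlier in this thread rather than copied from the article: N_t is the weighted number of samples reaching node t, N is the weighted number of samples at the root, i(t) is the node impurity, t_L and t_R are the children of t, and v(t) is the feature that node t splits on. The last line, averaging over the T trees of a forest, is the usual convention and should be treated as an assumption here.

```latex
% Weighted impurity decrease produced by the split at internal node t
\Delta(t) = \frac{N_t}{N}\, i(t) - \frac{N_{t_L}}{N}\, i(t_L) - \frac{N_{t_R}}{N}\, i(t_R)

% Normalized importance of feature f within a single tree
\mathrm{FI}(f) = \frac{\sum_{t \,:\, v(t) = f} \Delta(t)}{\sum_{t} \Delta(t)}

% Importance of feature f in a forest of T trees (assumed: plain average)
\mathrm{FI}_{\mathrm{forest}}(f) = \frac{1}{T} \sum_{k=1}^{T} \mathrm{FI}_k(f)
```

Both sums in the second line run over internal (non-leaf) nodes only.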
This issue can be split into three subtasks. I am currently working on the first one; see https://github.com/tushushu/smartcore/tree/wip-issue-124
I believe it is common in Random Forest analyses for the variable importance to be reported. For example, variable importance can be determined using the mean decrease in accuracy that occurs when each variable is removed, or using the Gini impurity statistic. I may be mistaken, but I do not see any way to obtain this information through the current SmartCore API. I believe this would make a very valuable addition to the library.
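To make the "mean decrease in accuracy" idea concrete, here is a minimal pure-Python sketch. It approximates "removing" a variable by shuffling that variable's column (permutation importance), which is a swapped-in technique and not necessarily what any particular library does. The predict callable, the function names, and the toy data are all hypothetical.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[Sequence[float]], int],
             X: List[List[float]], y: List[int]) -> float:
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return correct / len(y)


def permutation_importance(predict: Callable[[Sequence[float]], int],
                           X: List[List[float]], y: List[int],
                           n_repeats: int = 10, seed: int = 0) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving the other columns untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [list(row[:j]) + [value] + list(row[j + 1:])
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy "model" that only looks at feature 0; feature 1 is pure noise.
def toy_predict(row: Sequence[float]) -> int:
    return int(row[0] > 0.5)


toy_X = [[0.1, 3.0], [0.2, 1.0], [0.3, 4.0], [0.4, 1.0],
         [0.6, 5.0], [0.7, 9.0], [0.8, 2.0], [0.9, 6.0]]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]
toy_importances = permutation_importance(toy_predict, toy_X, toy_y)
```

In this toy setup the model ignores feature 1, so shuffling it never changes the accuracy and its importance is exactly zero, while shuffling feature 0 hurts the accuracy. Unlike the impurity-based measure, this approach only needs a predict function, so it would work with any fitted model.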