smartcorelib / smartcore

A comprehensive library for machine learning and numerical computing. The library provides a set of tools for linear algebra, numerical computing, optimization, and enables a generic, powerful yet still efficient approach to machine learning.
https://smartcorelib.org/
Apache License 2.0

Variable Importance in Random Forest Analysis #124

Open jblindsay opened 2 years ago

jblindsay commented 2 years ago

I believe it is common in Random Forest analyses for variable importance to be reported. For example, variable importance can be determined from the mean decrease in accuracy that occurs when each variable is permuted, or from the Gini impurity statistic. I may be mistaken, but I do not currently see any means of computing this information through the current SmartCore API. I believe this would be a very valuable addition to the library.
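For reference, the mean-decrease-in-accuracy approach (often called permutation importance) can be sketched in plain Python. The toy dataset and threshold "model" below are purely illustrative stand-ins for a trained forest, not SmartCore or sklearn code:

```python
import random

# Toy dataset: feature 0 determines the label, feature 1 is pure noise.
random.seed(42)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def predict(row):
    """Stand-in for a trained model: thresholds feature 0."""
    return 1 if row[0] > 0.5 else 0

def accuracy(X, y):
    return sum(predict(r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, feature, n_repeats=10):
    """Mean decrease in accuracy when `feature` is shuffled."""
    base = accuracy(X, y)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        random.shuffle(col)  # break the feature-label association
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, col)]
        drops.append(base - accuracy(X_perm, y))
    return sum(drops) / n_repeats

print(permutation_importance(X, y, 0))  # informative feature: large drop
print(permutation_importance(X, y, 1))  # noise feature: no drop
```

Shuffling the informative feature destroys the model's accuracy, while shuffling the noise feature leaves predictions unchanged, so its importance is zero.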

tushushu commented 5 months ago

Hi there, does this issue still need a contribution, and may I work on it? @jblindsay @Mec-iS

tushushu commented 5 months ago

In sklearn, Random Forest feature importance is calculated by the compute_feature_importances function (among others), which is implemented in Cython.

    cpdef compute_feature_importances(self, normalize=True):
        """Computes the importance of each feature (aka variable)."""
        cdef Node* left
        cdef Node* right
        cdef Node* nodes = self.nodes
        cdef Node* node = nodes
        cdef Node* end_node = node + self.node_count

        cdef float64_t normalizer = 0.

        cdef cnp.float64_t[:] importances = np.zeros(self.n_features)

        with nogil:
            while node != end_node:
                if node.left_child != _TREE_LEAF:
                    # ... and node.right_child != _TREE_LEAF:
                    left = &nodes[node.left_child]
                    right = &nodes[node.right_child]

                    importances[node.feature] += (
                        node.weighted_n_node_samples * node.impurity -
                        left.weighted_n_node_samples * left.impurity -
                        right.weighted_n_node_samples * right.impurity)
                node += 1

        for i in range(self.n_features):
            importances[i] /= nodes[0].weighted_n_node_samples

        if normalize:
            normalizer = np.sum(importances)

            if normalizer > 0.0:
                # Avoid dividing by zero (e.g., when root is pure)
                for i in range(self.n_features):
                    importances[i] /= normalizer

        return np.asarray(importances)

The node impurity is calculated by the function cdef float64_t node_impurity(self) noexcept nogil, whose criterion implementations support MSE, MAE, Gini, Poisson deviance, and cross-entropy.

I am going to check how we can implement similar functions in smartcore.

tushushu commented 5 months ago

The method for calculating feature importance is described in this article: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3

tushushu commented 5 months ago

This issue can be split into three sub-tasks. I am currently working on the first one. See https://github.com/tushushu/smartcore/tree/wip-issue-124