smartcorelib / smartcore

A comprehensive library for machine learning and numerical computing. The library provides a set of tools for linear algebra, numerical computing, optimization, and enables a generic, powerful yet still efficient approach to machine learning.
https://smartcorelib.org/
Apache License 2.0

Variable Importance in Random Forest Analysis #124

Open jblindsay opened 2 years ago

jblindsay commented 2 years ago

I believe it is common in Random Forest analyses for variable importance to be reported. For example, variable importance can be determined from the mean decrease in accuracy that occurs when each variable is permuted, or from the Gini impurity statistic. I may be mistaken, but I do not currently see any means of computing this information through the current SmartCore API. I believe this would be a very valuable addition to the library.
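For reference, the mean-decrease-in-accuracy approach (often called permutation importance) can be sketched in plain Python. The toy dataset and threshold "model" below are purely illustrative stand-ins for a trained forest, not SmartCore or sklearn code:

```python
import random

# Toy dataset: feature 0 determines the label, feature 1 is pure noise.
random.seed(42)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def predict(row):
    """Stand-in for a trained model: thresholds feature 0."""
    return 1 if row[0] > 0.5 else 0

def accuracy(X, y):
    return sum(predict(r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, feature, n_repeats=10):
    """Mean decrease in accuracy when `feature` is shuffled."""
    base = accuracy(X, y)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        random.shuffle(col)  # break the feature-label association
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, col)]
        drops.append(base - accuracy(X_perm, y))
    return sum(drops) / n_repeats

print(permutation_importance(X, y, 0))  # informative feature: large drop
print(permutation_importance(X, y, 1))  # noise feature: no drop
```

Shuffling the informative feature destroys the model's accuracy, while shuffling the noise feature leaves predictions unchanged, so its importance is zero.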

tushushu commented 5 months ago

Hi there, does this issue still need a contribution, and may I work on it? @jblindsay @Mec-iS

tushushu commented 5 months ago

In sklearn, Random Forest feature importance is calculated by the compute_feature_importances function (among others), which is implemented in Cython.

    cpdef compute_feature_importances(self, normalize=True):
        """Computes the importance of each feature (aka variable)."""
        cdef Node* left
        cdef Node* right
        cdef Node* nodes = self.nodes
        cdef Node* node = nodes
        cdef Node* end_node = node + self.node_count

        cdef float64_t normalizer = 0.

        cdef cnp.float64_t[:] importances = np.zeros(self.n_features)

        with nogil:
            while node != end_node:
                if node.left_child != _TREE_LEAF:
                    # ... and node.right_child != _TREE_LEAF:
                    left = &nodes[node.left_child]
                    right = &nodes[node.right_child]

                    importances[node.feature] += (
                        node.weighted_n_node_samples * node.impurity -
                        left.weighted_n_node_samples * left.impurity -
                        right.weighted_n_node_samples * right.impurity)
                node += 1

        for i in range(self.n_features):
            importances[i] /= nodes[0].weighted_n_node_samples

        if normalize:
            normalizer = np.sum(importances)

            if normalizer > 0.0:
                # Avoid dividing by zero (e.g., when root is pure)
                for i in range(self.n_features):
                    importances[i] /= normalizer

        return np.asarray(importances)

The node impurity is calculated by the function cdef float64_t node_impurity(self) noexcept nogil, whose criterion implementations support MSE, MAE, Gini, Poisson deviance, and cross-entropy.

I am going to check how we can implement similar functions in smartcore.

tushushu commented 5 months ago

The method for calculating feature importance is described in this article: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3

tushushu commented 5 months ago

This issue can be split into three sub-tasks. I am currently working on the first one. See https://github.com/tushushu/smartcore/tree/wip-issue-124