rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Add support for computing feature_importances in RF #3531

Open teju85 opened 3 years ago

teju85 commented 3 years ago

Is your feature request related to a problem? Please describe.
The RF implementation should support computing the feature_importances_ property, just as it is exposed in sklearn.

Describe the solution you'd like

  1. By default, we should compute normalized feature_importances_ (i.e., the importances across all features sum to 1.0).
  2. The implementation done in sklearn is here. We already have all of this information in our Node; while building the tree, we just need to keep accumulating each feature's importance as more nodes are added. A sketch of this accumulation is shown below.
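
For illustration, a minimal sketch of the sklearn-style accumulation: each split contributes its weighted impurity decrease to its split feature, and the result is normalized. The dict field names (weighted_n_node_samples, impurity, is_leaf, left, right, split_feature) are illustrative placeholders for what a node would need to carry, not cuML's actual node layout.

import numpy as np

def tree_feature_importances(root, n_features):
    # Accumulate each split's weighted impurity decrease onto its split feature,
    # then normalize so the importances sum to 1.0 (the sklearn convention).
    importances = np.zeros(n_features)
    total = root["weighted_n_node_samples"]

    def visit(node):
        if node["is_leaf"]:
            return
        left, right = node["left"], node["right"]
        decrease = (
            node["weighted_n_node_samples"] * node["impurity"]
            - left["weighted_n_node_samples"] * left["impurity"]
            - right["weighted_n_node_samples"] * right["impurity"]
        ) / total
        importances[node["split_feature"]] += decrease
        visit(left)
        visit(right)

    visit(root)
    return importances / importances.sum()
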
JohnZed commented 3 years ago

Definitely agreed. Not sure we'll have enough bandwidth to get this into 0.19 (given the work going into the new backend), but it should be prioritized highly after that.

teju85 commented 3 years ago

Here's one use-case that requires this attribute to be present: https://github.com/willb/fraud-notebooks/blob/develop/03-model-random-forest.ipynb

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

sooryaa-thiruloga commented 3 years ago

We are interested in using this feature in our use case too.

beckernick commented 2 years ago

This would also be useful for tools like Boruta, a popular feature selection library that's part of scikit-learn-contrib. There is a Boruta issue asking for support for cuML estimators.

teju85 commented 2 years ago

Tagging @vinaydes and @venkywonka to see if we can have Venkat start on this?

hafarooki commented 2 years ago

This is probably not the most efficient implementation, but in case anyone else needs it:

import numpy as np

def calculate_importances(nodes, n_features):
    # One row of normalized importances per tree (root node).
    importances = np.zeros((len(nodes), n_features))

    def calculate_node_importances(node, feature_gains):
        # Leaf nodes have no split, hence no gain to accumulate.
        if "gain" not in node:
            return

        # Weight the split's gain by the number of samples that reach this node.
        samples = node["instance_count"]
        gain = node["gain"]
        feature = node["split_feature"]
        feature_gains[feature] += gain * samples

        for child in node["children"]:
            calculate_node_importances(child, feature_gains)

    for i, root in enumerate(nodes):
        # Reset the accumulator for each tree so gains do not leak across trees.
        feature_gains = np.zeros(n_features)
        calculate_node_importances(root, feature_gains)
        importances[i] = feature_gains / feature_gains.sum()

    # Average the per-tree importances across the forest.
    return np.mean(importances, axis=0)

You can see the logic behind it here: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
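
For context, here is a hedged usage sketch. It assumes the roots come from cuML's RF get_json() dump, whose nodes carry the "split_feature", "gain", "instance_count" and "children" fields read above; the toy data is purely illustrative.

import json
import numpy as np
from cuml.ensemble import RandomForestClassifier

# Hypothetical toy data; any numeric feature matrix and label vector will do.
X = np.random.rand(1000, 8).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

model = RandomForestClassifier(n_estimators=10)
model.fit(X, y)

# get_json() returns the forest as a JSON string: a list with one root node per tree.
roots = json.loads(model.get_json())
print(calculate_importances(roots, n_features=X.shape[1]))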

beckernick commented 2 years ago

Cross-linking an issue that asks for this feature and OOB support: https://github.com/rapidsai/cuml/issues/3361

Wulin-Tan commented 2 years ago

It is an important issue worth a look.

HybridNeos commented 1 year ago

Commenting to reiterate the usefulness of this feature. I was trying to follow https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html using cuML, but it is not currently possible.

beckernick commented 1 year ago

A user shared a workflow today for which cuML's RF was 20x faster than their prior CPU-based RF. They wanted to use feature importance for feature selection, but weren't able to do so.

szeka94 commented 3 months ago

Yeah, I'm missing this too.

Avertemp commented 2 months ago

Same here. I switched to cuML for feature selection; this is a really needed feature.

Zach-Sten commented 2 months ago

Same here. I'm using RF and need feature importance.