teju85 opened this issue 3 years ago
Definitely agreed. Not sure we'll have enough bandwidth to get this into 0.19 (given the work going into the new backend), but it should be prioritized highly after that.
Here's one use-case that requires this attribute to be present: https://github.com/willb/fraud-notebooks/blob/develop/03-model-random-forest.ipynb
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
We are interested in using this feature in our use case too.
This would also be useful for tools like Boruta, a popular feature-selection library that is part of scikit-learn-contrib. There is a Boruta issue asking for support for cuML estimators.
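To illustrate why the attribute matters here: BorutaPy reads feature_importances_ from its wrapped estimator after every fit. A minimal sketch of the standard BorutaPy usage with an sklearn RF follows; swapping in a cuML estimator fails today precisely because that attribute is missing (the data and parameters below are illustrative only).

    import numpy as np
    from boruta import BorutaPy
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # BorutaPy queries estimator.feature_importances_ after each fit,
    # which is why a cuML RF cannot currently be dropped in here.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
    selector = BorutaPy(rf, n_estimators='auto', random_state=0)
    selector.fit(X, y)
    print(selector.support_)   # boolean mask of selected features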
Tagging @vinaydes and @venkywonka to see if we can have Venkat start on this?
This is probably not the most efficient implementation, but in case anyone else needs it:
import numpy as np

def calculate_importances(nodes, n_features):
    # One row of normalized importances per tree, one column per feature.
    importances = np.zeros((len(nodes), n_features))

    def calculate_node_importances(node, feature_gains):
        # Leaf nodes have no "gain" entry; stop recursing there.
        if "gain" not in node:
            return
        samples = node["instance_count"]
        gain = node["gain"]
        feature = node["split_feature"]
        # Weight each split's gain by the number of samples that reach it.
        feature_gains[feature] += gain * samples
        for child in node["children"]:
            calculate_node_importances(child, feature_gains)

    for i, root in enumerate(nodes):
        feature_gains = np.zeros(n_features)  # reset per tree
        calculate_node_importances(root, feature_gains)
        importances[i] = feature_gains / feature_gains.sum()

    # Average the per-tree importances, as sklearn does for forests.
    return np.mean(importances, axis=0)
You can see the logic behind it here: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
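For reference, here is a minimal usage sketch, assuming the node dictionaries come from cuML's RandomForest get_json() dump. The exact JSON layout and key names ("gain", "split_feature", "instance_count") may differ across cuML versions, so treat this as an illustration rather than the official API.

    import json
    import numpy as np
    from cuml.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    # Assumed workflow: fit a cuML RF, dump its trees as JSON, then reuse
    # calculate_importances() from the snippet above.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X = X.astype(np.float32)
    y = y.astype(np.int32)

    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X, y)

    trees = json.loads(model.get_json())   # assumed: one dict per tree root
    importances = calculate_importances(trees, X.shape[1])
    print(importances, importances.sum())  # normalized, sums to ~1.0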
Cross-linking an issue that asks for this feature and OOB support: https://github.com/rapidsai/cuml/issues/3361
It is an important issue worth a look.
Commenting to reiterate the usefulness of this feature. I was trying to follow https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html with cuML, but it is not currently possible.
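For context, this is roughly the pattern from that sklearn example in condensed form; the feature_importances_ access on the fitted forest is the piece that has no cuML equivalent today (sketch only, using sklearn's own estimator and synthetic data).

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Condensed version of the linked sklearn example.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    forest = RandomForestClassifier(random_state=0).fit(X, y)

    # This attribute works in sklearn but is missing from cuML's RF.
    print(forest.feature_importances_)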
A user shared a workflow today for which cuML's RF was 20x faster than their prior CPU-based RF. They wanted to use feature importance for feature selection, but weren't able to do so.
Yeah, I'm missing this too.
Same here. I switched to cuML for feature selection; this is a really needed feature.
Same here. Using RF and need feature importance.
Is your feature request related to a problem? Please describe.
The RF implementation should support computing the feature_importances_ property, just like how it is exposed in sklearn.

Describe the solution you'd like
Expose a normalized feature_importances_ property (i.e. all the importances across the features sum to 1.0). We just need to, while building the tree, keep accumulating each feature's importance as we keep adding more nodes.
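A minimal sketch of that accumulation, assuming a hypothetical on_split() hook invoked as the builder adds split nodes. The names are illustrative only and are not cuML internals.

    import numpy as np

    class ImportanceAccumulator:
        """Hypothetical helper: accumulate weighted gain per feature while
        nodes are added, then normalize so the importances sum to 1.0."""

        def __init__(self, n_features):
            self.gains = np.zeros(n_features)

        def on_split(self, feature, gain, n_samples):
            # Called each time the builder adds a split node.
            self.gains[feature] += gain * n_samples

        def feature_importances_(self):
            total = self.gains.sum()
            return self.gains / total if total > 0 else self.gains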