grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Variable importance () plot #1411

Closed hanneleer closed 2 weeks ago

hanneleer commented 2 months ago

Dear all,

I was wondering if someone could help me with the following:

I use the variable importance measure to describe which variables are chosen most often by the causal forest algorithm. However, I would now like to know at which levels/values each variable tended to split (on average). Is there a way to grow a tree on the most important variables?

Thanks a lot already!

erikcs commented 2 months ago

Hi @hanneleer, you could calculate that using the function get_tree that gives you details on the split variable and level for every tree. You can also fit a new forest on the most important variables, Algorithm 1 here gives an example of that. For other visualizations you might find some of the example plots in this tutorial useful.
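A minimal sketch of both suggestions, using simulated data (the data-generating process and the choice of keeping the top 3 variables are illustrative, not from the thread):

```r
library(grf)

set.seed(1)
n <- 500; p <- 6
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

forest <- causal_forest(X, Y, W)

# Rank covariates by a weighted count of how often they are split on.
vi <- variable_importance(forest)
ranked <- order(vi, decreasing = TRUE)

# Refit on the most important variables (cf. Algorithm 1 in the linked paper).
top.vars <- ranked[1:3]
forest.top <- causal_forest(X[, top.vars], Y, W)

# Inspect an individual tree: each interior node records the split
# variable and the split value (level) it used.
tree <- get_tree(forest.top, index = 1)
tree$nodes[[1]]$split_variable  # variable used at the root
tree$nodes[[1]]$split_value     # threshold it split at
```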

hanneleer commented 2 months ago

Thanks a lot for your response and insights @erikcs ! I would like to pose an additional question regarding the two possibilities you highlighted, if I may.

When fitting a new forest on the most important variables, each tree typically splits on different variables at different values (I would suppose no two trees make the same first split), so splits vary across the forest. With, say, 2000 trees, each might choose a different variable, and a different threshold for that variable, at its initial split.

Is it possible to produce an aggregate visualization that reflects the average of these splits and shows which variable tends to be prioritized first across the forest, to help understand the policy's differential impacts? Or am I limited to the get_tree function, which only returns a single tree from the forest?

Thanks a lot for your time!

erikcs commented 2 months ago

Hi @hanneleer, rather than focusing on every single split in the forest, we'd typically recommend something like the heatmaps in the tutorial linked above, which visualize covariate levels across HTE predictions.
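For reference, the summary behind those heatmaps can be sketched as follows: rank observations by predicted CATE, cut them into quintiles, and average each covariate within quintile (simulated data; all names here are illustrative):

```r
library(grf)

set.seed(1)
n <- 1000; p <- 4
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

forest <- causal_forest(X, Y, W)

# Out-of-bag CATE predictions.
tau.hat <- predict(forest)$predictions

# Cut predictions into quintiles and average each covariate per quintile;
# a heatmap of this matrix shows which covariate levels move with the CATE.
quintile <- cut(tau.hat,
                breaks = quantile(tau.hat, seq(0, 1, by = 0.2)),
                include.lowest = TRUE, labels = 1:5)
profile <- aggregate(X, by = list(quintile = quintile), FUN = mean)
profile
```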

hanneleer commented 2 months ago

I will dive deeper into this, thanks a lot! @erikcs

hanneleer commented 3 weeks ago

@erikcs I have a small question related to the heatmap that visualizes covariate levels across HTE predictions, and I would really appreciate your insights if possible.

I would like to rank observations into quintiles according to their estimated CATE, as you suggested previously. Imagine I have individual- and school-level characteristics, and I also cluster on schools. I understand that, to ensure the model is not fit using individual i's or school j's own data, we need to divide the data into K folds by cluster (in my case, schools) before ranking them into groups.

However, when I look at my heatmap, I notice that for school characteristics the average values are the same across quintiles. This makes sense, since individuals in the same school share the same school-level characteristics. But this way we lose information about school characteristics as heterogeneity drivers. Is there something I can do about this, or is the analysis simply not informative for school characteristics given that I already cluster on schools?

Thanks a lot for your time!

erikcs commented 3 weeks ago

Yes, it's a good idea to do sample splitting where you fit CATEs on one subset and evaluate them on another (here is an example of how you could draw clusters at random for training/evaluation in case that wasn't obvious).
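A minimal sketch of that cluster-level sample splitting, on simulated data (the school structure, `school.id`, and the data-generating process are all illustrative):

```r
library(grf)

set.seed(1)
n.schools <- 40; per.school <- 25
school.id <- rep(1:n.schools, each = per.school)
n <- length(school.id)
X <- matrix(rnorm(n * 4), n, 4)
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] * W + rnorm(n)

# Draw half the schools at random for training; the rest for evaluation.
train.schools <- sample(unique(school.id), n.schools / 2)
train <- school.id %in% train.schools

forest.train <- causal_forest(X[train, ], Y[train], W[train],
                              clusters = school.id[train])

# Rank held-out observations into quintiles by predicted CATE.
tau.eval <- predict(forest.train, X[!train, ])$predictions
quintile <- cut(tau.eval,
                breaks = quantile(tau.eval, seq(0, 1, by = 0.2)),
                include.lowest = TRUE, labels = 1:5)
table(quintile)
```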

For HTEs, you can still use predictor variables that vary at the cluster level, but whether they capture meaningful heterogeneity and/or vary along the predicted treatment effects is going to be problem-specific.