grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Output tree$nodes[[i]]$samples #258

Closed: predt closed this issue 6 years ago

predt commented 6 years ago

Hello @jtibshirani, quick question: for a given final node "i" of a tree (i.e. a leaf), does the output tree$nodes[[i]]$samples correspond to the observations of the training subsample used to build the tree (i.e. J1 in the paper) that fall in that leaf, or to the observations from the other subsample (J2) that fall in that leaf? Thanks!

@predt I'm sorry I missed your question earlier! Would you be able to open a new issue with this question, and I will add a detailed answer there? Keeping each issue scoped to one topic helps ensure that other users with the same question will be able to find the answer as well. To answer briefly, that vector only contains examples from the second subsample (J2).

Thanks, @jtibshirani. Since tree$nodes[[i]]$samples corresponds to J2, the complement in "drawn_samples" should give me the set of samples in J1. Is that correct? I'm working on the appendix of an application of GRF, and I'm using an example tree figure to make the explanation of how a tree is built more pedagogical. I wanted to add the theta.hat.P values that result from splitting a node (theta.hat.P is the notation in the paper) to illustrate how splits favor heterogeneity in the context of a generalized causal forest. That is why I'm looking for the J1 samples. Thanks.

jtibshirani commented 6 years ago

You're right, drawn_samples will include all samples that went into constructing the tree. If honesty is enabled, this set includes both the samples used to perform splits (J1), and the samples that populate the leaf nodes (J2). If honesty is not enabled, these two sets are the same, and drawn_samples will be equal to the union of all samples in the leaf nodes.

I've kept this issue open and tagged it with 'documentation', so we remember to add an explanation to get_tree about the different list elements that are returned.

susanathey commented 6 years ago

It would be better to keep track of which is which (J1 and J2) for the use case of working with the results from a single tree; it may matter for different methods of calculating standard errors as well.

jtibshirani commented 6 years ago

@susanathey to clarify the exchange above, because you have access to both the leaf samples of a tree, and the overall 'drawn samples' for that tree, both J1 and J2 can be calculated fairly easily. In particular, J2 can be calculated by taking the union of all samples in nodes[[i]]$samples, then J1 can be found by taking the difference of drawn_samples and J2.

My intuition is that unless accessing both J1 and J2 is part of a common (and performance-sensitive) workflow, we shouldn't return those sets separately to avoid duplicating the same set in J1 and J2 when honesty isn't enabled. Let me know if that seems off.
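The set arithmetic described above can be sketched in a few lines. This is only an illustration, not grf code: a plain Python dict stands in for the list returned by get_tree, the `is_leaf` field and the `honesty` argument are assumptions made for the example, and the sample indices are made up.

```python
def split_honest_samples(tree, honesty=True):
    """Return (j1, j2) as sets of sample indices for a single tree.

    `tree` is a hypothetical stand-in for the R list returned by get_tree:
    a dict with "drawn_samples" and a list of "nodes".
    """
    drawn = set(tree["drawn_samples"])

    # J2: union of the samples stored in every leaf node.
    j2 = set()
    for node in tree["nodes"]:
        if node.get("is_leaf"):
            j2.update(node.get("samples", []))

    if not honesty:
        # Without honesty the same samples both split and populate the
        # leaves, so J1 and J2 coincide with drawn_samples.
        return drawn, drawn

    # J1: drawn samples that never appear in a leaf.
    j1 = drawn - j2
    return j1, j2


# Made-up example: six drawn samples, three of which populate the leaves.
example_tree = {
    "drawn_samples": [1, 2, 3, 4, 5, 6],
    "nodes": [
        {"is_leaf": False},
        {"is_leaf": True, "samples": [2, 5]},
        {"is_leaf": True, "samples": [6]},
    ],
}

j1, j2 = split_honest_samples(example_tree)
print(sorted(j1), sorted(j2))
```

Note that the plain set difference is empty when honesty is disabled, because every drawn sample then appears in some leaf. That is why the sketch returns drawn_samples for both sets in that case rather than relying on the difference.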

susanathey commented 6 years ago

@jtibshirani Sorry I misunderstood. Maybe we can post a code sample and/or add it to our testing or demo code for users who might want to access them.

jtibshirani commented 6 years ago

I've updated the documentation in #268.