predt closed this issue 6 years ago
You're right, `drawn_samples` will include all samples that went into constructing the tree. If honesty is enabled, this set includes both the samples used to perform splits (`J1`) and the samples that populate the leaf nodes (`J2`). If honesty is not enabled, these two sets are the same, and `drawn_samples` will be equal to the union of all samples in the leaf nodes.
I've kept this issue open and tagged it with 'documentation', so we remember to add an explanation to `get_tree` about the different list elements that are returned.
It would be better to keep track of which is which (J1 and J2) for the use case of working with the results from a single tree; it may also matter for different methods of calculating standard errors.
@susanathey to clarify the exchange above: because you have access to both the leaf samples of a tree and the overall 'drawn samples' for that tree, both `J1` and `J2` can be calculated fairly easily. In particular, `J2` can be calculated by taking the union of all samples in `nodes[[i]]$samples`, and `J1` can then be found by taking the difference of `drawn_samples` and `J2`.
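A minimal sketch of that calculation, in case it helps. The helper name `split_honest_samples` and the toy tree are made up for illustration and are not part of the package. The sketch only assumes the tree object has a `drawn_samples` vector and a `nodes` list whose leaves carry a `samples` vector, as described above.

```r
# Hypothetical helper (not part of the package): recover J1 and J2 from one tree.
# Assumes `tree$drawn_samples` holds every sample drawn for the tree and that
# each leaf in `tree$nodes` stores its samples in `$samples`.
split_honest_samples <- function(tree) {
  # J2: union of the samples stored in the leaf nodes.
  # Non-leaf nodes have no `samples` element, so they contribute NULL,
  # which unlist() drops.
  leaf_samples <- lapply(tree$nodes, function(node) node$samples)
  j2 <- unique(unlist(leaf_samples))

  # J1: drawn samples that do not appear in any leaf.
  j1 <- setdiff(tree$drawn_samples, j2)

  list(J1 = j1, J2 = j2)
}

# Toy stand-in for a tree object, so the sketch runs without a fitted forest.
toy_tree <- list(
  drawn_samples = c(1, 2, 3, 4, 5, 6, 7, 8),
  nodes = list(
    list(is_leaf = FALSE),
    list(is_leaf = TRUE, samples = c(2, 5)),
    list(is_leaf = TRUE, samples = c(7, 8))
  )
)

sets <- split_honest_samples(toy_tree)
sets$J1  # 1 3 4 6
sets$J2  # 2 5 7 8
```

With a real forest, the object returned by `get_tree` would take the place of `toy_tree`. When honesty is not enabled, the difference comes back empty, because `drawn_samples` then equals the union of the leaf samples.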
My intuition is that unless accessing both `J1` and `J2` is part of a common (and performance-sensitive) workflow, we shouldn't return those sets separately, so that we avoid duplicating the same set in `J1` and `J2` when honesty isn't enabled. Let me know if that seems off.
@jtibshirani Sorry I misunderstood. Maybe we can post a code sample and/or add it to our testing or demo code for users who might want to access them.
I've updated the documentation in #268.
Hello @jtibshirani, quick question: for a given final node "i" of a tree (i.e. a leaf), does the output tree$nodes[[i]]$samples correspond to the observations from the training subsample used to build the tree (i.e. J1 in the paper) that fall in that leaf, or to the observations from the other subsample (J2) that fall in that leaf? Thanks!
@predt I'm sorry I missed your question earlier! Would you be able to open a new issue with this question, and I will add a detailed answer there? Keeping each issue scoped to one topic helps ensure that other users with the same question will be able to find the answer as well. To answer briefly, that vector only contains examples from the second subsample (J2).
Thanks, @jtibshirani. Since tree$nodes[[i]]$samples corresponds to J2, the complement in "drawn_samples" should give me the set of samples in J1. Is that correct? I'm working on the appendix of an application of GRF, and I'm using an example tree figure to make the explanation of how a tree is built more pedagogical. I wanted to add the theta.hat.P values that result from splitting a node (theta.hat.P is the notation in the paper) to illustrate how splits favor heterogeneity in the context of a generalized causal forest. That is why I'm looking for the J1 samples. Thanks.