grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Do sample.weights and min.node.size work together? #570

Open · ferlocar opened this issue 4 years ago

ferlocar commented 4 years ago

I'm trying to use grf with a large data set (770 million observations) and a few features. In order to process the large amount of data, I've grouped observations by all possible combinations of the features. I then proceeded to use the groups as 'the observations' and the number of observations per group as the sample weights. Now I want to tune the forest by trying different values for min.node.size, and my question is as follows:

Does min.node.size account for sample.weights? For example, could there be a leaf with a single observation with weight 10 even though min.node.size is 3?

Thanks in advance for your help!
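For concreteness, here's a minimal sketch of the compression workflow I described above. The column names (`x1`, `x2`, `w`, `y`) are purely illustrative, and I'm also grouping by the treatment indicator so that both arms survive compression:

```r
library(grf)
library(dplyr)

# Collapse observations that share the same covariate combination (and
# treatment arm) into one row; keep the group size as a sample weight.
compressed <- df %>%
  group_by(x1, x2, w) %>%
  summarise(n = n(), y = mean(y), .groups = "drop")

X <- as.matrix(compressed[, c("x1", "x2")])

forest <- causal_forest(
  X, compressed$y, compressed$w,
  # Group counts as sample weights.
  sample.weights = compressed$n,
  # Note: per the discussion below, min.node.size counts compressed rows,
  # not the raw observations they represent.
  min.node.size = 3
)
```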

davidahirshberg commented 4 years ago

It currently does not.


ferlocar commented 4 years ago

Oh, that's a shame. Could you please tag this as a feature request, or should I create a separate issue for that and close this one?

jtibshirani commented 4 years ago

@ferlocar thanks for your feedback, I tagged this issue as something we should look into.

swager commented 4 years ago

@ferlocar that sounds like a reasonable way to compress your dataset. GRF should mostly support that workflow: we take the weights into account during prediction, and we're actively preparing a PR that will also take them into account during splitting.

One spot where we do not currently take weights into account is with min.node.size (i.e., each observation gets counted once, regardless of its weight). At a high level, min.node.size is a rather indirect (and not necessarily easily interpretable) parameter that lets one regularize a forest (making it bigger makes the forest more stable -- kind of like "lambda" in the lasso). But I'm not aware of any theory motivating the precise definition of min.node.size, so in that sense it's not clear to me whether there is a principled or "universal" generalization of min.node.size to the case with weighted observations. In general, I'd recommend just cross-validating over min.node.size until you find something that works well.
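As a rough sketch of that grid search (the training/validation objects and eval_error() below are placeholders for whatever data split and criterion fit your application):

```r
library(grf)

# Candidate values for min.node.size.
grid <- c(5, 20, 50, 100, 500)

errors <- sapply(grid, function(mns) {
  forest <- causal_forest(
    X_train, Y_train, W_train,
    sample.weights = n_train,
    min.node.size = mns
  )
  tau_hat <- predict(forest, X_val)$predictions
  # eval_error() is a stand-in for your own validation criterion
  # (e.g., an uplift or policy-value metric on held-out data).
  eval_error(tau_hat, Y_val, W_val)
})

best_mns <- grid[which.min(errors)]
```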

Finally, one thing that might help with the compression scheme you're using: instead of collapsing all observations with the same covariate value into a single datapoint, you could collapse them into, say, 20 datapoints (e.g., stratified by cookie value or something), as sketched below. The advantage is that GRF does a lot of sample splitting under the hood, and if you collapse all observations with the same covariate value into a single point, then all of those observations will repeatedly be left out of bag and not taken into account by the corresponding trees at all. In contrast, if you partition these observations into a handful of datapoints, it's less likely that the whole region of covariate space will be left out of bag.
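A sketch of that partitioning, again with illustrative column names: assign each raw observation a random stratum within its covariate (and treatment) cell, then aggregate by cell and stratum instead of by cell alone.

```r
library(dplyr)

k <- 20  # datapoints to keep per covariate combination

partitioned <- df %>%
  group_by(x1, x2, w) %>%
  # Randomly spread each cell's observations across k strata.
  mutate(stratum = sample(rep_len(1:k, n()))) %>%
  group_by(x1, x2, w, stratum) %>%
  summarise(n = n(), y = mean(y), .groups = "drop")

# Fit as before, passing partitioned$n as sample.weights.
```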

ferlocar commented 4 years ago

@jtibshirani thank you! I really appreciate it!

@swager Thanks for your thoughtful comments. The reason I am interested in using min.node.size is that I want to compare the treatment assignments made by the causal forest to the assignments made by other tree-based methods. For example, I'm comparing the causal forest's assignments to those made by a typical random forest (where assignments are based on the difference between the response prediction when treated and the response prediction when untreated). In most of these other "non-causal" methods, there is usually a hyper-parameter that takes the weights into account one way or another (e.g., min_weight_fraction_leaf in the random forest implementation in sklearn), so I was hoping the same would be true of the causal forest to allow a cleaner comparison.
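For reference, this is the two-model scheme I'm comparing against, sketched here with grf's regression_forest just to keep the example in R (the data objects are illustrative):

```r
library(grf)

# Fit separate outcome forests on the treated and control observations.
treated <- W == 1
rf_treated <- regression_forest(X[treated, , drop = FALSE], Y[treated])
rf_control <- regression_forest(X[!treated, , drop = FALSE], Y[!treated])

# Predicted uplift: treated prediction minus untreated prediction.
uplift <- predict(rf_treated, X_new)$predictions -
  predict(rf_control, X_new)$predictions

# Assign treatment where the predicted uplift is positive.
assign_treatment <- uplift > 0
```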

That said, I did proceed the way you suggested (by cross-validating over min.node.size), but I was surprised to find that the causal forest does not perform better than a random forest trained with sklearn (in terms of the quality of treatment assignments). However, now that you mention the sampling procedure, I'm going to try collapsing the data into several data points per covariate value (makes a lot of sense). Another potential explanation is the very large number of data points, which lets me perform so many splits and keep so many observations at the leaf level that perhaps it no longer matters whether I use outcome splitting instead of effect splitting. In any case, thanks for the amazing package!