grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Consistency between tune_causal_forest and causal_forest(...,tune.parameters = T) in tuning #349

Closed adeldaoud closed 5 years ago

adeldaoud commented 5 years ago

Description of the bug A) Tuning via tune_causal_forest(...) produces different results compared to tuning via causal_forest(..., tune.parameters = T). These tuning operations produce different mtry, alpha, and imbalance.penalty parameters. Is this because they use a different number of trees? Which tuning approach should one select?

B) Aside: the manual is unclear about which default parameters are used. I figured from the function arguments that sample.fraction = 0.5, but the other parameters have NULL defaults. Does that simply mean those parameters are not tuned? For example, does "mtry = NULL" imply that the forest will try all the Xs at each split rather than randomly evaluate a subset of the Xs, which is the common case?

Steps to reproduce

# Find the optimal tuning parameters.
n = 50; p = 10
X = matrix(rnorm(n * p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)
params = tune_causal_forest(X, Y, W)$params

# Use these parameters to train a causal forest.
tuned.forest = causal_forest(X, Y, W, num.trees = 1000,
                             min.node.size = as.numeric(params["min.node.size"]),
                             sample.fraction = as.numeric(params["sample.fraction"]),
                             mtry = as.numeric(params["mtry"]),
                             alpha = as.numeric(params["alpha"]),
                             imbalance.penalty = as.numeric(params["imbalance.penalty"]))

# Tune within causal_forest.
tuned.forest1 = causal_forest(X, Y, W, num.trees = 1000, tune.parameters = TRUE)

# Compare the selected parameters.
tuned.forest$tunable.params
tuned.forest1$tunable.params

> tuned.forest$tunable.params
    min.node.size   sample.fraction              mtry             alpha imbalance.penalty 
       1.00000000        0.50000000       10.00000000        0.11053671        0.01897803 
> tuned.forest1$tunable.params
  sample.fraction     min.node.size              mtry             alpha imbalance.penalty 
       0.50000000        1.00000000        5.00000000        0.14455407        0.02364402 

GRF version grf_0.10.2

adeldaoud commented 5 years ago

A side question: can we control or tune the tree depths the forest should consider? From the variable importance function, I gathered that grf uses max.depth = 4.

adeldaoud commented 5 years ago

@jtibshirani after having read the reference you sent in #351, I believe I can answer my own question in A). Are we getting different results because we are drawing random points from the possible parameter space? That is,

"Draw a number of random points in the space of possible parameter values. By default, 100 distinct sets of parameter values are chosen"

So if this is the case, is this an indication that we need to draw and evaluate more parameter points to arrive at approximately the same values (assuming there is a global optimum)?
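One way to probe this concern empirically is to rerun the tuning procedure under different random seeds and compare the selected parameters. This is a minimal sketch (assuming grf 0.10.x, where tune_causal_forest takes (X, Y, W)); if the chosen values vary widely across seeds, the random search is likely under-sampling the parameter space:

```r
library(grf)

# Simulated data, matching the reproduction example above.
n = 50; p = 10
X = matrix(rnorm(n * p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)

# Rerun tuning under three seeds and stack the selected parameters.
runs = lapply(1:3, function(seed) {
  set.seed(seed)
  tune_causal_forest(X, Y, W)$params
})
do.call(rbind, runs)  # inspect the spread of each tuned parameter
```

A large spread across rows suggests the 100 default random draws are not enough to pin down a stable optimum on this problem.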

jtibshirani commented 5 years ago

Hi @adeldaoud, here are some responses to your questions.

Tuning via tune_causal_forest(...) produce different results compared to tuning via causal_forest(...,tune.parameters = T).

There is an important difference between the two methods. Using causal_forest with tune.parameters = TRUE will perform orthogonalization on the outcome + treatment before tuning a causal forest. In contrast, tune_causal_forest does not perform this orthogonalization (and assumes that the outcome + treatment have already been orthogonalized).

I'm sorry this caused confusion; the difference really isn't clear from the documentation or the method signature. We'll make sure to update tune_causal_forest to be consistent with causal_forest, so that it works with the orthogonalized values as well. For now I would recommend using causal_forest for tuning.
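The distinction can be made concrete: since tune_causal_forest in this version expects residualized inputs, you can produce those residuals yourself with regression forests and out-of-bag predictions. A minimal sketch (assuming grf 0.10.x, where predict() without newdata returns out-of-bag predictions):

```r
library(grf)

# Simulated data, matching the reproduction example above.
n = 50; p = 10
X = matrix(rnorm(n * p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)

# Estimate m(x) = E[Y | X = x] and e(x) = E[W | X = x] with regression
# forests, using out-of-bag predictions on the training sample.
Y.hat = predict(regression_forest(X, Y))$predictions
W.hat = predict(regression_forest(X, W))$predictions

# Residualize, then tune on the orthogonalized quantities. This mirrors
# what causal_forest(..., tune.parameters = TRUE) does internally.
params = tune_causal_forest(X, Y - Y.hat, W - W.hat)$params
```

With this residualization in place, the two tuning paths should be evaluating the same problem, up to random-search noise.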

Aside: the manual is unclear about which default parameters are used.

To find the parameter defaults, I would recommend looking at the validation methods in input_utilities.R. You can also find the default values in the algorithm reference. The following parameters can be tuned automatically: min.node.size, sample.fraction, mtry, alpha, and imbalance.penalty. If tune.parameters is set to TRUE and these are left as NULL, their values will be selected automatically instead of using the defaults.

A side questions: can we control or tune the tree depths RF should consider?

The maximum tree depth cannot be controlled directly during training. The most closely related parameter that can be specified is min.node.size. The default value of max.depth in the variable_importance function is fairly arbitrary, and doesn't correspond to any limit placed on depth during training.
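To illustrate the last point: max.depth only controls how many split levels the variable_importance summary weights, not how deep the trained trees grow. A sketch (assuming the variable_importance(forest, decay.exponent, max.depth) signature from grf 0.10.x):

```r
library(grf)

# Simulated data (hypothetical example, not from the thread above).
n = 200; p = 10
X = matrix(rnorm(n * p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

cf = causal_forest(X, Y, W)

# Tree depth during training is governed only indirectly, via
# min.node.size. Changing max.depth below alters which split levels
# the importance measure considers; the forest itself is untouched.
variable_importance(cf, max.depth = 4)
variable_importance(cf, max.depth = 2)
```

Comparing the two outputs shows how the importance ranking shifts as deeper splits are included or excluded from the summary.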

adeldaoud commented 5 years ago

Thanks @jtibshirani. Your reply is clarifying.

Your remark and recommendation about orthogonalization in the tuning process made me think about how I should best estimate my model with grf. This following passage makes me wonder about two things:

In GRF, we avoid this difficulty by 'orthogonalizing' our forest using Robinson's transformation (Robinson, 1988). Before running causal_forest, we compute estimates of the propensity scores e(x) = E[W|X=x] and marginal outcomes m(x) = E[Y|X=x] by training separate regression forests and performing out-of-bag prediction. We then compute the residual treatment W - e(x) and outcome Y - m(x), and finally train a causal forest on these residuals. If propensity scores or marginal outcomes are known through prior means (as might be the case in a randomized trial) they can be specified through the training parameters W.hat and Y.hat. In this case, causal_forest will use these estimates instead of training separate regression forests.

1) I can see that orthogonalization works fine when the outcome and the error are on the same scale (e.g., a continuous variable such as income). But how does orthogonalization affect my estimation when the outcome is binary and the error is continuous? I am essentially shaving off some probability (the error) from my binary outcome. Will this make my estimation less efficient (more variance)?

2) I assume that the propensity score, e(x) = E[W|X=x], uses all the Xs in my dataframe as relevant for treatment assignment. In my causal graph (still an observational study), only a subset of my Xs is relevant for treatment assignment. I read somewhere in the manual (I believe it was under regression_forest) that I can conduct a two-stage procedure, predicting W.hat and Y.hat separately. Correct? First, related to question 1), would this procedure likely produce a less-noisy estimator? Second, what tuning procedure, if any, should one use for the first-stage estimation (in other words, should one have separate or joint tuning procedures for the first- and second-stage estimation)?

Thanks

swager commented 5 years ago

Good questions.

  1. Yes, this type of orthogonalization is valid for both binary and continuous outcomes. All we need for this to work is a good estimate of m(x) = E[Y | X = x], and this quantity is unambiguously defined whether Y is continuous or binary. By default, we estimate m(x) via a regression forest. If Y is binary, we could use specialized methods (e.g., probability estimates from a logistic regression), but there is no need to do so.

  2. If you know that e(.) only depends on a subset of the features, then only using that subset of features to learn e(.) is a good idea. Then you can pass those W.hat predictions to the causal forest. For tuning the e(.) and m(.) models, we recommend separately cross-validating each fit for predictive accuracy. We haven't yet found a different ad hoc tuning strategy that yields systematically better results.
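The two-stage approach described above can be sketched as follows, assuming grf 0.10.x (where causal_forest accepts Y.hat and W.hat, per the documentation passage quoted earlier); the choice of columns 1-3 as the treatment-relevant subset is purely hypothetical:

```r
library(grf)

# Simulated data: treatment depends only on X1 (hypothetical setup).
n = 200; p = 10
X = matrix(rnorm(n * p), n, p)
W = rbinom(n, 1, plogis(X[, 1]))
Y = pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

# Hypothetical: suppose the causal graph says only columns 1-3 drive
# treatment assignment. Learn e(.) on that subset alone.
prop.X = X[, 1:3]
W.hat = predict(regression_forest(prop.X, W))$predictions

# m(.) can still use all features.
Y.hat = predict(regression_forest(X, Y))$predictions

# Pass the first-stage predictions to the causal forest, which then
# uses them instead of fitting its own nuisance regressions.
cf = causal_forest(X, Y, W, Y.hat = Y.hat, W.hat = W.hat)
```

Each first-stage forest can be cross-validated for predictive accuracy on its own, as recommended above, independently of how the causal forest itself is tuned.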

The original issue has now been solved and the discussion is getting far from the original question, so I'm closing the issue.