grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Performance of regression_forest depends on covariate order #253

Closed: halflearned closed this issue 6 years ago

halflearned commented 6 years ago

TL;DR The output of regression_forest seems to depend on covariate order. In one example below, reversing the order of covariates in the matrix caused a ~40% performance loss.

Example overview

This minimal example reproduces the bug for different initial parameters.

We create a matrix X of size (n, p). Only the first regressor, X[,1], is relevant; in fact, it is equal to Y. The other columns are just noise.

We train the regression_forest on (X, Y) twice with the same seed. The second time around, we feed X with its columns in reverse order (X[,p], ..., X[,1]), so the relevant regressor comes last.

Results

The performance, as measured by in-sample MSE, can be very different. The table below shows how large the performance difference can be for different hyperparameters and dataset sizes. We call the usual order "A" and the reversed order "B".

Note how the dependence on mtry is also quite bizarre -- see the numbers in bold below. This suggests a problem in how variables are being randomly selected.

You can also modify the script above so that the other columns of X are constant rather than noise; the results remain the same.

| n | p | num.trees | mtry | A | B | ratio |
|-----|----|-----------|------|------|------|----------|
| 100 | 2  | 100  | 1  | 0.11 | 0.12 | 1.11 |
| 100 | 2  | 1000 | 1  | 0.11 | 0.11 | 1.04 |
| 500 | 2  | 100  | 1  | 0.02 | 0.02 | 0.90 |
| 500 | 2  | 1000 | 1  | 0.02 | 0.02 | 1.00 |
| 100 | 10 | 100  | 1  | 0.56 | 0.55 | 0.97 |
| 100 | 10 | 1000 | 1  | 0.56 | 0.59 | 1.06 |
| 100 | 10 | 100  | 10 | 0.07 | 0.08 | **1.21** |
| 100 | 10 | 1000 | 10 | 0.07 | 0.08 | **1.19** |
| 500 | 10 | 100  | 1  | 0.34 | 0.36 | 1.07 |
| 500 | 10 | 1000 | 1  | 0.35 | 0.34 | 0.98 |
| 500 | 10 | 100  | 10 | 0.01 | 0.01 | **1.27** |
| 500 | 10 | 1000 | 10 | 0.01 | 0.01 | **1.23** |
| 100 | 20 | 100  | 1  | 0.78 | 0.78 | 1.00 |
| 100 | 20 | 1000 | 1  | 0.76 | 0.76 | 1.00 |
| 100 | 20 | 100  | 10 | 0.22 | 0.19 | 0.88 |
| 100 | 20 | 1000 | 10 | 0.17 | 0.19 | 1.09 |
| 100 | 20 | 100  | 20 | 0.06 | 0.09 | **1.41** |
| 100 | 20 | 1000 | 20 | 0.06 | 0.09 | **1.38** |
| 500 | 20 | 100  | 1  | 0.53 | 0.58 | 1.10 |
| 500 | 20 | 1000 | 1  | 0.57 | 0.56 | 0.98 |
| 500 | 20 | 100  | 10 | 0.05 | 0.05 | 1.10 |
| 500 | 20 | 1000 | 10 | 0.04 | 0.05 | 1.17 |
| 500 | 20 | 100  | 20 | 0.01 | 0.01 | **1.45** |
| 500 | 20 | 1000 | 20 | 0.01 | 0.01 | **1.37** |
jtibshirani commented 6 years ago

Hi @halflearned, thank you for the detailed issue and the very clear reproduction!

I agree that the difference is likely due to our sampling strategy for selecting split variables. I did some digging, and came across what seems like the same issue in ranger (the project our code was originally based on): https://github.com/imbs-hl/ranger/issues/151. In short, we're using Knuth's algorithm, which preserves each variable's original order when producing a subset. Since each variable is considered in order when deciding what to split on, variables that appear first tend to be split on more frequently.

We'll plan to incorporate the same fix that was made in ranger, which is to switch to the Fisher-Yates algorithm for sampling. I think the change should be pretty doable for a newcomer, and ranger already provides an example of a fix. Is this something you'd be interested in picking up?

swager commented 6 years ago

Thanks @halflearned, good find!

halflearned commented 6 years ago

Hey @jtibshirani and @swager! The ranger issue linked above seems to be exactly the problem. I'll try to submit a patch by early next week.

halflearned commented 6 years ago

Folks, quick update. I've been working on this since last week, but things are a bit more complicated than I anticipated. As far as I can see, fixing the functions the same way ranger did does not entirely solve the issue.

In fact, it may be the case that ranger actually never really solved their bug either.

I did a similar experiment using ranger: fix the number of regressors at p = 20 and some 1 <= mtry <= 20, and feed the forest the regressors in two different orders. The performance differs depending on the order, although the sign and magnitude of the difference depend on mtry.

The code is in this gist and the results are the following.

```
                Estimate Std. Error     t value Pr(>|t|)
factor(mtry)1  -0.048621    0.00079  -61.575256 0.000000
factor(mtry)2   0.020944    0.00079   26.523676 0.000000
factor(mtry)3   0.021915    0.00079   27.754351 0.000000
factor(mtry)4   0.012012    0.00079   15.212834 0.000000
factor(mtry)5   0.029685    0.00079   37.593601 0.000000
factor(mtry)6   0.051650    0.00079   65.411244 0.000000
factor(mtry)7   0.046597    0.00079   59.012008 0.000000
factor(mtry)8   0.047763    0.00079   60.488170 0.000000
factor(mtry)9   0.098814    0.00079  125.141493 0.000000
factor(mtry)10  0.045762    0.00079   57.953985 0.000000
factor(mtry)11  0.033852    0.00079   42.871581 0.000000
factor(mtry)12  0.041696    0.00079   52.804769 0.000000
factor(mtry)13 -0.015003    0.00079  -19.000184 0.000000
factor(mtry)14  0.000018    0.00079    0.022355 0.982164
factor(mtry)15  0.004213    0.00079    5.335602 0.000000
factor(mtry)16 -0.012750    0.00079  -16.147290 0.000000
factor(mtry)17 -0.014471    0.00079  -18.326560 0.000000
factor(mtry)18  0.002406    0.00079    3.047162 0.002311
factor(mtry)19  0.000844    0.00079    1.069380 0.284901
factor(mtry)20  0.000675    0.00079    0.854686 0.392727
```

To me, the numbers above are so bizarre that I think there might be a problem with my gist. If anyone has the time to take a look at it, I'd appreciate it.

Meanwhile, I'll continue probing to see what is the problem. I've started contemplating the possibility that my C++ compiler is not producing uniformly random numbers, which would screw up our sampling results.

jtibshirani commented 6 years ago

Hi @halflearned, thanks for the update! I took a look at the gist and noticed you were hard-coding the random seed to 1234 (likely left over from some debugging). Without those lines, do you see more reasonable results?

halflearned commented 6 years ago

Sorry for the delay, looks like I was overcomplicating things. PR at #259.

--

By the way, @jtibshirani was right: after removing the hardcoded seed, ranger also produced fine results. But to be honest I'm not sure why that was wrong.

jtibshirani commented 6 years ago

Since the same random seed is used each iteration, it seems you're essentially running the same experiment n_sim = 5000 times, and averaging the results. I think under these conditions, we would expect to see a few significant differences in predictions just by chance.

halflearned commented 6 years ago

Hm, I'm not sure, since on line 34 I "reset" the seed:

```r
rm(.Random.seed, envir = globalenv())
```

But it's okay, we don't have to continue this discussion here :)

halflearned commented 6 years ago

Fixed in #259

tcovert commented 5 years ago

I apologize for reopening a closed issue, but is there a way to impose the Fisher-Yates sampling that is implemented in #259? I ask because I have a recent version of grf (0.10.4) in which I still observe that covariate order matters for regression forest predictions, and it seems that Fisher-Yates only kicks in if some condition is met, which I infer from this code in RandomSampler.cpp:

```cpp
void RandomSampler::draw(std::vector<size_t>& result,
                         size_t max,
                         const std::set<size_t>& skip,
                         size_t num_samples) {
  if (num_samples < max / 10) {
    draw_simple(result, max, skip, num_samples);
  } else {
    draw_fisher_yates(result, max, skip, num_samples);
  }
}
```

To be totally clear, I don't know enough C++ to know what num_samples and max refer to in terms of the options I pass to regression_forest, so I can't even be sure that the problem I am seeing is due to this or something else. However, my problem is comparable in size to the one described in the gist above (n ~ 1000, p ~ 10), so my hypothesis is that if I am seeing a covariate-order dependence in my results, it must be this. Is my problem that 10 columns is too few for Fisher-Yates to matter?

Thanks in advance for any suggestions you have!

erikcs commented 5 years ago

Can you include an example?

lpgarcia18 commented 5 years ago

Hi! I think I'm facing the same problem.

I'm running an instrumental_forest and, even with seeds set, the results change if I change the order of the regressors.

I'm not used to this level of discussion, so I apologize in advance for any mistakes.

I've cut out part of my database and pasted the code I'm using. If I change the order of "GOV_EFFECTIVENESS_LAGGED" and "GDP_PPP_LAGGED", the predictions change, as you can see in the table.

test dataset https://drive.google.com/file/d/126dfk9iDwDr0QxVBBfFfUU_Vj-CfPvQD/view?usp=sharing

```r
library(grf)
library(readr)  # read_csv
library(dplyr)

test <- read_csv("test.CSV")

external_factors <- c("GOV_EFFECTIVENESS_LAGGED", "OOP_PPP_LAGGED", "GDP_PPP_LAGGED")

Y <- test$DELTA_RATE_U5
X <- dplyr::select(test, external_factors)
W <- test$PUBLIC_EXP_LAGGED
Z <- test$TAX_PPP_LAGGED

Y.forest <- regression_forest(X, Y, seed = 233)
Y.hat <- predict(Y.forest)$predictions
W.forest <- regression_forest(X, W, seed = 233)
W.hat <- predict(W.forest)$predictions
Z.forest <- regression_forest(X, Z, seed = 233)
Z.hat <- predict(Z.forest)$predictions

iv_raw <- instrumental_forest(X, Y, W, Z,
                              Y.hat, W.hat, Z.hat,
                              mtry = sqrt(ncol(X)),
                              num.trees = 10000,
                              honesty = TRUE,
                              compute.oob.predictions = TRUE,
                              seed = 233)
varimp <- variable_importance(iv_raw)
selected_idx <- which(varimp > mean(varimp))
iv <- instrumental_forest(X[, selected_idx], Y, W, Z,
                          Y.hat, W.hat, Z.hat,
                          mtry = sqrt(ncol(X)),
                          num.trees = 10000,
                          honesty = TRUE,
                          compute.oob.predictions = TRUE,
                          seed = 233)
set.seed(233)
result <- predict(iv, newdata = NULL, estimate.variance = TRUE)
result
```

First 10 predictions:

| Order: GOV_EFFECTIVENESS_LAGGED, OOP_PPP_LAGGED, GDP_PPP_LAGGED | Order: GDP_PPP_LAGGED, OOP_PPP_LAGGED, GOV_EFFECTIVENESS_LAGGED |
|---|---|
| -7.27798E-05 | -5.8594E-05 |
| -6.73034E-05 | -5.74381E-05 |
| -6.8214E-05 | -5.71053E-05 |
| -6.86638E-05 | -5.96793E-05 |
| -6.60211E-05 | -5.46936E-05 |
| -3.24378E-05 | -2.51731E-05 |
| -3.51665E-05 | -2.89873E-05 |
| -6.4615E-05 | -5.65659E-05 |
| -6.32857E-05 | -5.39281E-05 |
| -6.54502E-05 | -5.4964E-05 |
halflearned commented 5 years ago

Hi @tcovert and @lpgarcia18.

In general, grf results will not remain exactly the same when the covariate order is changed. This means that there's no reason to expect, for example, that two forests fitted as

```r
tau.forest1 <- causal_forest(X[, c(1, 2, 3)], Y, W, seed = 1234)
tau.forest2 <- causal_forest(X[, c(2, 1, 3)], Y, W, seed = 1234)  # switched order!
```

should give the exact same predictions, even though they share the same seed, because there is no way for grf to "know" that I changed the covariate order when creating the trees. This has nothing to do with which sampling algorithm is used.

If this is a concern, I recommend that you simply try growing a larger number of trees. By averaging over many trees, the effect of covariate order should vanish.

lpgarcia18 commented 5 years ago

Thank you very much, @halflearned!!!!