Hi @halflearned, thank you for the detailed issue and the very clear reproduction!
I agree that the difference is likely due to our sampling strategy for selecting split variables. I did some digging and came across what seems to be the same issue in ranger (the project our code was originally based on): https://github.com/imbs-hl/ranger/issues/151. In short, we're using Knuth's algorithm, which preserves each variable's original order when producing a subset. Since each variable is considered in order when deciding what to split on, variables that appear first tend to be split on more frequently.
We plan to incorporate the same fix that was made in ranger, which is to switch to the Fisher-Yates algorithm for sampling. I think the change should be pretty doable for a newcomer, and ranger already provides an example of a fix. Is this something you'd be interested in picking up?
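To make the intuition concrete, here's a toy R sketch (not grf's actual splitting code, just an illustration) of what happens when ties between equally good candidates are broken in iteration order:

```r
# Toy illustration (not grf internals): suppose several candidate variables are
# equally good, so the splitter keeps the first candidate it examines.
set.seed(1)
p <- 20          # total number of variables
mtry <- 10       # size of the candidate subset
n_reps <- 10000

winner_sorted   <- integer(n_reps)
winner_shuffled <- integer(n_reps)

for (i in seq_len(n_reps)) {
  vars <- sort(sample(p, mtry))          # subset kept in original (sorted) order
  winner_sorted[i]   <- vars[1]          # first-examined variable wins the tie
  winner_shuffled[i] <- sample(vars, 1)  # subset visited in random order instead
}

table(winner_sorted)    # heavily concentrated on low-index variables
table(winner_shuffled)  # roughly uniform over 1..p
```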
Thanks @halflearned, good find!
Hey @jtibshirani and @swager! The ranger issue linked above seems to be exactly the problem. I'll try to submit a patch by early next week.
Folks, quick update. I've been working on this since last week, but things are a bit more complicated than I anticipated. As far as I can see, fixing the functions in the same way that `ranger` did does not solve the issue entirely. In fact, it may be the case that `ranger` never really solved its bug either.
I ran a similar experiment using `ranger`: fix the number of regressors at `p = 20` and some `1 <= mtry <= 20`, and feed the forest the regressors in two different orders. The performance differs depending on the order, although the sign and magnitude of the difference depend on `mtry`.
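In condensed form, the comparison looks roughly like this (a sketch only; the data-generating process, simulation loop, and `n_sim` in the actual gist differ):

```r
library(ranger)

# Condensed sketch of the comparison: same data, same seed,
# only the column order of the regressors changes.
set.seed(42)
n <- 500
p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)

oob_mse <- function(col_order, mtry) {
  Xo <- X[, col_order]
  colnames(Xo) <- paste0("V", seq_len(p))
  df <- data.frame(y = y, Xo)
  fit <- ranger(y ~ ., data = df, mtry = mtry, num.trees = 500, seed = 1234)
  mean((fit$predictions - y)^2)  # out-of-bag MSE
}

c(A = oob_mse(1:p, mtry = 5),   # usual order
  B = oob_mse(p:1, mtry = 5))   # reversed order
```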
The code is in this gist and the results are the following.
|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
| --- | --- | --- | --- | --- |
| factor(mtry)1 | -0.048621 | 0.00079 | -61.575256 | 0.000000 |
| factor(mtry)2 | 0.020944 | 0.00079 | 26.523676 | 0.000000 |
| factor(mtry)3 | 0.021915 | 0.00079 | 27.754351 | 0.000000 |
| factor(mtry)4 | 0.012012 | 0.00079 | 15.212834 | 0.000000 |
| factor(mtry)5 | 0.029685 | 0.00079 | 37.593601 | 0.000000 |
| factor(mtry)6 | 0.051650 | 0.00079 | 65.411244 | 0.000000 |
| factor(mtry)7 | 0.046597 | 0.00079 | 59.012008 | 0.000000 |
| factor(mtry)8 | 0.047763 | 0.00079 | 60.488170 | 0.000000 |
| factor(mtry)9 | 0.098814 | 0.00079 | 125.141493 | 0.000000 |
| factor(mtry)10 | 0.045762 | 0.00079 | 57.953985 | 0.000000 |
| factor(mtry)11 | 0.033852 | 0.00079 | 42.871581 | 0.000000 |
| factor(mtry)12 | 0.041696 | 0.00079 | 52.804769 | 0.000000 |
| factor(mtry)13 | -0.015003 | 0.00079 | -19.000184 | 0.000000 |
| factor(mtry)14 | 0.000018 | 0.00079 | 0.022355 | 0.982164 |
| factor(mtry)15 | 0.004213 | 0.00079 | 5.335602 | 0.000000 |
| factor(mtry)16 | -0.012750 | 0.00079 | -16.147290 | 0.000000 |
| factor(mtry)17 | -0.014471 | 0.00079 | -18.326560 | 0.000000 |
| factor(mtry)18 | 0.002406 | 0.00079 | 3.047162 | 0.002311 |
| factor(mtry)19 | 0.000844 | 0.00079 | 1.069380 | 0.284901 |
| factor(mtry)20 | 0.000675 | 0.00079 | 0.854686 | 0.392727 |
To me, the numbers above are so bizarre that I think there might be a problem with my gist. If anyone has the time to take a look at it, I'd appreciate it.
Meanwhile, I'll continue probing to see what the problem is. I've even started contemplating the possibility that my C++ compiler is not producing uniformly random numbers, which would screw up our sampling results.
Hi @halflearned, thanks for the update! I took a look at the gist and noticed you were hard-coding the random seed to 1234
(likely left over from some debugging). Without those lines, do you see more reasonable results?
Sorry for the delay, looks like I was overcomplicating things. PR at #259.
--
By the way, @jtibshirani was right: after removing the hardcoded seed, `ranger` also produced fine results. But to be honest, I'm not sure why hardcoding the seed was a problem in the first place.
Since the same random seed is used in each iteration, it seems you're essentially running the same experiment `n_sim = 5000` times and averaging the results. I think under these conditions we would expect to see a few significant differences in predictions just by chance.
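As a toy illustration of what I mean (a sketch, not your gist): if the seed is reset inside the simulation loop, every iteration draws exactly the same data.

```r
# Toy sketch: re-seeding inside the loop makes every iteration draw exactly
# the same data, so averaging over n_sim iterations doesn't average over anything.
n_sim <- 5
draws <- replicate(n_sim, {
  set.seed(1234)   # hard-coded seed inside the simulation loop
  rnorm(3)
})
draws  # all n_sim columns are identical
```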
Hm, I don't know, since on line 34 I "reset" the seed:
rm(.Random.seed, envir=globalenv())
But it's okay, we don't have to continue this discussion here :)
Fixed in #259
I apologize for reopening a closed issue, but is there a way to force the Fisher-Yates sampling that was implemented in #259? I ask because I'm on a recent version of grf (0.10.4) and I still observe that covariate order matters for regression forest predictions, and it seems that Fisher-Yates only kicks in when a certain condition is met, which I infer from this code in RandomSampler.cpp:
```cpp
void RandomSampler::draw(std::vector<size_t>& result,
                         size_t max,
                         const std::set<size_t>& skip,
                         size_t num_samples) {
  if (num_samples < max / 10) {
    draw_simple(result, max, skip, num_samples);
  } else {
    draw_fisher_yates(result, max, skip, num_samples);
  }
}
```
To be totally clear, I don't know enough C++ to know what `num_samples` and `max` correspond to among the options I pass to `regression_forest`, so I can't even be sure that the problem I'm seeing is due to this rather than something else. However, my problem is similarly small in scale to the one described in the gist above (n ~ 1000, p ~ 10), so my hypothesis is that if I'm seeing a dependence on covariate order in my results, it must be this. Is my problem that 10 columns are too few for Fisher-Yates to matter?
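For what it's worth, here is the back-of-envelope check I had in mind, under the (possibly wrong) assumption that `num_samples` corresponds to the `mtry` I pass and `max` to the number of columns of `X`:

```r
# Back-of-envelope check, assuming (just my guess) num_samples = mtry and max = ncol(X)
p <- 10        # roughly my number of covariates
mtry <- 3      # whatever mtry I pass to regression_forest
mtry < p / 10  # FALSE, so draw_fisher_yates would already be the branch taken
```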
Thanks in advance for any suggestions you have!
Can you include an example?
Hi! I think I'm facing the same problem.
I'm running an instrumental_forest and, even with seeds set, the results change if I change the order of the regressors.
I'm not used to this level of discussion, so I apologize in advance for any mistakes.
I've cut out part of my database and pasted the code I'm using below. If I change the order of "GOV_EFFECTIVENESS_LAGGED" and "GDP_PPP_LAGGED", the predictions change, as you can see in the table.
test dataset https://drive.google.com/file/d/126dfk9iDwDr0QxVBBfFfUU_Vj-CfPvQD/view?usp=sharing
```r
library(readr)
library(grf)

test <- read_csv("test.CSV")

external_factors <- c("GOV_EFFECTIVENESS_LAGGED", "OOP_PPP_LAGGED", "GDP_PPP_LAGGED")

Y <- test$DELTA_RATE_U5
X <- dplyr::select(test, external_factors)
W <- test$PUBLIC_EXP_LAGGED
Z <- test$TAX_PPP_LAGGED

# Nuisance estimates fed into the instrumental forest
Y.forest <- regression_forest(X, Y, seed = 233)
Y.hat <- predict(Y.forest)$predictions
W.forest <- regression_forest(X, W, seed = 233)
W.hat <- predict(W.forest)$predictions
Z.forest <- regression_forest(X, Z, seed = 233)
Z.hat <- predict(Z.forest)$predictions

iv_raw <- instrumental_forest(X, Y, W, Z,
                              Y.hat, W.hat, Z.hat,
                              mtry = sqrt(ncol(X)),
                              num.trees = 10000,
                              honesty = TRUE,
                              compute.oob.predictions = TRUE,
                              seed = 233)

# Keep only covariates with above-average variable importance, then refit
varimp <- variable_importance(iv_raw)
selected_idx <- which(varimp > mean(varimp))

iv <- instrumental_forest(X[, selected_idx], Y, W, Z,
                          Y.hat, W.hat, Z.hat,
                          mtry = sqrt(ncol(X)),
                          num.trees = 10000,
                          honesty = TRUE,
                          compute.oob.predictions = TRUE,
                          seed = 233)

set.seed(233)
result <- predict(iv, newdata = NULL, estimate.variance = TRUE)
result
```
First 10 predictions:

| "GOV_EFFECTIVENESS_LAGGED", "OOP_PPP_LAGGED", "GDP_PPP_LAGGED" | "GDP_PPP_LAGGED", "OOP_PPP_LAGGED", "GOV_EFFECTIVENESS_LAGGED" |
| --- | --- |
| -7.27798E-05 | -5.8594E-05 |
| -6.73034E-05 | -5.74381E-05 |
| -6.8214E-05 | -5.71053E-05 |
| -6.86638E-05 | -5.96793E-05 |
| -6.60211E-05 | -5.46936E-05 |
| -3.24378E-05 | -2.51731E-05 |
| -3.51665E-05 | -2.89873E-05 |
| -6.4615E-05 | -5.65659E-05 |
| -6.32857E-05 | -5.39281E-05 |
| -6.54502E-05 | -5.4964E-05 |
Hi @tcovert and @lpgarcia.
It's not possible for grf results to remain the same if the covariate order is changed. This means that there's no reason to expect, for example, that two forests fitted as

```r
tau.forest1 <- causal_forest(X[, c(1, 2, 3)], Y, W, seed = 1234)
tau.forest2 <- causal_forest(X[, c(2, 1, 3)], Y, W, seed = 1234)  # switched order!
```

should give the exact same predictions, even though they share the same seed, because there is no way for grf to "know" that I changed the covariate order when creating the trees. This has nothing to do with which sampling algorithm is used.
If this is a concern, I recommend that you simply try growing a larger number of trees. By averaging over many trees, the effect of covariate order should vanish.
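For instance, with some made-up data (a rough sketch, not taken from either of your examples), you can check that the discrepancy between the two orderings shrinks as `num.trees` grows:

```r
library(grf)

# Rough sketch with simulated data: as num.trees grows, predictions from the
# two column orderings should agree more and more closely.
set.seed(1)
n <- 500
p <- 3
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] * W + rnorm(n)

for (nt in c(200, 2000, 20000)) {
  f1 <- causal_forest(X[, c(1, 2, 3)], Y, W, num.trees = nt, seed = 1234)
  f2 <- causal_forest(X[, c(2, 1, 3)], Y, W, num.trees = nt, seed = 1234)
  cat(nt, "trees:",
      max(abs(predict(f1)$predictions - predict(f2)$predictions)), "\n")
}
```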
Thank you very much, @halflearned!!!!
TL;DR The output of regression_forest seems to depend on covariate order. In one example below, reversing the order of covariates in the matrix caused a ~40% performance loss.
Example overview
This minimal example reproduces the bug for different initial parameters.
- We create a matrix `X` of size (n, p). Only the first regressor, `X[,1]`, is relevant; in fact, it is equal to `Y`. The other columns are just noise.
- We train the `regression_forest` on `(X, Y)` twice, with the same seed. The second time around, we feed in `X` with its columns in reverse order (`X[,p]`, ..., `X[,1]`, with the relevant regressor coming last).

Results
The performance, as measured by in-sample MSE, can be very different. The table below shows how bad the performance difference can get for different hyper-parameters and dataset sizes. We call the usual order "A" and the reversed order "B".
Note how the dependence on `mtry` is also quite bizarre -- see the numbers in bold below. This suggests a problem in how variables are being randomly selected.

You can also modify the file above so that all the other columns of `X` are not just noise but constant, and the results remain the same.
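For completeness, here is a stripped-down version of the reproduction (the attached file additionally loops over the dataset sizes and hyper-parameters shown in the table):

```r
library(grf)

# Condensed reproduction: same data, same seed, only the column order changes.
set.seed(123)
n <- 2000
p <- 10
mtry <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1]  # the first regressor is exactly Y; the rest is noise

fit_mse <- function(Xmat) {
  forest <- regression_forest(Xmat, Y, mtry = mtry, seed = 1234)
  mean((predict(forest)$predictions - Y)^2)  # in-sample (out-of-bag) MSE
}

c(A = fit_mse(X),          # usual order: relevant regressor first
  B = fit_mse(X[, p:1]))   # reversed order: relevant regressor last
```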