Description of the bug
While working with a file, I changed the column order of the X covariate data frame and started getting different estimates, even though I kept the same seed. Granted, the differences were not drastic.
For example, if you have 5 features, and you order the columns differently:
1 2 3 4 5
5 4 3 2 1
1 5 2 4 3
Will all of these always produce the same estimate?
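The question above can be checked directly on simulated data. The following is a minimal sketch, not the data or the exact pipeline from this report: the simulated `X` and `Y`, the sample size, and the `orderings` list are all assumptions made for illustration. It fits one `regression_forest` per column ordering with the same seed and a single thread, then reports how far the out-of-bag predictions move relative to the first ordering.

```r
# Sketch only: simulated data, not the data from this report.
library(grf)

set.seed(42)
n <- 500
p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
Y <- X[, "x1"] + 0.5 * X[, "x2"] + rnorm(n)

# The same five columns in the three orderings listed above.
orderings <- list(c(1, 2, 3, 4, 5), c(5, 4, 3, 2, 1), c(1, 5, 2, 4, 3))

# Fit one forest per ordering with a fixed seed and a single thread,
# so that column order is the only thing that changes between fits.
preds <- lapply(orderings, function(ord) {
  forest <- regression_forest(X = X[, ord], Y = Y, seed = 1111, num.threads = 1)
  predict(forest)$predictions
})

# Largest absolute difference in out-of-bag predictions versus the first ordering.
max_diff <- sapply(preds, function(p_hat) max(abs(p_hat - preds[[1]])))
print(max_diff)
```

The first entry compares the first ordering with itself, so it is zero by construction; non-zero values in the other entries would indicate that the fit depends on column order even with a fixed seed.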
Steps to reproduce
This is where I manually changed the ordering:
namers <- names(X)  # get the column names from the same file in a different order
X1 <- X1 %>% dplyr::select(all_of(namers))  # reorder the variables in the covariate file
# estimate outcome of forest
Y.forest <- regression_forest(X = X1, Y = Y1, clusters = clus, equalize.cluster.weights = FALSE, seed = 1111)
# orthogonalized Y
Y.hat <- predict(Y.forest)$predictions
# estimate propensity forest
W.forest <- regression_forest(X = X1, Y = W1, clusters = clus, equalize.cluster.weights = FALSE, seed = 1111)
# orthogonalized treatment
W.hat <- predict(W.forest)$predictions
# estimate initial causal forest
cf.raw <- causal_forest(X = X1, Y = Y1, W = W1, Y.hat = Y.hat, W.hat = W.hat, clusters = clus, equalize.cluster.weights = FALSE, seed = 1111)
# find predictors that had greater than average importance
varimp <- variable_importance(cf.raw)
selected.idx <- which(varimp > mean(varimp))
selected.idx2 <- which(colnames(X1) == 'time_var')
selected.idx3 <- c(selected.idx, selected.idx2)
print(selected.idx3)
varimp <- data.frame(variable_importance(cf.raw))
varimp$names <- names(X1)
X2 <- X1
X3 <- X2 %>% dplyr::select(all_of(selected.idx3))
# estimate final causal forest with most important predictors and tune parameters
cf <- causal_forest(X = X3, Y = Y1, W = W1, Y.hat = Y.hat, W.hat = W.hat, clusters = clus,
sample.weights = weight,
tau.hat <- predict(cf)$predictions
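
One detail worth noting when comparing runs: `selected.idx` holds column positions, so the printed indices will differ between orderings even when the same variables are selected. A minimal base-R sketch of comparing the selection by name instead is shown here. The helper `selected_names` and the importance values are hypothetical, made up for illustration, and are not part of grf.

```r
# Hypothetical helper (not part of grf): names of the columns whose
# importance is above the mean, sorted so the result ignores column order.
selected_names <- function(importance, col_names) {
  importance <- as.numeric(importance)
  stopifnot(length(importance) == length(col_names))
  sort(col_names[which(importance > mean(importance))])
}

# Made-up importance values for five columns in their original order.
imp <- c(0.40, 0.05, 0.30, 0.05, 0.10)
cols <- c("x1", "x2", "x3", "x4", "x5")

# The same columns and values in reversed order.
ord <- 5:1

# Positional indices change with the ordering...
print(which(imp > mean(imp)))
print(which(imp[ord] > mean(imp[ord])))

# ...while the selected names do not.
sel_a <- selected_names(imp, cols)
sel_b <- selected_names(imp[ord], cols[ord])
print(identical(sel_a, sel_b))
```

Comparing by name separates two effects: indices that differ only because the columns moved, and a selection that genuinely changed because the fitted forest changed.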
GRF version 2.3.0