It seems that the out-of-sample residuals returned by the crossval function are permuted rather than matching the order of the original data. For example, with two folds containing observations 1, 2, 5, 6 and 3, 4, 7, 8, the residuals correspond to observations 1 2 5 6 3 4 7 8 and not 1 to 8:
set.seed(1)
n <- 10
X <- rnorm(n)
Y <- X + rnorm(n)
data <- data.frame(Y, X)
# split data to two folds
idx <- ddml:::generate_subsamples(n, 2)
out_of_sample_residuals_cv <- numeric(n)
for (i in seq_along(idx)) {
  idx_i <- idx[[i]]
  # fit on the other fold, predict on the held-out fold
  fit <- lm(Y ~ -1 + X, data = data[-idx_i, ])
  out_of_sample_residuals_cv[idx_i] <- data$Y[idx_i] - predict(fit, newdata = data[idx_i, ])
}
# same with crossval
fit <- crossval(matrix(data$Y, ncol = 1), matrix(data$X),
                learners = list(list(fun = ols)),
                silent = TRUE,
                cv_subsamples = idx)
c(fit$oos_resid)
# [1] -1.6403818 -1.0455853 1.8443046 1.5234501 0.2214465 1.9249814 0.2687147 -3.2669249 0.9075922 -0.3376917
out_of_sample_residuals_cv
# [1] 1.9249814 0.2687147 -1.6403818 -3.2669249 0.9075922 -1.0455853 -0.3376917 1.8443046 1.5234501 0.2214465
c(fit$oos_resid)[order(unlist(idx))]
# [1] 1.9249814 0.2687147 -1.6403818 -3.2669249 0.9075922 -1.0455853 -0.3376917 1.8443046 1.5234501 0.2214465
This is very problematic when the residuals are used in subsequent modelling together with other variables that are still in the original order. A simple fix would be to change the line
oos_resid <- unlist(cv_res)
to
oos_resid <- unlist(cv_res)[order(unlist(cv_subsamples))]
Or, if that would break something, a note in the documentation about this would at least be welcome (the docs mention chronological order, but I read that as meaning that the columns are ordered by order of appearance of the learners), and/or maybe an argument that can be used to control this behaviour?
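To show why indexing with order(unlist(cv_subsamples)) restores the original order, here is a minimal base R sketch that does not use ddml at all. The fold assignment and the residual values are made up for illustration (they are assumptions, not output of crossval); stacked only mimics what unlist(cv_res) would look like when fold results are concatenated fold by fold.

```r
# Hypothetical fold assignment for 8 observations (not generated by ddml)
idx <- list(c(1, 2, 5, 6), c(3, 4, 7, 8))

# Made-up "residuals": the value for observation i is i * 10,
# so the position of each value is easy to check by eye
true_resid <- (1:8) * 10

# Concatenate fold by fold, analogous to unlist(cv_res)
stacked <- unlist(lapply(idx, function(i) true_resid[i]))

# Undo the permutation as in the proposed fix
restored <- stacked[order(unlist(idx))]

# Equivalent alternative: assign into the original positions
restored_by_assignment <- numeric(length(stacked))
restored_by_assignment[unlist(idx)] <- stacked

stopifnot(identical(restored, true_resid))
stopifnot(identical(restored_by_assignment, true_resid))
```

Until the ordering is changed or documented, the same reindexing can be applied to fit$oos_resid on the user side, as in the example above.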
Edit: The example was messed up, now fixed.