thomaswiemann / ddml

ddml: Double/Debiased Machine Learning in R
https://thomaswiemann.com/ddml/
GNU General Public License v3.0

`crossval` returns residuals in permuted order #54

Closed: helske closed this issue 9 months ago

helske commented 9 months ago

It seems that the out-of-sample residuals returned by the `crossval` function are permuted rather than matching the order of the original data. For example, with two folds containing observations 1, 2, 5, 6 and 3, 4, 7, 8, the residuals correspond to observations 1 2 5 6 3 4 7 8 and not 1 to 8:


library(ddml)  # provides crossval() and ols()
set.seed(1)
n <- 10
X <- rnorm(n)
Y <- X + rnorm(n)
data <- data.frame(Y, X)

# split the data into two folds
idx <- ddml:::generate_subsamples(n, 2)
out_of_sample_residuals_cv <- numeric(n)
for (i in seq_along(idx)) {
  idx_i <- idx[[i]]
  fit <- lm(Y ~ -1 + X, data = data[-idx_i,])
  out_of_sample_residuals_cv[idx_i] <- data$Y[idx_i] - predict(fit, newdata = data[idx_i,])
}
# same with crossval
fit <- crossval(matrix(data$Y, ncol = 1), matrix(data$X),
                learners = list(list(fun = ols)),
                silent = TRUE,
                cv_subsamples = idx)
c(fit$oos_resid)
# [1] -1.6403818 -1.0455853  1.8443046  1.5234501  0.2214465  1.9249814  0.2687147 -3.2669249  0.9075922 -0.3376917
out_of_sample_residuals_cv
# [1]  1.9249814  0.2687147 -1.6403818 -3.2669249  0.9075922 -1.0455853 -0.3376917  1.8443046  1.5234501  0.2214465
c(fit$oos_resid)[order(unlist(idx))]
# [1]  1.9249814  0.2687147 -1.6403818 -3.2669249  0.9075922 -1.0455853 -0.3376917  1.8443046  1.5234501  0.2214465

This is very problematic when the residuals are used in subsequent modelling alongside other variables that are still in the original order. A simple fix would be to change the line `oos_resid <- unlist(cv_res)` to `oos_resid <- unlist(cv_res)[order(unlist(cv_subsamples))]`.
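To illustrate why indexing by `order(unlist(...))` undoes the permutation, here is a minimal base-R sketch that does not call ddml at all. The fold assignment matches the two-fold example above, and the stand-in "residuals" are made-up values (10 times the observation index) chosen only so that the correct position is easy to read off. It assumes a single vector of residuals; if the package stores one column per learner, the same index would presumably need to be applied to rows.

```r
# Folds from the two-fold example: observations 1, 2, 5, 6 and 3, 4, 7, 8.
idx <- list(c(1, 2, 5, 6), c(3, 4, 7, 8))

# Stand-in residuals (hypothetical values): 10 times the observation index.
cv_res <- lapply(idx, function(i) 10 * i)

# Concatenating fold by fold yields fold order, not data order.
permuted <- unlist(cv_res)
permuted
# [1] 10 20 50 60 30 40 70 80

# order(unlist(idx)) is the inverse of the permutation unlist(idx),
# so indexing with it puts each value back at its original position.
restored <- permuted[order(unlist(idx))]
restored
# [1] 10 20 30 40 50 60 70 80

# Equivalent alternative: assign into a preallocated vector by index,
# as the manual loop in the reproduction does.
restored2 <- numeric(length(permuted))
restored2[unlist(idx)] <- permuted
identical(restored, restored2)
# [1] TRUE
```

Both approaches rely on the folds forming a partition of `1:n`, that is, every observation appearing in exactly one fold.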

Or, if that would break something, at least a note in the documentation about this would be welcome (the docs mention chronological order, but I read that as meaning that the columns are ordered by order of appearance of the learners), and/or maybe an argument that can be used to control this behaviour?

Edit: The example was messed up, now fixed.

thomaswiemann commented 9 months ago

Thanks @helske for catching this bug & providing the fix too!