Closed ccmullally closed 2 years ago
Dear @ccmullally,
Thank you for your interest in our package. Based on your description, this error indeed seems to be strange. Would it be possible to provide a replicable example of the bug so that I can look into it? Thank you in advance!
Best, Max
I'm not able to replicate the issue using data bundled with R. I guess it's just a quirk of that one column, for whatever reason. Do you have any idea what the error means? I looked in your source code and did not see any clues. If not, no worries. Thanks for your help!
You're welcome! Could you please share the arguments you passed to `GenericML()`? What could have happened is that there is an error or warning in the fitting of one or more machine learners, which causes the output of the `mlr3` framework to differ from what an internal function within `GenericML()` expects. For instance, if there is perfect collinearity in the covariate matrix `Z`, linear models break down, so we can no longer make predictions with them. Hence, a corrupted (possibly empty) prediction object will be returned by the `mlr3` framework for that particular learner, which ultimately causes the error in `GenericML()`. I will try to emulate such behavior to see if I can replicate this error.
In addition, could you please check whether the error persists for different choices of machine learners? If it only occurs when a specific machine learner is in the set of considered learners, then this would serve as (sort of) evidence that the problem is caused by an issue in the fitting of that particular learner rather than a bug in `GenericML()`.
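As a quick standalone diagnostic (not part of the package), you can check whether a covariate matrix is rank-deficient, i.e. perfectly collinear, before fitting. This is only a sketch with made-up data:

```r
# Build a small covariate matrix with a deliberately duplicated column,
# which creates perfect collinearity.
set.seed(1)
Z <- cbind(x1 = rnorm(100), x2 = rnorm(100))
Z <- cbind(Z, x3 = Z[, "x1"])  # x3 duplicates x1

# If the QR rank is smaller than the number of columns, Z is rank-deficient
# and linear learners (lm, lasso, etc.) may behave unexpectedly.
qr(Z)$rank < ncol(Z)  # TRUE indicates rank deficiency
```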
Sorry for the delay. You were right about sensitivity to learners. I eliminated the lasso and everything runs fine. I'm not sure if it is because of collinearity. I created a copy of a column in the Z matrix, removed the original offending column, and everything ran fine. I've included the arguments below in case you want to see them, but the issue appears to be how glmnet is handling things when the problematic column is included.
```r
OutcomeList <- data.frame(rbind("female_business", "female_business_profit", "fem_hours_hhprod",
                                "empowerment", "no_role_index", "major_role_index", "some_role_index",
                                "land_house", "production", "dailly_activity"))
# OutcomeList <- data.frame(rbind("no_role_index2014", "major_role_index2014"))
D <- as.numeric(data$program) # treatment
Z <- as.matrix(cbind(data$dependent, data$workingagemale, data$workingagefemale,
                     data$edu_head, data$age_head, data$sex_head, data$concrete_wall,
                     data$distance_mrkt, data$concrete_floor, data$sanitary,
                     data$owned_land, data$cow, data$goat, data$chicken)) # matrix of covariates
Z_CLAN <- Z # heterogeneity-checking variables for CLAN
cluster <- data$bocd # cluster variable
################################################################################
# GenericML parameters
# quantile cutoffs for the GATES grouping of the estimated CATEs
quantile_cutoffs <- c(0.33, 0.66) # 3 groups
# specify the learner of the propensity score; 0.5 since this is an RCT
learner_propensity_score <- rep(0.5, length(D))
# specify the considered learners of the BCA and the CATE (here: random forest and SVM)
learners_GenericML <- c("mlr3::lrn('ranger', num.trees = 100)", "mlr3::lrn('svm')")
# specify the number of splits
num_splits <- 3
# specify whether a HT transformation shall be used when estimating BLP and GATES
HT <- FALSE
# lists controlling the variables used in the matrix X1 for the BLP and GATES regressions
X1_BLP <- setup_X1()
X1_GATES <- setup_X1()
# consider differences between group K (most affected) and groups 1 and 2, respectively
diff_GATES <- setup_diff(subtract_from = "most", subtracted = c(1, 2))
diff_CLAN  <- setup_diff(subtract_from = "most", subtracted = c(1, 2))
# specify the significance level
significance_level <- 0.05
# minimum variation of predictions before Gaussian noise with variance var(Y)/20 is added
min_variation <- 1e-05
# specify cluster-robust standard errors via vcovCL() in 'sandwich'
# and pass the clusters as arguments
vcov_BLP <- setup_vcov(estimator = "vcovCL", arguments = list(cluster = cluster))
vcov_GATES <- vcov_BLP # same for GATES
# specify whether or not the group variances of the most and least affected
# groups are assumed equal in CLAN
equal_variances_CLAN <- FALSE
# proportion of samples selected into the auxiliary set
prop_aux <- 0.5
# specify whether the splits and auxiliary results of the learners shall be stored
store_splits <- TRUE
store_learners <- TRUE
# parallelization options (currently only supported on Unix systems)
parallel <- FALSE
# num_cores <- 4 # 4 cores
seed <- 123456
```
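For context, the objects above appear to be named after the corresponding arguments of `GenericML()`, so the eventual call would look roughly as follows. This is only a sketch: `Y` (the outcome vector, taken from one of the outcomes in `OutcomeList`) is not defined above, and the exact argument list should be checked against the package documentation:

```r
# Sketch of the call, assuming the variable names above mirror GenericML()'s
# argument names. 'Y' is a placeholder for one outcome column of the data.
result <- GenericML(Z = Z, D = D, Y = Y,
                    learners_GenericML = learners_GenericML,
                    learner_propensity_score = learner_propensity_score,
                    num_splits = num_splits,
                    Z_CLAN = Z_CLAN,
                    HT = HT,
                    quantile_cutoffs = quantile_cutoffs,
                    X1_BLP = X1_BLP, X1_GATES = X1_GATES,
                    diff_GATES = diff_GATES, diff_CLAN = diff_CLAN,
                    vcov_BLP = vcov_BLP, vcov_GATES = vcov_GATES,
                    equal_variances_CLAN = equal_variances_CLAN,
                    prop_aux = prop_aux,
                    significance_level = significance_level,
                    min_variation = min_variation,
                    parallel = parallel, seed = seed,
                    store_splits = store_splits,
                    store_learners = store_learners)
```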
Thank you! I remember that in one of my past projects, `glmnet` also threw a cryptic error that I didn't really understand. Collinearity in the data matrix wasn't the issue. I don't remember how I fixed it, but perhaps choosing a different value of the `alpha` parameter could do the job. For instance, for `alpha = 0.5`, which leads to an elastic-net regularization penalty as in Zou and Hastie (2005), you would need to specify the learner as follows: `'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 0.5)'`.
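To see whether the problem is specific to the pure lasso (`alpha = 1`), one can also try `glmnet` directly, outside of `mlr3` and `GenericML()`. The sketch below uses made-up data with a duplicated column (assumes the `glmnet` package is installed); `alpha = 1` is the lasso, `alpha = 0` is ridge, and `alpha = 0.5` is an elastic net in between:

```r
library(glmnet)

# Simulated design matrix with a deliberately duplicated column.
set.seed(1)
x <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
x <- cbind(x, x[, 1])  # duplicate column: perfect collinearity
y <- rnorm(100)

# Cross-validated elastic net; if this runs while alpha = 1 fails on your
# data, that points to a lasso-specific issue inside glmnet.
fit <- cv.glmnet(x, y, alpha = 0.5)
head(predict(fit, newx = x, s = "lambda.min"))
```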
Overall, the problem seems to be caused by unanticipated behavior of `glmnet`. I don't know whether this is a bug in `glmnet`, but I don't think I can solve this issue within the `GenericML` package. I hope that choosing a different value for the `alpha` parameter resolves the problem so that you can at least use one regularized linear regression estimator. I'm truly sorry if this is not a satisfactory fix, but I'm afraid that's all I can do without reaching out to the `glmnet` maintainer.
You've been a huge help, thanks!
You're welcome!
I love the package. But I am running into the following error:
```
Error in generic_targets[[i]]$GATES[, , s] <- generic.ml.obj$GATES$generic_targets :
  number of items to replace is not a multiple of replacement length
```
Strangely, if I only do 2 splits, everything runs fine; with more than 2, it fails. Also, the error goes away completely when I drop a specific column from the Z matrix: without that column, the code runs fine even with more than 2 splits. There appears to be nothing special about the column; no missing values, plenty of variation. What might cause that error message? Happy to provide code and data. Thank you.
Here is the output from traceback():