Closed boennecd closed 3 years ago
Thanks for reminding. In practice, we haven't seen errors due to extreme Z values. The reason may be that we assume Z values follow standard normal distribution and construct the observation values of Z to approximately satisfy such assumption. As a result, the imputed entries of Z, as conditional mean of the observed entries, may hardly attain extreme values less than -37. Nevertheless, we improve on our algorithm by replacing the 0 entries in xmis_loc with 1, which will prevent the error from happening.
It seems to happen in this case with the GSS data you use in one of your papers:
# get the GSS data
download.file(
"https://github.com/yuxuanzhao2295/Missing-Value-Imputation-for-Mixed-Data-via-Gaussian-Copula/raw/master/dataset/GSS_2014_18var.RData",
rdat_file <- tempfile())
load(rdat_file)
# function to mask data.
#
# Args:
# dat: data set to use.
# p_na: proportion which is NA.
# do_mask: character with columns to mask. All columns are selected if this is
# not NULL.
mask_dat <- function(dat, p_na, do_mask){
# samples NA entries
is_mask <- matrix(runif(NROW(dat) * NCOL(dat)) < p_na, NROW(dat))
if(!is.null(do_mask)){
stopifnot(length(setdiff(do_mask, colnames(dat))) == 0L)
is_mask[, !colnames(dat) %in% do_mask] <- FALSE
}
# make sure we have no rows with all missing data
while(any(all_nans <- rowSums(is_mask) == NCOL(is_mask)))
is_mask[all_nans, ] <- runif(sum(all_nans) * NCOL(dat)) < p_na
# mask data
dat[is_mask] <- NA
# remove unused levels and return
droplevels(dat)
}
# use the function and fit the model
set.seed(62486992L)
dat_use <- mask_dat(GSS_2014_18var, p_na = .05,
do_mask = c("CLASS_", "LIFE", "HEALTH", "HAPPY", "RINCOME"))
fit <- mixedgcImp::impute_mixedgc(dat_use)
#R> Error in Xnew[miss_ind, j] <- sort(X[-miss_ind, j])[xmis_loc] :
#R> number of items to replace is not a multiple of replacement length
#R> In addition: Warning message:
#R> In em_mixedgc(Z_continuous = Z_continuous, r_lower = r_lower, r_upper = r_upper, :
#R> Max iter reached in EM
The cause is the close to singular estimated correlation matrix I guess. This is hard to get from the error message. I guess one could add a line with xmis_loc <- pmax(pmin(xmis_loc, (n-length(miss_ind))), 1L)
or give a more informative error message earlier.
The reason of failed convergence is that the variable "WEEKSWRK" is regarded as a continuous variable under the default setting "nlevels = 20". When nlevels is set up as some number larger than the number of levels of "WEEKSWRK", say 60, the convergence can be achieved in no more than 20 iterations.
The improvement I made according to your reported error, xmis_loc[xmis_loc == 0] = 1,
will not produce such error even when the convergence is not achieved as in your reported situation. You can re-install the latest version.
Thanks again for taking the time to report bugs!
This line may fail if Z is extremely small
https://github.com/udellgroup/mixedgcImp/blob/c2b5dd7cd2bd5c033d061e0de78367c47923440b/R/transform_fun.R#L70
E.g. R will return zero from
pnorm
when the Z entry is less than -37.5193. See https://github.com/SurajGupta/r-source/blob/master/src/nmath/pnorm.c#L262-L265.