udellgroup / gcimputeR

Missing value imputation using Gaussian copula
8 stars 0 forks source link

Ximp_transform with extreme Z values #2

Closed boennecd closed 3 years ago

boennecd commented 3 years ago

This line may fail if Z is extremely small

https://github.com/udellgroup/mixedgcImp/blob/c2b5dd7cd2bd5c033d061e0de78367c47923440b/R/transform_fun.R#L70

E.g. R will return zero from pnorm when the Z entry is less than -37.5193. See https://github.com/SurajGupta/r-source/blob/master/src/nmath/pnorm.c#L262-L265.

yuxuanzhao2295 commented 3 years ago

Thanks for reminding. In practice, we haven't seen errors due to extreme Z values. The reason may be that we assume Z values follow standard normal distribution and construct the observation values of Z to approximately satisfy such assumption. As a result, the imputed entries of Z, as conditional mean of the observed entries, may hardly attain extreme values less than -37. Nevertheless, we improve on our algorithm by replacing the 0 entries in xmis_loc with 1, which will prevent the error from happening.

boennecd commented 3 years ago

It seems to happen in this case with the GSS data you use in one of your papers:

# get the GSS data
download.file(
  "https://github.com/yuxuanzhao2295/Missing-Value-Imputation-for-Mixed-Data-via-Gaussian-Copula/raw/master/dataset/GSS_2014_18var.RData",
  rdat_file <- tempfile())
load(rdat_file)

# function to mask data.
#
# Args:
#   dat: data set to use.
#   p_na: proportion which is NA.
#   do_mask: character with columns to mask. All columns are selected if this is
#   not NULL.
mask_dat <- function(dat, p_na, do_mask){
  # samples NA entries
  is_mask <- matrix(runif(NROW(dat) * NCOL(dat)) < p_na, NROW(dat))
  if(!is.null(do_mask)){
    stopifnot(length(setdiff(do_mask, colnames(dat))) == 0L)
    is_mask[, !colnames(dat) %in% do_mask] <- FALSE
  }

  # make sure we have no rows with all missing data
  while(any(all_nans <- rowSums(is_mask) == NCOL(is_mask)))
    is_mask[all_nans, ] <- runif(sum(all_nans) * NCOL(dat)) < p_na

  # mask data
  dat[is_mask] <- NA

  # remove unused levels and return
  droplevels(dat)
}

# use the function and fit the model
set.seed(62486992L)
dat_use <- mask_dat(GSS_2014_18var, p_na = .05,
                    do_mask = c("CLASS_", "LIFE", "HEALTH", "HAPPY", "RINCOME"))

fit <- mixedgcImp::impute_mixedgc(dat_use)
#R> Error in Xnew[miss_ind, j] <- sort(X[-miss_ind, j])[xmis_loc] : 
#R>   number of items to replace is not a multiple of replacement length
#R> In addition: Warning message:
#R> In em_mixedgc(Z_continuous = Z_continuous, r_lower = r_lower, r_upper = r_upper,  :
#R>   Max iter reached in EM

The cause is the close to singular estimated correlation matrix I guess. This is hard to get from the error message. I guess one could add a line with xmis_loc <- pmax(pmin(xmis_loc, (n-length(miss_ind))), 1L) or give a more informative error message earlier.

yuxuanzhao2295 commented 3 years ago

The reason of failed convergence is that the variable "WEEKSWRK" is regarded as a continuous variable under the default setting "nlevels = 20". When nlevels is set up as some number larger than the number of levels of "WEEKSWRK", say 60, the convergence can be achieved in no more than 20 iterations.

The improvement I made according to your reported error, xmis_loc[xmis_loc == 0] = 1, will not produce such error even when the convergence is not achieved as in your reported situation. You can re-install the latest version.

Thanks again for taking the time to report bugs!