kreutz-lab / DIMAR

Data-driven selction of an imputation algorithm in R
4 stars 0 forks source link

Error in stats::glm.fit #3

Closed aretaon closed 2 years ago

aretaon commented 2 years ago

Hi,

I am trying to use DIMAR on a data set with 30 cols, 7800 rows of log-transformed intensities and around 30% missing values for most columns (60% for some columns). While this works well for the top 900 rows (with <10% missing values), I get the following error for the linear model when I process the whole data set:

Error in stats::glm.fit(design$X, design$y, family = stats::binomial()) : NA/NaN/Inf in 'x' Calls: dimaFunction -> -> Execution halted

As far as I understand, DIMA learns the patterns of MVs in the data set and therefore should be agnostic to factors such as the presence of MVs as such and the variance of the data points. Is that correct?

For clarity, dimaFunction is my rewrite of https://github.com/kreutz-lab/DIMAR/blob/main/R/dimar.R to get access to the performance table:

dimaFunction <- function(dfv) {

  # default args
  methods <- strsplit(x = args[5], split = ',')[[1]]
  npat <- args[6]
  group <- strsplit(x = args[4], split = ',')[[1]]
  performanceMetric <- args[7]

  print(paste0('args: ', format(args)))

  mtx <- as.matrix(dfv)

  print(noquote('colnames:'))
  print(format(colnames(mtx)))

  print(noquote('group :'))
  print(format(group))

  # from here: copied from
  # https://github.com/kreutz-lab/DIMAR/blob/b8e9497cd490a464bf46ebc177efd7e34f064723/R/dimar.R

  if (!group[1] == 'cluster') {
    groupidx <- rep(0L,ncol(mtx))
    groupidx[grepl(group[1],colnames(mtx))] <- 1
    groupidx[grepl(group[2],colnames(mtx))] <- 2
    group <- groupidx
  }

  mtx <- DIMAR::dimarMatrixPreparation(mtx, nacut = 2)

  coef <- DIMAR::dimarLearnPattern(mtx)
  ref <- DIMAR::dimarConstructReferenceData(mtx)
  sim <- DIMAR::dimarAssignPattern(ref, coef, mtx, npat)

  Imputations <- DIMAR::dimarDoImputations(sim, methods)
  Performance <- DIMAR::dimarEvaluatePerformance(Imputations, ref, sim, performanceMetric, TRUE, group)
  Imp <- DIMAR::dimarDoOptimalImputation(mtx, rownames(Performance))

  dfv <- as.data.frame(Imp)
  dfv$UID <- rownames(dfv)
  write.table(dfv, output, sep='\t')
  write.table(Performance, paste(str_sub(output, end=-5), '_performance.csv', sep=""), sep='\t')
}

Could you point me in a direction where this error might come from?

Thanks a lot! Julian

JanineEgert commented 2 years ago

Hi Julian, thanks for the comment. The error seems to come from the R glm.fit function. My first guess would be to look if there are any Inf or Nan values in your data set (may be due to logging your data). My second guess is that the logistic regression could run into convergence problem. Maybe you can find out a short example of the occuring error, then I can have a look at it. Best, Janine

aretaon commented 2 years ago

Hi Janine, thanks for coming back to this. I just repeated the error with another smaller data set. This data actually contains NaNs (<12% MVs per column) as I want to perform imputation on them. However, I thought that this is not a problem as DIMAR::dimarLearnPattern takes care that not the actual NaN values are supplied to glm.fit but rather their positions encoded through dimarConstructDesignMatrix. Is that correct?

Interestingly, DIMA runs fine when I use the first or last 1000 rows of the dataframe. However, it breaks once I use 1001 rows, possibly pointing at the case distinction in DIMAR::dimarLearnPattern. Do you have any idea how I can investigate this closer?

Thanks for your help, Julian

Summary of the variable mtx:

 log10_LFQ.intensity.Cneg.8h log10_LFQ.intensity.Cneg.16h
 Min.   :5.215               Min.   :5.425               
 1st Qu.:7.126               1st Qu.:7.122               
 Median :7.522               Median :7.521               
 Mean   :7.581               Mean   :7.582               
 3rd Qu.:7.995               3rd Qu.:8.004               
 Max.   :9.908               Max.   :9.898               
 NA's   :333                 NA's   :331                 
 log10_LFQ.intensity.Cneg.1day log10_LFQ.intensity.Cneg.2days
 Min.   :5.424                 Min.   :5.525                 
 1st Qu.:7.125                 1st Qu.:7.134                 
 Median :7.518                 Median :7.515                 
 Mean   :7.577                 Mean   :7.576                 
 3rd Qu.:7.989                 3rd Qu.:7.971                 
 Max.   :9.898                 Max.   :9.864                 
 NA's   :341                   NA's   :304                   
 log10_LFQ.intensity.Cneg.3days log10_LFQ.intensity.COX1.KD.8h
 Min.   :5.519                  Min.   :5.599                 
 1st Qu.:7.129                  1st Qu.:7.127                 
 Median :7.510                  Median :7.517                 
 Mean   :7.577                  Mean   :7.583                 
 3rd Qu.:7.965                  3rd Qu.:7.990                 
 Max.   :9.816                  Max.   :9.927                 
 NA's   :307                    NA's   :376                   
 log10_LFQ.intensity.COX1.KD.16h log10_LFQ.intensity.COX1.KD.1day
 Min.   :5.475                   Min.   :5.633                   
 1st Qu.:7.142                   1st Qu.:7.196                   
 Median :7.523                   Median :7.578                   
 Mean   :7.588                   Mean   :7.642                   
 3rd Qu.:7.999                   3rd Qu.:8.038                   
 Max.   :9.875                   Max.   :9.972                   
 NA's   :357                     NA's   :603                     
 log10_LFQ.intensity.COX1.KD.2days log10_LFQ.intensity.COX1.KD.3days
 Min.   :5.524                     Min.   :5.567                    
 1st Qu.:7.195                     1st Qu.:7.204                    
 Median :7.569                     Median :7.565                    
 Mean   :7.632                     Mean   :7.627                    
 3rd Qu.:8.025                     3rd Qu.:8.008                    
 Max.   :9.863                     Max.   :9.935                    
 NA's   :460                       NA's   :453             

As an example, the first few rows/cols of mtx (returned by print(format(mtx)) ) look like this:

[1] mtx:
     log10_LFQ.intensity.Cneg.8h log10_LFQ.intensity.Cneg.16h
1    "7.073425"                  "6.942633"                  
2    "7.734840"                  "7.752218"                  
3    "7.350345"                  "7.440311"                  
4    "7.595022"                  "7.718734"                  
5    "6.846158"                  "      NA"                  
6    "8.460251"                  "8.401659"                  
7    "7.416923"                  "7.471526"                  
8    "7.489874"                  "7.540792"                  
9    "8.015653"                  "7.998063"                  
10   "7.573382"                  "      NA"                  
11   "7.444310"                  "7.260548"                  
12   "7.798450"                  "7.718659"                  
13   "7.589324"                  "7.828428"                  
14   "7.354012"                  "7.112538"                  
15   "6.231419"                  "      NA"                  
16   "7.172924"                  "6.931920"                  
17   "6.639905"                  "6.715502"                  
18   "6.799085"                  "6.770609"                  
19   "7.046651"                  "7.086289"                  
20   "      NA"                  "      NA"                  
21   "7.765177"                  "7.889828"                  
22   "6.652701"                  "6.756347"                  
23   "7.329601"                  "7.329744"                  
24   "7.067852"                  "7.065244"                  
25   "6.713440"                  "6.794955"                  
26   "7.188844"                  "7.095971"                  
27   "6.305803"                  "      NA"           
aretaon commented 2 years ago

Hi Janine, I investigated the problem with glm.fit a bit closer and I think I found the culprit. As suspected, the case distinction in DIMAR::dimarLearnPattern caused the problem as a matrix with an odd number of rows (e.g. 1001 row) will be divided into two even subset (e.g. npersub=501) rows.

  if (nrow(mtx) > 1000) {
    nsub <- ceiling(nrow(mtx)/1000)
    indrand <- sample(1:nrow(mtx), nrow(mtx))
    npersub <- ceiling(nrow(mtx)/nsub)
  }

Within the subset loop, npersup is used for indexing and during the second iteration with my example data set, indrand will be indexed up to 1002 (which does not exist). Therefore, R will include a single NA value in the indices which messes up glm.fit later.

  if (nsub > 1) {
    ind <- indrand[(npersub * (i - 1) + 1):(npersub * 
      i)]
  }

I wrote a small test script (see below) in which I changed the code of DIMAR::dimarLearnPattern such that the row indices will not exceed the maximum rows. I also merged the resulting fit coefficients of different lengths by setting the vector length to the longer vector. This creates a NA value for the shorter vector which is not considered for the colMean anyway and is removed for the sort function. It would be great if you could double check my code (I also attach the the data subset input.csv used for testing) and include it into DIMAR::dimarLearnPattern if you think this makes sense.

Cheers

Julian

df <- read.table("input.csv", sep='\t', header=TRUE)
rownames(df) <- df$UID
dfv <- df[,-which(names(df) %in% c("UID"))]
mtx <- as.matrix(dfv)

mtx <- DIMAR::dimarMatrixPreparation(mtx, nacut = 2)

dimarLearnPatternNew <- function (mtx) 
{
  if (nrow(mtx) > 1000) {
    nsub <- ceiling(nrow(mtx)/1000)
    indrand <- sample(1:nrow(mtx), nrow(mtx))
    npersub <- ceiling(nrow(mtx)/nsub)
  } else {
    nsub <- 1
    ind <- 1:nrow(mtx)
  }

  for (i in 1:nsub) {
    if (nsub > 1) {
      beginrange <- (npersub * (i - 1) + 1)
      endrange <- (npersub * i)

      if (endrange > length(indrand)) {
        endrange <- length(indrand)
      }

      ind <- indrand[beginrange:endrange]
    }

    design <- DIMAR::dimarConstructDesignMatrix(mtx[ind, ])
    design <- DIMAR::dimarConstructRegularizationMatrix(design)
    fit <- stats::glm.fit(design$X, design$y, family = stats::binomial())

    if (i == 1) {
      coef <- stats::coefficients(fit)
      firstdesign <- design
    } else {
      coef_tmp <- stats::coefficients(fit)
      length(coef_tmp) <- max(length(coef_tmp), length(coef))
      coef <- rbind(coef, coef_tmp)
    }

  }

  if (nsub > 1) {
    coef <- c(colMeans(coef[, firstdesign$Xtype != 3]), sort(coef[,firstdesign$Xtype == 3], na.last = NA))
  }

  print("Pattern of MVs is learned by logistic regression.")
  return(coef)
}

coef <- dimarLearnPatternNew(mtx)
JanineEgert commented 2 years ago

Hi Julian, thanks for your effort. True, the error message results from an indexing problem when dividing the data into smaller data sets (for a faster computation of the logistic regression). I've corrected the bug in the source code. Your attempt also seems to work. Best, Janine