satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
206 stars 33 forks source link

Error: Feature names of counts matrix cannot be empty #44

Closed mariafiruleva closed 4 years ago

mariafiruleva commented 4 years ago

Hi @ChristophH @satijalab

Could you help me to solve this problem, please?

I've checked the source of error, and I see that one gene within _vst.out$umicorrected object inside SCTransform function has an empty name. However, before SCTransform running all feature names are correct.

RData can be downloaded in PanglaoDB.

Code:

library(dplyr)
library(functools)
library(ggplot2)
library(gridExtra)
library(Matrix)
library(sctransform)
library(Seurat)

## FUNCTIONS

add_metadata <- function(data) {
  mito.genes <-
    grep(pattern = "^Mt\\.|^MT\\.|^mt\\.|^Mt-|^MT-|^mt-",
         x = rownames(x = GetAssayData(object = data)),
         value = TRUE)
  percent.mito <-
    Matrix::colSums(GetAssayData(object = data, slot = "counts")[mito.genes, ]) /
    Matrix::colSums(GetAssayData(object = data, slot = "counts"))
  data[['percent.mito']] <- percent.mito
  data[['percent.mito_log10']] <- log10(data[['percent.mito']])
  data[['nCount_RNA_log10']] <- log10(data[['nCount_RNA']])
  data[['nFeature_RNA_log10']] <- log10(data[['nFeature_RNA']])
  data[['nCount_RNA_log2']] <- log2(data[['nCount_RNA']])
  data[['nFeature_RNA_log2']] <- log2(data[['nFeature_RNA']])
  data[['scaled_mito']] <- scale(percent.mito)
  data[['scaled_nCount_RNA']] <- scale(data[['nCount_RNA_log10']])
  data
}

get_conf_interval <- function(dataset, parameter) {
  left <- mean(dataset[[parameter]][[1]]) - qnorm(0.975)
  right <- mean(dataset[[parameter]][[1]]) + qnorm(0.975)
  return(c(left, right))
}

## GATHERING DATA TOGETHER

fdata <- get(load("/path/to/SRA701877_SRS3279685.sparse.RData"))
fdata@Dimnames[[1]] <-
  make.names(gsub(".ENS.*", "", fdata@Dimnames[[1]]), unique = T)
whole <- CreateSeuratObject(
  counts = fdata,
  min.cells = 2,
  min.features = 200,
  project = "SRA701877"
)
whole <- add_metadata(whole)

# FILTER MT CONTENT
whole <-
  subset(
    x = whole,
    subset = scaled_mito < get_conf_interval(whole, 'scaled_mito')[2]
  )

## NORMALIZATION
whole <-
  Seurat::SCTransform(
    whole,
    ncells=min(100000, ncol(whole)),
    vars.to.regress = c("percent.mito"),
    verbose = T,
    conserve.memory = T
  )

As a result, I receive the error:

Place corrected count matrix in counts slot Error: Feature names of counts matrix cannot be empty In addition: There were 50 or more warnings (use warnings() to see the first 50) Execution halted

SessionInfo:

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /scratch/opt/R/3.6.0/lib/R/lib/libRblas.so
LAPACK: /scratch/opt/R/3.6.0/lib/R/lib/libRlapack.so

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rlang_0.4.0       Seurat_3.1.0      sctransform_0.2.0 Matrix_1.2-17    
[5] gridExtra_2.3     ggplot2_3.2.1     functools_0.2.0   dplyr_0.8.3      

loaded via a namespace (and not attached):
 [1] tsne_0.1-3          nlme_3.1-139        bitops_1.0-6        RcppAnnoy_0.0.12   
 [5] RColorBrewer_1.1-2  httr_1.4.1          tools_3.6.0         backports_1.1.4    
 [9] R6_2.4.0            irlba_2.3.3         KernSmooth_2.23-15  uwot_0.1.3         
[13] lazyeval_0.2.2      colorspace_1.4-1    npsurv_0.4-0        withr_2.1.2        
[17] tidyselect_0.2.5    compiler_3.6.0      plotly_4.9.0        labeling_0.3       
[21] caTools_1.17.1.2    scales_1.0.0        lmtest_0.9-37       ggridges_0.5.1     
[25] pbapply_1.4-2       stringr_1.4.0       digest_0.6.20       R.utils_2.9.0      
[29] pkgconfig_2.0.2     htmltools_0.3.6     bibtex_0.4.2        htmlwidgets_1.3    
[33] rstudioapi_0.10     zoo_1.8-6           jsonlite_1.6        ica_1.0-2          
[37] gtools_3.8.1        R.oo_1.22.0         magrittr_1.5        Rcpp_1.0.2         
[41] munsell_0.5.0       ape_5.3             reticulate_1.12     lifecycle_0.1.0    
[45] R.methodsS3_1.7.1   stringi_1.4.3       gbRd_0.4-11         MASS_7.3-51.4      
[49] gplots_3.0.1.1      Rtsne_0.15          plyr_1.8.4          grid_3.6.0         
[53] parallel_3.6.0      gdata_2.18.0        listenv_0.7.0       ggrepel_0.8.1      
[57] crayon_1.3.4        lattice_0.20-38     cowplot_1.0.0       splines_3.6.0      
[61] SDMTools_1.1-221.1  zeallot_0.1.0       pillar_1.4.2        igraph_1.2.4.1     
[65] future.apply_1.3.0  reshape2_1.4.3      codetools_0.2-16    leiden_0.3.1       
[69] glue_1.3.1          lsei_1.2-0          metap_1.1           RcppParallel_4.4.3 
[73] data.table_1.12.2   vctrs_0.2.0         png_0.1-7           Rdpack_0.11-0      
[77] gtable_0.3.0        RANN_2.6.1          purrr_0.3.2         tidyr_1.0.0        
[81] future_1.14.0       assertthat_0.2.1    rsvd_1.0.2          survival_2.44-1.1  
[85] viridisLite_0.3.0   tibble_2.1.3        cluster_2.0.8       globals_0.12.4     
[89] fitdistrplus_1.0-14 ROCR_1.0-7

Best wishes, Maria

mariafiruleva commented 4 years ago

Source of my problem is a correct_counts function. If length of genes_bin vector inside for-loop equals to one (which means that genes_bin is the last gene in genes vector), all vectors inside for-loop are unnamed and y.res matrix has no rownames too. I'm not sure whether my actions are correct or not, but if I change the correct_counts function like this, I have no problems with SCTransform output.

correct_counts <- function(x, umi, cell_attr = x$cell_attr, show_progress = TRUE) {
  regressor_data_orig <- model.matrix(as.formula(gsub('^y', '', x$model_str)), cell_attr)
  # when correcting, set all latent variables to median values
  cell_attr[, x$arguments$latent_var] <- apply(cell_attr[, x$arguments$latent_var, drop=FALSE], 2, function(x) rep(median(x), length(x)))
  regressor_data <- model.matrix(as.formula(gsub('^y', '', x$model_str)), cell_attr)

  genes <- rownames(umi)[rownames(umi) %in% rownames(x$model_pars_fit)]
  bin_size <- x$arguments$bin_size
  bin_ind <- ceiling(x = 1:length(x = genes) / bin_size)
  max_bin <- max(bin_ind)
  if (show_progress) {
    message('Computing corrected UMI count matrix')
    pb <- txtProgressBar(min = 0, max = max_bin, style = 3)
  }
  #corrected_data <- matrix(NA_real_, length(genes), nrow(regressor_data), dimnames = list(genes, rownames(regressor_data)))
  corrected_data <- list()
  for (i in 1:max_bin) {
    genes_bin <- genes[bin_ind == i]
    coefs <- x$model_pars_fit[genes_bin, -1]
    theta <- x$model_pars_fit[genes_bin, 1]
    # get pearson residuals
    mu <- exp(tcrossprod(coefs, regressor_data_orig))
    variance <- mu + mu^2 / theta
    y <- as.matrix(umi[genes_bin, , drop=FALSE])
    pearson_residual <- (y - mu) / sqrt(variance)
    # generate output
    mu <- exp(tcrossprod(coefs, regressor_data))
    variance <- mu + mu^2 / theta
    y.res <- mu + pearson_residual * sqrt(variance)
    y.res <- round(y.res, 0)
    y.res[y.res < 0] <- 0
    if (length(theta) == 1) {
      rownames(y.res) <- genes_bin
    }
    corrected_data[[length(corrected_data) + 1]] <- as(y.res, Class = 'dgCMatrix')
    if (show_progress) {
      setTxtProgressBar(pb, i)
    }
  }
  if (show_progress) {
    close(pb)
  }
  corrected_data <- do.call(what = rbind, args = corrected_data)
  return(corrected_data)
}

I'm looking forward to your reply.

Best whishes, Maria

ChristophH commented 4 years ago

Hi Maria, Thank you for pointing out that bug. Should be fixed with commit 70be235 in develop.