satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
210 stars 33 forks source link

Error in base::tcrossprod(x, y) #30

Closed klprint closed 4 years ago

klprint commented 5 years ago

Hi! I have a dataset with three replicates and wanted to use scTransform. Unfortunately, as soon as I am adding the batch_var parameter, I get the error below, otherwise it is working, but still separates my batches on the UMAP.

Any suggestions?


n <- sctransform::vst(so@assays$RNA@counts, 
                                      cell_attr = so@meta.data, 
                                      batch_var = "orig.ident") 

Output:

Calculating cell attributes for input UMI matrix Variance stabilizing transformation of count matrix of size 15204 by 14123 
Model formula is y ~ (log_umi) : orig.ident + orig.ident + 0 
Get Negative Binomial regression parameters per gene Using 2000 genes, 14123 cells   \|=====================================================================================================================================\| 100% 

Second step: Get residuals using fitted parameters for 15204 genes   \|                                                                                                                                     \|   0%

Error in base::tcrossprod(x, y) : non-conformable arguments In addition: There were 50 or more warnings (use warnings() to see the first 50)
--

Traceback:

> traceback()
4: base::tcrossprod(x, y)
3: tcrossprod(model_pars_final[genes_bin, -1, drop = FALSE], regressor_data_final)
2: tcrossprod(model_pars_final[genes_bin, -1, drop = FALSE], regressor_data_final)
1: sctransform::vst(so@assays$RNA@counts, cell_attr = so@meta.data, 
       batch_var = "orig.ident")
ChristophH commented 5 years ago

Hello, this is hard to debug without a reproducible example. If you could provide the count matrix and the meta data, I will look into this error. In the meantime you could normalize each batch separately and then merge the data, or consider running an integration analysis (e.g as outlined in this Seurat vignette.

gokceneraslan commented 4 years ago

Dear @ChristophH ,

I think it's reproducible with these two files and the following code:

library(sctransform)

counts = read.csv('sct-debug-counts.csv.gz', row.names=1)
meta = read.csv('sct-debug-obs.csv.gz', row.names=1)
mat = as.matrix(t(counts))

vst_out = vst(mat,
    cell_attr=meta,
    batch_var=NULL,
    latent_var=c('log_umi', 'prep'),
    residual_type='pearson',
    return_cell_attr=T,
    min_cells=5,
    show_progress=T,
)

res_var = get_residual_var(vst_out, mat)
corrected = correct_counts(vst_out, mat)
res = vst_out$y

sct-debug-obs.csv.gz sct-debug-counts.csv.gz

Output is:

Calculating cell attributes for input UMI matrix

Variance stabilizing transformation of count matrix of size 1605 by 3744

Model formula is y ~ log_umi + prep

Get Negative Binomial regression parameters per gene

Using 1605 genes, 3744 cells

  |                                                                      |   0%

Warning message in theta.ml(y = y, mu = fit$fitted):
“iteration limit reached”
Warning message in theta.ml(y = y, mu = fit$fitted):
“iteration limit reached”
Warning message in theta.ml(y = y, mu = fit$fitted):
“iteration limit reached”
Warning message in theta.ml(y = y, mu = fit$fitted):
“iteration limit reached”
Warning message in theta.ml(y = y, mu = fit$fitted):
...

“iteration limit reached”
Warning message in theta.ml(y = y, mu = fit$fitted):
“iteration limit reached”

  |======================================================================| 100%

Found 140 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 1605 genes

  |======================================================================| 100%

Calculating gene attributes

Wall clock passed: Time difference of 55.74316 secs

Calculating variance for residuals of type pearson for 1605 genes

  |======================================================================| 100%

Warning message in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
“argument is not numeric or logical: returning NA”
Warning message in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
“argument is not numeric or logical: returning NA”
Computing corrected UMI count matrix

  |                                                                      |   0%

Error in base::tcrossprod(x, y): non-conformable arguments
Traceback:

1. correct_counts(vst_out, mat)
2. tcrossprod(coefs, regressor_data)
3. tcrossprod(coefs, regressor_data)
4. base::tcrossprod(x, y)
saketkc commented 4 years ago

@gokceneraslan Your prep appears to be a batch variable? In which case it should be modeled as such:

vst_out = vst(mat,
    cell_attr=meta,
    latent_var=c('log_umi'),
    batch_var=c('prep'),
    residual_type='pearson',
    return_cell_attr=T,
    min_cells=5,
    show_progress=T,
)
gokceneraslan commented 4 years ago

@saketkc It is not a batch variable but it's categorical. Do you mean we're not allowed to use categorical variables in the "latent_var" argument? If so, why is that? We can regress out any variable as long as it doesn't make the matrix singular, no? I thought batch_var just makes it an interaction term with latent_vars, so how come making it an interaction term solves the problem here?

ChristophH commented 4 years ago

@gokceneraslan Given your input data I can run the vst command without errors. plot_model_pars(vst_out) shows: image

But the correct_counts command fails

Computing corrected UMI count matrix
  |                                                                                               |   0%
Error in base::tcrossprod(x, y) : non-conformable arguments
In addition: Warning messages:
1: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :

 Error in base::tcrossprod(x, y) : non-conformable arguments 
4. base::tcrossprod(x, y) 
3. tcrossprod(coefs, regressor_data) 
2. tcrossprod(coefs, regressor_data) at denoise.R#176
1. correct_counts(vst_out, mat) 

vst() accepts categorical variables as latent_var. Whether this makes sense, since the parameters are regularized (see plot above), is a different question and depends on what the variable encodes. There is also a latent_var_nonreg parameter where regularization is not applied.

Note what correct_counts does:

Correct data by setting all latent factors to their median values and reversing the regression model

This was designed with only continuous variables in mind. I will look into how I should handle cases with categorical variables.

For now, you could do this:

vst_out = vst(mat,
              cell_attr=meta,
              batch_var=NULL,
              latent_var=c('log_umi'),
              latent_var_nonreg =c('prep'),
              residual_type='pearson',
              return_cell_attr=T,
              min_cells=5,
              show_progress=T,
)

And correct_counts should work.

gokceneraslan commented 4 years ago

Thanks so much @ChristophH ! Now I understand the reason behind the error. For now I think you can simply print an error if a categorical variable is passed to correct_counts, it's not a solution but it'd still be much better than the cryptic error message.