satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
207 stars 33 forks source link

vars.to.regress Error in qr.resid #8

Closed dpcook closed 5 years ago

dpcook commented 5 years ago

Hi @ChristophH,

Thanks for putting this package together. It's been fantastic so far!

I can't seem to find the source of an error I'm getting.

I'm taking a large seurat object, splitting it into several smaller objects based and trying to process each individually with SCTransform(). When I include vars.to.regress=c("S.Score", "G2M.Score"), it goes works for all of the subsets except a single one. The first part of the function works fine, but when it starts regressing out the scores with ScaleData, I get the following error:

subset_seurat <- SCTransform(subset_seurat, batch_var="Mix", variable.features.n = 1500, vars.to.regress=c("S.Score", "G2M.Score"), return_gene_attr=TRUE)

Regressing out S.Score, G2M.Score | | 0%Error in qr.resid(qr = qr, y = data.expr[x, ]) : NA/NaN/Inf in foreign function call (arg 5)

Again, it worked fine for 11 other subsets. I've confirmed that S.Score and G2M.Score are numeric. I've also tried omitting vars.to.regress from the SCTransform() call and performed ScaleData(vars.to.regress=...) separately on the resultant SCT assay and it worked without an error.

Here's the full output from the SCTransform() call: image

Any thoughts? Thanks!

David

ChristophH commented 5 years ago

Hi David, I have not seen this error before. Could you post the output of calling traceback() right after the error occurs?

A separate call (after SCTransform) to ScaleData(object = subset_seurat, vars.to.regress=..., do.scale = FALSE, do.center = FALSE, scale.max = Inf) works?

dpcook commented 5 years ago

Yeah, it's pretty strange. Not sure what's happening.

Here's traceback() right after SCTransform:

7: qr.resid(qr = qr, y = data.expr[x, ])
6: qr.resid(qr = qr, y = data.expr[x, ])
5: RegressOutMatrix(data.expr = object, latent.data = latent.data, 
       features.regress = features, model.use = model.use, use.umi = use.umi, 
       verbose = verbose)
4: ScaleData.default(scale.data, features = NULL, vars.to.regress = vars.to.regress, 
       latent.data = cell.attr[, vars.to.regress, drop = FALSE], 
       model.use = "linear", use.umi = FALSE, do.scale = do.scale, 
       do.center = do.center, scale.max = Inf, block.size = 256, 
       min.cells.to.block = 3000, verbose = verbose)
3: ScaleData(scale.data, features = NULL, vars.to.regress = vars.to.regress, 
       latent.data = cell.attr[, vars.to.regress, drop = FALSE], 
       model.use = "linear", use.umi = FALSE, do.scale = do.scale, 
       do.center = do.center, scale.max = Inf, block.size = 256, 
       min.cells.to.block = 3000, verbose = verbose)
2: SCTransform(subset_seurat, batch_var = "Mix", variable.features.n = 1500, 
       vars.to.regress = c("S.Score", "G2M.Score"), return_gene_attr = TRUE) at #22
1: getSeuratSubset("OVCA420", "TNF")

But if I remove only the vars.to.regress() option, the function works fine. Here's the whole function I'm running for reference (with vars.to.regress commented out for now):

getSeuratSubset <- function(cell_line, treatment){
  print(paste("Subsetting data to get",cell_line,"cells, treated with", treatment, "..."))
  cell_filter <- colnames(seurat)[which(seurat$CellLine == cell_line &
                                          (as.numeric(seurat$Cluster)-1) %in% 
                                          dplyr::filter(dominant_clusters, 
                                                        CellLine==cell_line)$Cluster)]
  #That hacky first part of the last line is just because the as.numeric converts 0-4 to 1-5
  genes_filter <- rownames(seurat[["RNA"]]@data)[rowSums(as.matrix(seurat[["RNA"]]@data)[, cell_filter]) > 0]

  print("Subsetting data by cell line")
  subset_seurat <- subset(seurat, cells=cell_filter, features=genes_filter)

  print("Re-classifying cell cycle") #likely cleanest on pure cell line
  subset_seurat <- CellCycleScoring(subset_seurat, s.features = s.genes, 
                                    g2m.features = g2m.genes, set.ident = FALSE)

  print("Subsetting data by treatment")
  cell_filter <- colnames(subset_seurat)[which(subset_seurat$Treatment == treatment)]
  genes_filter <- rownames(subset_seurat[["RNA"]]@data)[rowSums(as.matrix(subset_seurat[["RNA"]]@data)[, cell_filter]) > 0]
  subset_seurat <- subset(subset_seurat, cells=cell_filter, features=genes_filter)

  print("Normalizing with SCTransform...")
  subset_seurat <- SCTransform(subset_seurat, batch_var="Mix",
                               variable.features.n = 1500,
                               #vars.to.regress=c("S.Score", "G2M.Score"),
                               return_gene_attr=TRUE)

  print("Running PCA...")
  subset_seurat <- RunPCA(subset_seurat, verbose=F)

  print("Clustering...")
  subset_seurat <- FindNeighbors(subset_seurat, dims=1:15)
  subset_seurat <- FindClusters(subset_seurat, resolution=0.1)

  print("Running UMAP...")
  subset_seurat <- RunUMAP(subset_seurat, dims=1:15)
}

SCTransform finishes with no obvious problems:

[1] "Normalizing with SCTransform..."
Calculating cell attributes for input UMI matrix
Variance stabilizing transformation of count matrix of size 16635 by 2903
Model formula is y ~ (log_umi) : Mix + Mix + 0
Get Negative Binomial regression parameters per gene
Using 2000 genes, 2903 cells
  |======================================================================| 100%
Found 99 outliers - those will be ignored in fitting/regularization step

Second step: Pearson residuals using fitted parameters for 16635 genes
  |======================================================================| 100%
Computing corrected UMI count matrix
  |======================================================================| 100%
Calculating gene attributes
Wall clock passed: Time difference of 1.574278 mins
Placing corrected UMI matrix in counts slot
Determine variable features
Set 1500 variable features
Scaling data matrix
  |======================================================================| 100%
Setting default assay to SCT

And the UMAP is sensible image

Then, if I take the scale.data slot from the SCT assay and re-scale it with vars.to.regress set and the rest of the options matching what's run in the SCTransform function, it finishes with no problems:

> scale.data <- ovca420_tnf[["SCT"]]@scale.data
> scale.data <- ScaleData(scale.data, features = NULL, vars.to.regress = c("S.Score", "G2M.Score"), 
+                         latent.data = ovca420_tnf@meta.data[, c("S.Score", "G2M.Score"), drop = FALSE], 
+                         model.use = "linear", use.umi = FALSE, do.scale = FALSE, 
+                         do.center = TRUE, scale.max = Inf, block.size = 256, 
+                         min.cells.to.block = 3000, verbose = TRUE)
Regressing out S.Score, G2M.Score
  |========================================================================| 100%
Scaling data matrix
  |========================================================================| 100%

Then I can continue with

ovca420_tnf <- SetAssayData(ovca420_tnf, slot="scale.data", new.data=scale.data)
ovca420_tnf <- RunPCA(ovca420_tnf, verbose=FALSE)
ovca420_tnf <- FindNeighbors(ovca420_tnf, dims=1:15)
ovca420_tnf <- FindClusters(ovca420_tnf, resolution=0.1)
ovca420_tnf <- RunUMAP(ovca420_tnf, dims=1:15)

And all looks good image

So I really don't know what's causing the error--it's interesting. Running the ScaleData on the scale.data slot after SCTransform seems valid right?

If it makes sense to do that, I'm happy to use proceed with that instead. If you're interested in trying to dig into the bug, I'm happy to share an RDS of the subset that's causing issues, but I also understand if you'd rather just hold off and see if it happens to others. Just let me know!

David

dpcook commented 5 years ago

Just an update

It looks like there are about ~3000 genes that become NA after the SCTransform call for this one dataset, but not the others. The genes are detectable (ie. are not all zeros), albeit poorly. Here's an example from one of the genes

range(as.matrix(ovca420_tnf[["RNA"]]@counts["WNT9A",]))
[1] 0 3

table(as.matrix(ovca420_tnf[["RNA"]]@counts["WNT9A",]) > 0)
FALSE  TRUE 
 2856    47 

table(is.na(as.matrix(ovca420_tnf[["RNA"]]@data["WNT9A",])))
FALSE 
 2903 

table(is.na(as.matrix(ovca420_tnf[["SCT"]]@counts["WNT9A",])))
TRUE 
2903 

If I remove these genes prior to running SCTransform, it runs without any problems even with vars.to.regress set.

ChristophH commented 5 years ago

Thanks for looking into this. 47 non-zero observations out of ca. 3K is not that low.

If you could share an example of you input and parameters, I will look into what exactly is going. Maybe it's a bug, and if it's data-specific I could at least show an appropriate warning.

dpcook commented 5 years ago

Happy to help!

Here's a Google Drive link to an RDS file (~3Gb) with the Seurat object for this particular subset. I saved the RDS immediately before the SCTransform() call that throws the error:

seurat_subset <- SCTransform(seurat_subset,  batch_var="Mix",
                               do.correct.umi=TRUE,
                               vars.to.regress=c("S.Score", "G2M.Score"),
                               variable.features.n = 1500,
                               return_gene_attr=T)

To provide context, I have 10x libraries that were a bunch of samples multiplexed together. The processing strategy was to take the full seurat object, subset it by cell line, perform cell cycle classification, subset it further to only have cells from a specific treatment group, and then normalize.

#Initial filtering
cell_filter <- colnames(seurat)[which(seurat$CellLine == "OVCA420" &
                                          (as.numeric(seurat$Cluster)-1) %in% 
                                          dplyr::filter(dominant_clusters, 
                                                        CellLine=="OVCA420")$Cluster)]
genes_filter <- rownames(seurat[["RNA"]]@data)[rowSums(as.matrix(seurat[["RNA"]]@data)[, cell_filter]) > 0]

ovca420 <- subset(seurat, cells=cell_filter, features=genes_filter)
#Cell cycle scoring on cell line data
ovca420 <- CellCycleScoring(ovca420, s.features = s.genes, 
                            g2m.features = g2m.genes, set.ident = FALSE)

#Subset down to treatment
cell_filter <- colnames(ovca420)[which(ovca420$Treatment == "TNF")]
genes_filter <- rownames(ovca420[["RNA"]]@data)[rowSums(as.matrix(ovca420[["RNA"]]@data)[, cell_filter]) > 0]
ovca420_tnf <- subset(ovca420, cells=cell_filter, features=genes_filter)

The gene filter was just to remove genes that were sufficiently detected in the whole dataset, but maybe not at all in this particular subset.

I should also mention that I do get "iteration limit reached" warnings through parts of the SCTransform() run, but I've seen that across subsets--not just this one.

Just let me know if you have any questions about it!

ChristophH commented 5 years ago

It looks like the problem is that some genes are not detected at all in some of the batches (Mix). For example, WNT9A is not detected in Mix1, and as a result no Mix1-specific model parameters are assigned during regularization. I'll think about a fix and how to warn the user.

Note that you would only need to include a batch.var term if you see strong batch effects, otherwise it may be better to not set that parameter at all. Also, see my reply to #3 regarding how the batch indicator variable in sctransform::vst does not replace an integration analysis as implemented in Seurat.

dpcook commented 5 years ago

Ah okay--that makes sense. I should have thought to look at levels across batch. Thanks for tracking that down! I appreciate it.

Without the batch.var option, I definitely see batch-associated patterns on the embeddings and setting batch.var seems to clean it up pretty well. The "Mix" variable corresponds to replicate plates of the same cell populations and treatments, some which were processed on different days. Considering that, I figured it made sense to include this as a covariate by setting batch.var. Do you think integration is more appropriate, or is it really just whichever works best?

Thanks again

ChristophH commented 5 years ago

I've pushed a fix addressing this issue. Now, if there are genes that are not detected in some of the batches, I assume a low mean and proceed with parameter regularization.

In your case, using batch_var might be sufficient. If you perform the analysis separately on batches and see the same results as on the combined set, then this indicates that a simple per-gene offset is enough to deal with your batch effects.

You can always compare to the integration analysis as outlined in the Seurat v3 vignette (that is not using sctransform, but working very well and tested on many datasets).