ChristopherStephens21 commented 11 months ago

I am performing routine scRNAseq analysis on a single cell dataset using the scTransform framework. and have run into a weird issue. Briefly, running the code below exactly reproduces the results of the SCTransform vignette, with clear differences in the UMAP plot reflecting cell cycle or mitochondrial percentage regression (Here I regressed out mitochondrial gene percentage). However, when I run this script with my own data by substituting out the path name in line 1 with that for my own data, regressing out percent.mt or cell cycle score has no effect on the downstream UMAP or PCA plots. I'm not sure what to make of this, but my dataset is considerably larger, with around 9000 cells. I have also been able to reproduce these results with other datasets.

Example 1: With Regression

pbmc_data <- Seurat::Read10X(data.dir = FilePath) pbmc <- CreateSeuratObject(counts = pbmc_data) s.genes <- cc.genes$s.genes g2m.genes <- cc.genes$g2m.genes pbmc <- NormalizeData(pbmc) pbmc <- FindVariableFeatures(pbmc, selection.method = "vst") s.genes <- cc.genes.updated.2019$s.genes #Cell cycle markers loaded from Seurat s.genes <- sapply(s.genes, str_to_title) g2m.genes <- cc.genes.updated.2019$g2m.genes #Separating into S and G2M markers g2m.genes <- sapply(g2m.genes, str_to_title) pbmc <- CellCycleScoring(pbmc, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE) pbmc <- PercentageFeatureSet(pbmc, pattern = "^mt-", col.name = "percent.mt") pbmc <- SCTransform(pbmc, vars.to.regress = c("percent.mt", "S.Score", "G2M.Score"), vst.flavor = "v2", method = "glmGamPoi", verbose = T) pbmc <- RunPCA(pbmc, verbose = T) pbmc <- RunUMAP(pbmc, dims = 1:30, verbose = T) pbmc <- FindNeighbors(pbmc, dims = 1:30, verbose = T) pbmc <- FindClusters(pbmc, verbose = T)

Example 2: Without Regression

pbmc2 <- CreateSeuratObject(counts = pbmc_data) pbmc2 <- NormalizeData(pbmc2) pbmc2 <- FindVariableFeatures(pbmc2, selection.method = "vst") pbmc2 <- CellCycleScoring(pbmc2, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE) pbmc2 <- PercentageFeatureSet(pbmc2, pattern = "^mt-", col.name = "percent.mt") pbmc2 <- SCTransform(pbmc2, vst.flavor = "v2", method = "glmGamPoi", verbose = T) pbmc2 <- RunPCA(pbmc2, verbose = T) pbmc2 <- RunUMAP(pbmc2, dims = 1:30, verbose = T) pbmc2 <- FindNeighbors(pbmc2, dims = 1:30, verbose = T) pbmc2 <- FindClusters(pbmc2, verbose = T) Plot1 <- DimPlot(pbmc, label = TRUE) + ggtitle("With MT and CC Regression") Plot2 <- DimPlot(pbmc2, label = TRUE) + ggtitle("Without MT and CC Regression") Plot1 + Plot2 identical(pbmc@meta.data, pbmc2@meta.data)

pbmc_data <- Seurat::Read10X(data.dir = FilePath) pbmc <- CreateSeuratObject(counts = pbmc_data) s.genes <- cc.genes$s.genes g2m.genes <- cc.genes$g2m.genes pbmc <- NormalizeData(pbmc) Normalizing layer: counts Performing log-normalization 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| pbmc <- FindVariableFeatures(pbmc, selection.method = "vst") Finding variable features for layer counts Calculating gene variances 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| Calculating feature variances of standardized and clipped values 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| s.genes <- cc.genes.updated.2019$s.genes #Cell cycle markers loaded from Seurat s.genes <- sapply(s.genes, str_to_title) g2m.genes <- cc.genes.updated.2019$g2m.genes #Separating into S and G2M markers g2m.genes <- sapply(g2m.genes, str_to_title) pbmc <- CellCycleScoring(pbmc, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE) pbmc <- PercentageFeatureSet(pbmc, pattern = "^mt-", col.name = "percent.mt") pbmc <- SCTransform(pbmc, vars.to.regress = c("percent.mt", "S.Score", "G2M.Score"), vst.flavor = "v2", method = "glmGamPoi", verbose = T) Running SCTransform on assay: RNA Running SCTransform on layer: counts vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes. Variance stabilizing transformation of count matrix of size 22452 by 8924 Model formula is y ~ log_umi Get Negative Binomial regression parameters per gene Using 2000 genes, 5000 cells Found 93 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 22452 genes Computing corrected count matrix for 22452 genes Calculating gene attributes Wall clock passed: Time difference of 23.15094 secs Determine variable features Regressing out percent.mt, S.Score, G2M.Score |=================================================================================================================================================================================================================================================================================| 100% Centering data matrix |=================================================================================================================================================================================================================================================================================| 100% Getting residuals for block 1(of 2) for counts dataset Getting residuals for block 2(of 2) for counts dataset Centering data matrix |=================================================================================================================================================================================================================================================================================| 100% Finished calculating residuals for counts Set default assay to SCT

pbmc <- RunPCA(pbmc, verbose = T) PC 1 Positive: Col1a1, Col1a2, Col3a1, Tnfaip6, Dlk1, Cxcl1, Cxcl2, Meg3, Ccl2, Sparc Itm2a, Tnc, Cdkn1c, Mfap4, Lum, Gm32647, Fstl1, Aspn, Dcn, Mest Col6a1, Col5a2, Dlc1, Thbs1, Ebf1, Ccl7, Nrk, Igf2, Pcolce, Ccn1 Negative: Krt14, Krt15, Krt5, Krt1, Ly6d, Krt10, Lgals7, Cstdc5, Stfa3, Krt17 Krtdap, Perp, Dmkn, Stfa1, Sfn, Dsp, Srgn, Sox6, Grip1, Cpa3 Hdc, Cma1, Lgals3, Tpsb2, mt-Cytb, Calm4, mt-Atp6, Fcer1g, Bmpr1b, S100a14 PC 2 Positive: Krt14, Krt15, Krt1, Krt5, Ly6d, Krt10, Lgals7, Stfa3, Cstdc5, Krt17 Krtdap, Perp, Dmkn, Sfn, Dsp, Grip1, Fabp5, Sox6, Stfa1, Calm4 Cxcl14, S100a14, Bmpr1b, Fras1, Fgfbp1, Ptma, Tspear, Trp63, Dsc3, Serpinb5 Negative: Srgn, Fcer1g, Hdc, Cpa3, Cma1, Tpsb2, Serpinb1a, Tyrobp, Plek, Slc6a4 Csf2rb, Cxcl2, Gata2, Fxyd5, Mcpt4, C1qb, Calca, Slco2b1, Mrc1, Cd14 Cyp11a1, Pf4, C1qc, Slc18a2, C1qa, F13a1, Ccl7, Otulinl, Ccl4, Rac2 PC 3 Positive: C1qb, Pf4, Mrc1, C1qc, C1qa, F13a1, Cxcl2, Cd14, Ccl12, Cd86 Ccl4, C5ar1, Il1b, Csf1r, Apoe, Ccl7, Ccl3, Slfn2, Lyz2, Fcrls Lyve1, Selenop, Aoah, Ptprj, Ifi207, Nlrp3, Cd83, Hmox1, Ms4a6c, Ms4a7 Negative: Cpa3, Cma1, Hdc, Tpsb2, Serpinb1a, Slc6a4, Gata2, Srgn, Csf2rb, Calca Mcpt4, Cyp11a1, Slc18a2, Hs6st2, Rab27b, Itk, Slco2b1, Hs3st1, Fdx1, Il13 Otulinl, St6galnac3, Il4, Il1rl1, Fxyd5, Ctla2a, Mrgprb1, Csf2, Edil3, Adora3 PC 4 Positive: Krt1, Krt10, Ly6d, Stfa3, Cstdc5, Krtdap, Col1a1, Col1a2, Stfa1, Lgals7 Tnfaip6, Krt15, Dlk1, Perp, Fabp5, Dmkn, Lgals3, Krt14, Col3a1, Cxcl2 Krt5, Npl, Sfn, Itm2a, Cxcl1, Mfap4, Dsp, Dsg1a, Calm4, Krt16 Negative: Dct, Pmel, Ptgds, Tyr, Cpne4, Trps1, Pax3, Chsy3, Fmn1, Mlana Pde3a, Ednrb, Col23a1, Rarb, Robo2, Acta2, Ebf1, Aff3, Rgs5, Fstl4 Synpr, Ptprj, Prex2, Sobp, Ctnna3, Myh11, Myl9, Syt4, Tagln, Mcoln3 PC_ 5 Positive: Dlk1, Dct, Pmel, Ptgds, Tyr, Itm2a, Mest, Cpne4, Chsy3, Mfap4 Pax3, Nrk, Cdkn1c, Clec3b, Aspn, Postn, Mfap5, Fmn1, Acta2, Tmeff2 Mlana, Igfbp7, Dpysl3, Col14a1, Ednrb, Krt1, Rgs5, Igf2, Krt10, Myh11 Negative: Gm32647, Robo2, Trps1, Col3a1, Col1a1, Egfl6, Cntn4, Cxcl2, Col1a2, Lum Wif1, Slit3, Twist2, Ccl2, Grem2, Tmem132c, Tnc, Col23a1, Zfp536, Lsamp Twist1, Cped1, Rspo3, Lgals1, St6galnac3, Cntn5, Itih5, Pth1r, Hecw2, Crabp1 pbmc <- RunUMAP(pbmc, dims = 1:30, verbose = T) 16:41:19 UMAP embedding parameters a = 0.9922 b = 1.112 Found more than one class "dist" in cache; using the first, from namespace 'spam' Also defined by ‘BiocGenerics’ 16:41:19 Read 8924 rows and found 30 numeric columns 16:41:19 Using Annoy for neighbor search, n_neighbors = 30 Found more than one class "dist" in cache; using the first, from namespace 'spam' Also defined by ‘BiocGenerics’ 16:41:19 Building Annoy index with metric = cosine, n_trees = 50 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| 16:41:19 Writing NN index file to temp file /var/folders/r7/psn_5h9s3ql986wjn3b1sxj90yv6n8/T//RtmpQyNO0H/file147d22f35981e 16:41:19 Searching Annoy index using 1 thread, search_k = 3000 16:41:21 Annoy recall = 100% 16:41:21 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30 16:41:22 Initializing from normalized Laplacian + noise (using RSpectra) 16:41:22 Commencing optimization for 500 epochs, with 351934 positive edges Using method 'umap' 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| 16:41:30 Optimization finished pbmc <- FindNeighbors(pbmc, dims = 1:30, verbose = T) Computing nearest neighbor graph Computing SNN pbmc <- FindClusters(pbmc, verbose = T) Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 8924 Number of edges: 274391

Running Louvain algorithm... 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| Maximum modularity in 10 random starts: 0.9218 Number of communities: 25 Elapsed time: 0 seconds

pbmc2 <- CreateSeuratObject(counts = pbmc_data) pbmc2 <- NormalizeData(pbmc2) Normalizing layer: counts Performing log-normalization 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| pbmc2 <- FindVariableFeatures(pbmc2, selection.method = "vst") Finding variable features for layer counts Calculating gene variances 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| Calculating feature variances of standardized and clipped values 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| pbmc2 <- CellCycleScoring(pbmc2, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE) pbmc2 <- PercentageFeatureSet(pbmc2, pattern = "^MT-", col.name = "percent.mt") pbmc2 <- SCTransform(pbmc2, vst.flavor = "v2", method = "glmGamPoi", verbose = T) Running SCTransform on assay: RNA Running SCTransform on layer: counts vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes. Variance stabilizing transformation of count matrix of size 22452 by 8924 Model formula is y ~ log_umi Get Negative Binomial regression parameters per gene Using 2000 genes, 5000 cells Found 93 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 22452 genes Computing corrected count matrix for 22452 genes Calculating gene attributes Wall clock passed: Time difference of 22.24379 secs Determine variable features Centering data matrix |=================================================================================================================================================================================================================================================================================| 100% Getting residuals for block 1(of 2) for counts dataset Getting residuals for block 2(of 2) for counts dataset Centering data matrix |=================================================================================================================================================================================================================================================================================| 100% Finished calculating residuals for counts Set default assay to SCT

pbmc2 <- RunPCA(pbmc2, verbose = T) PC 1 Positive: Col1a1, Col1a2, Col3a1, Tnfaip6, Dlk1, Cxcl1, Cxcl2, Meg3, Ccl2, Sparc Itm2a, Tnc, Cdkn1c, Mfap4, Lum, Gm32647, Fstl1, Aspn, Dcn, Mest Col6a1, Col5a2, Dlc1, Thbs1, Ebf1, Ccl7, Nrk, Igf2, Pcolce, Ccn1 Negative: Krt14, Krt15, Krt5, Krt1, Ly6d, Krt10, Lgals7, Cstdc5, Stfa3, Krt17 Krtdap, Perp, Dmkn, Stfa1, Sfn, Dsp, Srgn, Sox6, Grip1, Cpa3 Hdc, Cma1, Lgals3, Tpsb2, mt-Cytb, Calm4, mt-Atp6, Fcer1g, Bmpr1b, S100a14 PC 2 Positive: Krt14, Krt15, Krt1, Krt5, Ly6d, Krt10, Lgals7, Stfa3, Cstdc5, Krt17 Krtdap, Perp, Dmkn, Sfn, Dsp, Grip1, Fabp5, Sox6, Stfa1, Calm4 Cxcl14, S100a14, Bmpr1b, Fras1, Fgfbp1, Ptma, Tspear, Trp63, Dsc3, Serpinb5 Negative: Srgn, Fcer1g, Hdc, Cpa3, Cma1, Tpsb2, Serpinb1a, Tyrobp, Plek, Slc6a4 Csf2rb, Cxcl2, Gata2, Fxyd5, Mcpt4, C1qb, Calca, Slco2b1, Mrc1, Cd14 Cyp11a1, Pf4, C1qc, Slc18a2, C1qa, F13a1, Ccl7, Otulinl, Ccl4, Rac2 PC 3 Positive: C1qb, Pf4, Mrc1, C1qc, C1qa, F13a1, Cxcl2, Cd14, Ccl12, Cd86 Ccl4, C5ar1, Il1b, Csf1r, Apoe, Ccl7, Ccl3, Slfn2, Lyz2, Fcrls Lyve1, Selenop, Aoah, Ptprj, Ifi207, Nlrp3, Cd83, Hmox1, Ms4a6c, Ms4a7 Negative: Cpa3, Cma1, Hdc, Tpsb2, Serpinb1a, Slc6a4, Gata2, Srgn, Csf2rb, Calca Mcpt4, Cyp11a1, Slc18a2, Hs6st2, Rab27b, Itk, Slco2b1, Hs3st1, Fdx1, Il13 Otulinl, St6galnac3, Il4, Il1rl1, Fxyd5, Ctla2a, Mrgprb1, Csf2, Edil3, Adora3 PC 4 Positive: Krt1, Krt10, Ly6d, Stfa3, Cstdc5, Krtdap, Col1a1, Col1a2, Stfa1, Lgals7 Tnfaip6, Krt15, Dlk1, Perp, Fabp5, Dmkn, Lgals3, Krt14, Col3a1, Cxcl2 Krt5, Npl, Sfn, Itm2a, Cxcl1, Mfap4, Dsp, Dsg1a, Calm4, Krt16 Negative: Dct, Pmel, Ptgds, Tyr, Cpne4, Trps1, Pax3, Chsy3, Fmn1, Mlana Pde3a, Ednrb, Col23a1, Rarb, Robo2, Acta2, Ebf1, Aff3, Rgs5, Fstl4 Synpr, Ptprj, Prex2, Sobp, Ctnna3, Myh11, Myl9, Syt4, Tagln, Mcoln3 PC_ 5 Positive: Dlk1, Dct, Pmel, Ptgds, Tyr, Itm2a, Mest, Cpne4, Chsy3, Mfap4 Pax3, Nrk, Cdkn1c, Clec3b, Aspn, Postn, Mfap5, Fmn1, Acta2, Tmeff2 Mlana, Igfbp7, Dpysl3, Col14a1, Ednrb, Krt1, Rgs5, Igf2, Krt10, Myh11 Negative: Gm32647, Robo2, Trps1, Col3a1, Col1a1, Egfl6, Cntn4, Cxcl2, Col1a2, Lum Wif1, Slit3, Twist2, Ccl2, Grem2, Tmem132c, Tnc, Col23a1, Zfp536, Lsamp Twist1, Cped1, Rspo3, Lgals1, St6galnac3, Cntn5, Itih5, Pth1r, Hecw2, Crabp1 pbmc2 <- RunUMAP(pbmc2, dims = 1:30, verbose = T) 16:42:27 UMAP embedding parameters a = 0.9922 b = 1.112 Found more than one class "dist" in cache; using the first, from namespace 'spam' Also defined by ‘BiocGenerics’ 16:42:27 Read 8924 rows and found 30 numeric columns 16:42:27 Using Annoy for neighbor search, n_neighbors = 30 Found more than one class "dist" in cache; using the first, from namespace 'spam' Also defined by ‘BiocGenerics’ 16:42:27 Building Annoy index with metric = cosine, n_trees = 50 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| 16:42:27 Writing NN index file to temp file /var/folders/r7/psn_5h9s3ql986wjn3b1sxj90yv6n8/T//RtmpQyNO0H/file147d26795f1c8 16:42:27 Searching Annoy index using 1 thread, search_k = 3000 16:42:29 Annoy recall = 100% 16:42:29 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30 16:42:30 Initializing from normalized Laplacian + noise (using RSpectra) 16:42:30 Commencing optimization for 500 epochs, with 351934 positive edges Using method 'umap' 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| 16:42:38 Optimization finished pbmc2 <- FindNeighbors(pbmc2, dims = 1:30, verbose = T) Computing nearest neighbor graph Computing SNN pbmc2 <- FindClusters(pbmc2, verbose = T) Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 8924 Number of edges: 274391

Running Louvain algorithm... 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| Maximum modularity in 10 random starts: 0.9218 Number of communities: 25 Elapsed time: 0 seconds

Plot1 <- DimPlot(pbmc, label = TRUE) + ggtitle("With MT and CC Regression") Plot2 <- DimPlot(pbmc2, label = TRUE) + ggtitle("Without MT and CC Regression") Plot1 + Plot2 identical(pbmc@meta.data, pbmc2@meta.data) [1] TRUE
# insert reproducible example here

ChristopherStephens21 commented 11 months ago

I am running the analysis on a Mac (MacOS Ventura 13.3.1). Code for cell cycle regression and mitochondrial gene percentage was changed from the pbmc code to reflect differences in gene name formatting for my mouse datasets, but no other changes were made between running it on the pbmc data and my own.

saketkc commented 11 months ago

Thanks for catching this @ChristopherStephens21, I will push a fix.

saketkc commented 10 months ago

I have pushed a fix here https://github.com/satijalab/seurat/commit/64a6495ec17759d2a7513c706871b9167dece01c You can install the develop branch to test:

remotes::install_github("sataijalab/seurat", ref="develop")

Hope this helps!

WayGW commented 10 months ago

I was having similar observations to the original post. However, after installing the developer version it appears that it no longer is able to grab the values from the meta.data as I get the following message.

pbmc <- SCTransform(pbmc, vars.to.regress = c("percent.mt"), vst.flavor = "v2", method = "glmGamPoi", verbose = T) Running SCTransform on assay: RNA Running SCTransform on layer: counts vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes. Variance stabilizing transformation of count matrix of size 19037 by 9723 Model formula is y ~ log_umi Get Negative Binomial regression parameters per gene Using 2000 genes, 5000 cells Found 233 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 19037 genes Computing corrected count matrix for 19037 genes Calculating gene attributes Wall clock passed: Time difference of 49.41705 secs Determine variable features Regressing out percent.mt |============================================================================================================| 100% Centering data matrix |============================================================================================================| 100% Getting residuals for block 1(of 2) for counts dataset Getting residuals for block 2(of 2) for counts dataset Error: None of the requested variables to regress are present in the object.

Bildungsluecke commented 9 months ago

Hi, I'm having the same issue as @Greysun109 Regression of mitochondrial gene content by SCTransform using vars.to.regress="percent.mito" leads to the same UMAP as without regression.

Unfortunately, installing develop branch as suggested by @saketkc did not resolve the issue.

I'm also getting the same error message as @Greysun109 (Error: None of the requested variables to regress are present in the object.) although the variable is included in the metadata slot of the seurat object.

I'm on macOS Ventura 13.6.1

CroixJeremy2 commented 9 months ago

Hello, I am having the same issue as @Bildungsluecke . I have installed your version using

remotes::install_github("sataijalab/seurat", ref="develop")

but I have this error at the end of SCTransform() calculations:

Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset
Error: None of the requested variables to regress are present in the object.

Before your version, I had the same problem as @ChristopherStephens21 , same UMAP from regressed out samples running SCTransform()

Here is my sessionInfo() if it can be useful:

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] C

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DoubletFinder_2.0.3 dplyr_1.1.4         MetaDE_2.2.3       
 [4] glmGamPoi_1.14.0    presto_1.0.0        data.table_1.14.10 
 [7] Rcpp_1.0.12         scCustomize_2.0.1   ggplot2_3.4.4      
[10] Seurat_5.0.1.9003   SeuratObject_5.0.1  sp_2.1-2           

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22            splines_4.3.1              
  [3] later_1.3.2                 bitops_1.0-7               
  [5] tibble_3.2.1                polyclip_1.10-6            
  [7] janitor_2.2.0               fastDummies_1.7.3          
  [9] lifecycle_1.0.4             edgeR_4.0.12               
 [11] hdf5r_1.3.9                 globals_0.16.2             
 [13] lattice_0.22-5              MASS_7.3-60.0.1            
 [15] magrittr_2.0.3              openxlsx_4.2.5.2           
 [17] limma_3.58.1                plotly_4.10.4              
 [19] httpuv_1.6.14               sctransform_0.4.1          
 [21] spam_2.10-0                 zip_2.3.1                  
 [23] spatstat.sparse_3.0-3       reticulate_1.34.0          
 [25] cowplot_1.1.3               pbapply_1.7-2              
 [27] RColorBrewer_1.1-3          lubridate_1.9.3            
 [29] abind_1.4-5                 zlibbioc_1.48.0            
 [31] Rtsne_0.17                  GenomicRanges_1.54.1       
 [33] purrr_1.0.2                 BiocGenerics_0.48.1        
 [35] RCurl_1.98-1.14             circlize_0.4.16            
 [37] GenomeInfoDbData_1.2.11     IRanges_2.36.0             
 [39] S4Vectors_0.40.2            ggrepel_0.9.5              
 [41] irlba_2.3.5.1               listenv_0.9.1              
 [43] spatstat.utils_3.0-4        goftest_1.2-3              
 [45] RSpectra_0.16-1             spatstat.random_3.2-2      
 [47] fitdistrplus_1.1-11         parallelly_1.36.0          
 [49] DelayedMatrixStats_1.24.0   leiden_0.4.3.1             
 [51] codetools_0.2-19            DelayedArray_0.28.0        
 [53] tidyselect_1.2.0            shape_1.4.6                
 [55] farver_2.1.1                matrixStats_1.2.0          
 [57] stats4_4.3.1                spatstat.explore_3.2-5     
 [59] jsonlite_1.8.8              ellipsis_0.3.2             
 [61] progressr_0.14.0            ggridges_0.5.6             
 [63] survival_3.5-7              tools_4.3.1                
 [65] ica_1.0-3                   glue_1.7.0                 
 [67] gridExtra_2.3               SparseArray_1.2.3          
 [69] DESeq2_1.42.0               MatrixGenerics_1.14.0      
 [71] GenomeInfoDb_1.38.5         withr_3.0.0                
 [73] combinat_0.0-8              fastmap_1.1.1              
 [75] fansi_1.0.6                 digest_0.6.34              
 [77] timechange_0.3.0            R6_2.5.1                   
 [79] mime_0.12                   ggprism_1.0.4              
 [81] samr_3.0                    colorspace_2.1-0           
 [83] scattermore_1.2             tensor_1.5                 
 [85] spatstat.data_3.0-4         utf8_1.2.4                 
 [87] tidyr_1.3.1                 generics_0.1.3             
 [89] httr_1.4.7                  htmlwidgets_1.6.4          
 [91] S4Arrays_1.2.0              GSA_1.03.2                 
 [93] uwot_0.1.16                 pkgconfig_2.0.3            
 [95] gtable_0.3.4                lmtest_0.9-40              
 [97] impute_1.76.0               XVector_0.42.0             
 [99] htmltools_0.5.7             dotCall64_1.1-1            
[101] scales_1.3.0                Biobase_2.62.0             
[103] png_0.1-8                   snakecase_0.11.1           
[105] rstudioapi_0.15.0           reshape2_1.4.4             
[107] nlme_3.1-164                zoo_1.8-12                 
[109] GlobalOptions_0.1.2         stringr_1.5.1              
[111] KernSmooth_2.23-22          parallel_4.3.1             
[113] miniUI_0.1.1.1              vipor_0.4.7                
[115] ggrastr_1.0.2               pillar_1.9.0               
[117] grid_4.3.1                  vctrs_0.6.5                
[119] RANN_2.6.1                  promises_1.2.1             
[121] shinyFiles_0.9.3            xtable_1.8-4               
[123] cluster_2.1.6               beeswarm_0.4.0             
[125] paletteer_1.6.0             locfit_1.5-9.8             
[127] cli_3.6.2                   compiler_4.3.1             
[129] rlang_1.1.3                 crayon_1.5.2               
[131] future.apply_1.11.1         labeling_0.4.3             
[133] rematch2_2.1.2              plyr_1.8.9                 
[135] forcats_1.0.0               fs_1.6.3                   
[137] ggbeeswarm_0.7.2            stringi_1.8.3              
[139] BiocParallel_1.36.0         viridisLite_0.4.2          
[141] deldir_2.0-2                munsell_0.5.0              
[143] lazyeval_0.2.2              spatstat.geom_3.2-8        
[145] Matrix_1.6-5                RcppHNSW_0.5.0             
[147] patchwork_1.2.0             sparseMatrixStats_1.14.0   
[149] bit64_4.0.5                 future_1.33.1              
[151] statmod_1.5.0               shiny_1.8.0                
[153] SummarizedExperiment_1.32.0 ROCR_1.0-11                
[155] igraph_1.6.0                bit_4.0.5

EDIT (I have found clues about how to solve the problem!):

It is because in the function SCTransform(), the default option ncell = 5000 split my dataset in two at some point because with verbose = TRUE, I can see:

Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset

Indeed, my dataset is composed of 5640 cells, thus using the default option ncell = 5000 might produce some weird calculations maybe? Therefore I have tried to use ncell = 5640, in addition to vars.to.regress = "mtRNA.percent", and it works fine (I have a different UMAP indicating that regressing out mitochondrial RNA worked). Plus, I don't have the two following lines:

Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset

indicating that the calculations are at some point splitted in two I guess.

Remark 1: I have updated back my Seurat package to use the current version of the function (not the one you sent us with remotes::install_github("sataijalab/seurat", ref="develop") ), and the problem can be solved using ncell = your_number_of_cells_in_your_dataset

Remark 2: It seems that without any regression, ncell can be set to less than 5000 (I tried 5640/2), and the output is OK. Therefore, there is a problem with ncell only when using vars.to.regress. I hope it will help troubleshooting the function.

In any case, @saketkc , I think you should update the SCTransform function about this ncell option; either by setting it by default to the number of cells in the sample, or by updating the calculations that I know nothing about. Plus, you should talk about ncell in your tutorial vignette, because I didn't see it anywhere. Otherwise, many people will have the same issue...

Thanks again for your amazing work, Best regards,

kew24 commented 8 months ago

To add to this – I was seeing the same original issue of vars.to.regress not changing anything with Seurat_5.0.0. Can confirm that (for whatever reason) adding ncells = ncol(seurat_object) seemed to do the trick, as now my UMAP coordinates & clusters are different when I use different variables for vars.to.regress (thanks, @CroixJeremy2!).

Never installed from develop so I can't speak into the Error: None of the requested variables to regress are present in the object error. I've been using Seurat_5.0.0 this whole time.

satijalab / seurat

SCTransform not regressing out variables - Seurat v5.0.1 #8148

Example 1: With Regression

Example 2: Without Regression

EDIT (I have found clues about how to solve the problem!):