satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

FindClusters returns no error but all cluster memberships are NA (on integrated data) #3520

Closed: micdonato closed this issue 3 years ago

micdonato commented 4 years ago

Hi all,

I am integrating three datasets, and after integration I want to find cell clusters. However, after I do that, all my integrated_snn_res.* memberships are NA.

This is the code I am using:

###### Integration (Seurat v3 workflow) #######

library(Seurat)
library(data.table)
library(magrittr)

### load seurat object
srt_dataset1=readRDS("/labs/khatrilab/hongzheng/scRNAseq/robj/srt_dataset1.rds")
srt_dataset2=readRDS("/labs/khatrilab/hongzheng/scRNAseq/robj/srt_dataset2.rds")
srt_dataset3=readRDS("/labs/khatrilab/hongzheng/scRNAseq/robj/srt_dataset3.rds")

### SCT transform
srt_dataset1 <- SCTransform(srt_dataset1,variable.features.n=5000, vars.to.regress = "pct.mt")
srt_dataset2 <- SCTransform(srt_dataset2,variable.features.n=5000, vars.to.regress = "pct.mt")
srt_dataset3 <- SCTransform(srt_dataset3,variable.features.n=5000, vars.to.regress = "pct.mt")

### Integration anchors
srt.list=list(srt_dataset1,srt_dataset2,srt_dataset3)
srt.features <- SelectIntegrationFeatures(object.list = srt.list, nfeatures = 5000)
srt.list <- PrepSCTIntegration(object.list = srt.list, anchor.features = srt.features)
names(srt.list)=c("dataset1","dataset2","dataset3")
srt.anchors <- FindIntegrationAnchors(object.list = srt.list, normalization.method = "SCT",anchor.features = srt.features)

all_features <- unique(c(srt.anchors@anchor.features))

### Final integration
srt.integrated <- IntegrateData(anchorset = srt.anchors, normalization.method = "SCT",features.to.integrate=all_features)

### free memory (note: the dataset objects use underscores, not dots)
rm(srt_dataset1)
rm(srt_dataset2)
rm(srt_dataset3)
rm(srt.list)
rm(srt.anchors)

gc()

DefaultAssay(srt.integrated) = 'integrated'

##  Run the standard workflow for visualization and clustering
##  this following might not be needed as IntegrateData with SCT normalization already scales and centers
# srt.integrated <- ScaleData(srt.integrated, verbose = FALSE)
srt.integrated <- RunPCA(srt.integrated, npcs = 50, verbose = FALSE)
# UMAP and Clustering
srt.integrated <- RunUMAP(srt.integrated, reduction = "pca", dims = 1:50)
srt.integrated <- FindNeighbors(srt.integrated, reduction = "pca", dims = 1:50)
srt.integrated <- FindClusters(srt.integrated, resolution = c(0.2,0.5,0.8,1,1.2))

After that, this is what I get:

 > srt.integrated@meta.data$integrated_snn_res.0.8 %>% unique
[1] <NA>
42 Levels: 0 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 ... 9
> 

(same for all the resolutions)

All the FindClusters steps finish without an error, and they report the right number of clusters. The only message is that some singletons were found (usually two or three).

We do not have this issue with single datasets.
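
(For reference, one way to tabulate the memberships per resolution, counting NA cells explicitly; a minimal sketch over the same resolutions as above:)

## count assigned vs. NA cells for each clustering resolution
for (res in c(0.2, 0.5, 0.8, 1, 1.2)) {
  col <- paste0("integrated_snn_res.", res)
  print(table(srt.integrated@meta.data[[col]], useNA = "ifany"))
}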

The SessionInfo is the following:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5      data.table_1.12.8 Seurat_3.1.5     

loaded via a namespace (and not attached):
 [1] rsvd_1.0.3          ggrepel_0.8.2       Rcpp_1.0.4.6       
 [4] ape_5.3             lattice_0.20-38     ica_1.0-2          
 [7] tidyr_1.0.3         listenv_0.8.0       png_0.1-7          
[10] zoo_1.8-8           assertthat_0.2.1    digest_0.6.25      
[13] lmtest_0.9-37       R6_2.4.1            plyr_1.8.6         
[16] ggridges_0.5.2      httr_1.4.1          ggplot2_3.3.0      
[19] pillar_1.4.4        rlang_0.4.6         lazyeval_0.2.2     
[22] irlba_2.3.3         leiden_0.3.3        Matrix_1.2-18      
[25] reticulate_1.15     splines_3.6.3       Rtsne_0.15         
[28] stringr_1.4.0       htmlwidgets_1.5.1   uwot_0.1.8         
[31] igraph_1.2.5        munsell_0.5.0       sctransform_0.2.1  
[34] compiler_3.6.3      pkgconfig_2.0.3     globals_0.12.5     
[37] htmltools_0.4.0     tidyselect_1.1.0    gridExtra_2.3      
[40] tibble_3.0.1        RANN_2.6.1          codetools_0.2-16   
[43] fitdistrplus_1.0-14 future_1.17.0       viridisLite_0.3.0  
[46] crayon_1.3.4        dplyr_0.8.5         MASS_7.3-51.5      
[49] rappdirs_0.3.1      grid_3.6.3          tsne_0.1-3         
[52] nlme_3.1-144        jsonlite_1.6.1      gtable_0.3.0       
[55] lifecycle_0.2.0     npsurv_0.4-0.1      scales_1.1.1       
[58] KernSmooth_2.23-16  stringi_1.4.6       future.apply_1.5.0 
[61] lsei_1.2-0.1        pbapply_1.4-2       reshape2_1.4.4     
[64] ROCR_1.0-11         ellipsis_0.3.1      vctrs_0.3.0        
[67] cowplot_1.0.0       RcppAnnoy_0.0.16    RColorBrewer_1.1-2 
[70] tools_3.6.3         glue_1.4.1          purrr_0.3.4        
[73] parallel_3.6.3      survival_3.1-8      colorspace_1.4-1   
[76] cluster_2.1.0       plotly_4.9.2.1      patchwork_1.0.0    
> 

Am I doing something wrong? Everything seems to finish without errors. It takes some time but it gets to the end.

Small edit: this is the output of FindClusters:

> srt.integrated %<>% FindClusters
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 377358
Number of edges: 11810579

Running Louvain algorithm...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9215
Number of communities: 39
Elapsed time: 282 seconds
>

and the updated output (it found 39 clusters instead of the 42 shown above):

> srt.integrated@meta.data$integrated_snn_res.0.8 %>% unique
[1] <NA>
39 Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 38
> 
timoast commented 4 years ago

As stated in the integration vignette, you should not run ScaleData after integration when using SCTransform-normalized data. I'm not sure if that is the cause of this issue, but can you try without running ScaleData?
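
Something along these lines should work (a minimal sketch, reusing the parameters from the code above):

DefaultAssay(srt.integrated) <- "integrated"
## IntegrateData(..., normalization.method = "SCT") already returns corrected
## values, so go straight to PCA without calling ScaleData
srt.integrated <- RunPCA(srt.integrated, npcs = 50, verbose = FALSE)
srt.integrated <- RunUMAP(srt.integrated, reduction = "pca", dims = 1:50)
srt.integrated <- FindNeighbors(srt.integrated, reduction = "pca", dims = 1:50)
srt.integrated <- FindClusters(srt.integrated, resolution = c(0.2, 0.5, 0.8, 1, 1.2))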

micdonato commented 4 years ago

I followed this tutorial but I see that the vignette does not have the scaling step.

I checked the scaled and non-scaled data of the integrated object and they are exactly the same, except that the scaled values are capped at 10.
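
(One way to compare the two slots of the integrated assay, as a minimal sketch:)

DefaultAssay(srt.integrated) <- "integrated"
## pull the corrected values and the output of ScaleData for comparison
dat  <- GetAssayData(srt.integrated, slot = "data")
sdat <- GetAssayData(srt.integrated, slot = "scale.data")
range(dat)   # corrected values
range(sdat)  # reported above to be the same values, capped at 10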

I will retry the integration without scaling and see what happens (it will take a day or so to get it done).

Thanks for your help!

micdonato commented 3 years ago

I wanted to give an update on this. It was most likely the scaling of the data: once I removed that step from the script, the clusters were assigned properly.

We have another issue with the clusters but that is for another thread. This can be closed and labeled as solved.

Cristinacondel1 commented 1 year ago

Hi,

I wanted to reopen this issue since I'm having the same problem but my data is not integrated, so the solution offered here doesn't work for me.

I'm working with the public GSE181919 dataset downloaded from GEO. It provides a UMI count file and a metadata file, which I'm loading in the following way:

library(Seurat)
library(Matrix)

# load the UMI count table and convert it to a sparse matrix
matrix_obj <- read.table("GSE181919_UMI_counts.txt", header = T, sep = "", dec = ".")
mat <- Matrix(as.matrix(matrix_obj), sparse = TRUE)
seurat_mtx <- CreateSeuratObject(counts = mat)

# load the per-cell metadata and replace the object's meta.data with it
metadata <- read.table("GSE181919_Barcode_metadata.txt", header = T, sep = "", dec = ".")
seurat_mtx@meta.data <- metadata
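
(A quick sanity check before replacing meta.data wholesale; a minimal sketch, assuming the metadata row names are the cell barcodes:)

## the metadata rows should match the cells in the count matrix,
## in both spelling and order, before overwriting meta.data
all(rownames(metadata) == colnames(mat))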

Then I'm trying to run the normal processing workflow:

seurat_mtx <- NormalizeData(seurat_mtx, normalization.method = "LogNormalize", scale.factor = 10000)
seurat_mtx <- FindVariableFeatures(seurat_mtx, selection.method = "vst", nfeatures = 2000)

# Run the standard workflow for visualization and clustering
seurat_mtx <- ScaleData(seurat_mtx, verbose = FALSE)
seurat_mtx <- RunPCA(seurat_mtx, features = VariableFeatures(object = seurat_mtx))

# t-SNE and Clustering
seurat_mtx <- RunUMAP(seurat_mtx, reduction = "pca", dims = 1:20)
seurat_mtx <- FindNeighbors(seurat_mtx, reduction = "pca", dims = 1:20)
seurat_mtx <- FindClusters(seurat_mtx, resolution = 0.5)

seurat_mtx@meta.data$RNA_snn_res.0.5 %>% unique

This runs fine, but the output of FindClusters is:

seurat_mtx@meta.data$RNA_snn_res.0.5 %>% unique
[1] <NA>
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

I can't figure out what is going wrong, since each step seems to complete without errors. It would be really nice if you could help me!