satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.26k stars 904 forks source link

FindTransfer anchors fails to integrate two prior CCA-integrated Seurat objects - please help #3551

Closed semmrich closed 3 years ago

semmrich commented 4 years ago

Hi,

I am trying to apply the "Cell type classification using an integrated reference" from the Seurat Vignette https://satijalab.org/seurat/v3.2/integration.html

I integrated 4 query Seurat datasets following this Vignettes first part, retrieving an object

S.mmuBM.integrated
An object of class Seurat 
58801 features across 19480 samples within 3 assays 
Active assay: integrated (3421 features, 3421 variable features)
 2 other assays present: RNA, ADT
 3 dimensional reductions calculated: pca, tsne, umap

I then moved on to integrate 3 reference Seurat datasets into

An object of class Seurat 
24540 features across 64650 samples within 2 assays 
Active assay: integrated (4656 features, 4656 variable features)
 1 other assay present: RNA
 3 dimensional reductions calculated: pca, tsne, umap

When I try to project the reference it fails at the first step:

tic()
Transfer.anchors <- FindTransferAnchors(reference = S.mmuREF.integrated, query = S.mmuBM.integrated, dims = 1:50, features = intersect(S.mmuREF.integrated@assays[["integrated"]]@var.features,S.mmuBM.integrated@assays[["integrated"]]@var.features))
Performing PCA on the provided reference using 1764 features as input.
Projecting PCA
Error in Loadings(object = reference[[reduction]])[features, dims] : 
  subscript out of bounds
toc()
62.89 sec elapsed

While the query integration still retains an ADT assay slot, even with that assay removed I encounter the same error. I do not understand what is wrong? The error log suggests that the feature loadings do not match. But if a new integrated PCA is performed on an intersect of the VariableFeatures between ref and query as I set with features = intersect(), then why do the individual (PCA) feature loadings of each ref and query matter? I will try to run the FindTransferAnchors on both ref and query without any prior dimreds, and if that fails will run the features = intersect() output as fixed VariableFeatures, hope that sets it. But from the Vignette it seems the "pancreas.query" did not had any VaribleFeature determined, while the ref "pancreas.integrated" had. How come those feature loadings "match" and it works?

I will appreciate any helps in this matter.

Many Thanks in advance, Stephan

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SeuratData_0.2.1            dyno_0.1.2                  dynwrap_1.2.1               dynplot_1.0.2.9000          dynmethods_1.0.5           
 [6] dynguidelines_1.0.0         dynfeature_1.0.0.9000       corrplot_0.84               scmap_1.11.0                MAST_1.15.0                
[11] Seurat_3.2.1                patchwork_1.0.1             DropletUtils_1.9.11         scran_1.17.16               scater_1.17.4              
[16] SingleCellExperiment_1.11.6 SummarizedExperiment_1.19.6 DelayedArray_0.15.7         Biobase_2.49.1              GenomicRanges_1.41.6       
[21] GenomeInfoDb_1.25.11        IRanges_2.23.10             S4Vectors_0.27.12           BiocGenerics_0.35.4         tictoc_1.0                 
[26] hash_2.2.6.1                futile.matrix_1.2.7         Matrix_1.3-0                gridExtra_2.3               gtable_0.3.0               
[31] rowr_1.1.3                  fgsea_1.15.0                VennDiagram_1.6.20          futile.logger_1.4.3         data.table_1.13.0          
[36] matrixStats_0.56.0          GSVA_1.37.4                 BBmisc_1.11                 tibble_3.0.3                readxl_1.3.1               
[41] ggplot2_3.3.2               gplots_3.0.4                dplyr_1.0.2                

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.1            R.methodsS3_1.8.1         GA_3.2                    tidyr_1.1.2               knitr_1.29               
  [6] bit64_4.0.5               irlba_2.3.3               R.utils_2.10.1            rpart_4.1-15              RCurl_1.98-1.2           
 [11] generics_0.0.2            cowplot_1.1.0             lambda.r_1.2.4            RSQLite_2.2.0             RANN_2.6.1               
 [16] carrier_0.1.0             proxy_0.4-24              future_1.18.0             bit_4.0.4                 spatstat.data_1.4-3      
 [21] httpuv_1.5.4              assertthat_0.2.1          viridis_0.5.1             xfun_0.17                 RMTstat_0.3              
 [26] hms_0.5.3                 promises_1.1.1            fansi_0.4.1               caTools_1.18.0            igraph_1.2.5             
 [31] DBI_1.1.0                 htmlwidgets_1.5.1         dyndimred_1.0.3           purrr_0.3.4               ellipsis_0.3.1           
 [36] RSpectra_0.16-0           backports_1.1.9           annotate_1.67.1           RcppParallel_5.0.2        deldir_0.1-28            
 [41] vctrs_0.3.4               remotes_2.2.0             ROCR_1.0-11               abind_1.4-5               withr_2.2.0              
 [46] ggforce_0.3.2             googleVis_0.6.6           checkmate_2.0.0           sctransform_0.2.1         goftest_1.2-2            
 [51] cluster_2.1.0             lazyeval_0.2.2            crayon_1.3.4              babelwhale_1.0.1          labeling_0.3             
 [56] edgeR_3.31.4              pkgconfig_2.0.3           tweenr_1.0.1              nlme_3.1-149              vipor_0.4.5              
 [61] rlang_0.4.7               globals_0.12.5            lifecycle_0.2.0           miniUI_0.1.1.1            rsvd_1.0.3               
 [66] cellranger_1.1.0          randomForest_4.6-14       polyclip_1.10-0           lmtest_0.9-38             graph_1.67.1             
 [71] Rhdf5lib_1.11.3           zoo_1.8-8                 beeswarm_0.2.3            ggridges_0.5.2            processx_3.4.4           
 [76] png_0.1-7                 viridisLite_0.3.0         bitops_1.0-6              R.oo_1.24.0               KernSmooth_2.23-17       
 [81] rhdf5filters_1.1.2        blob_1.2.1                DelayedMatrixStats_1.11.1 lambda.tools_1.0.9        stringr_1.4.0            
 [86] readr_1.3.1               scales_1.1.1              memoise_1.1.0             GSEABase_1.51.1           magrittr_1.5             
 [91] plyr_1.8.6                ica_1.0-2                 gdata_2.18.0              zlibbioc_1.35.0           compiler_4.0.2           
 [96] dqrng_0.2.1               RColorBrewer_1.1-2        fitdistrplus_1.1-1        cli_2.0.2                 XVector_0.29.3           
[101] listenv_0.8.0             pbapply_1.4-3             ps_1.3.4                  formatR_1.7               MASS_7.3-53              
[106] mgcv_1.8-33               tidyselect_1.1.0          stringi_1.5.3             yaml_2.2.1                BiocSingular_1.5.0       
[111] locfit_1.5-9.4            ggrepel_0.8.2             fastmatch_1.1-0           tools_4.0.2               future.apply_1.6.0       
[116] rstudioapi_0.11           foreach_1.5.0             bluster_0.99.1            rje_1.10.16               farver_2.0.3             
[121] Rtsne_0.15                ggraph_2.0.3              proxyC_0.1.5              digest_0.6.25             shiny_1.5.0              
[126] dynparam_1.0.1            Rcpp_1.0.5                scuttle_0.99.13           later_1.1.0.1             RcppAnnoy_0.0.16         
[131] httr_1.4.2                AnnotationDbi_1.51.3      colorspace_1.4-1          XML_3.99-0.5              tensor_1.5               
[136] ranger_0.12.1             reticulate_1.16           splines_4.0.2             uwot_0.1.8                lmds_0.1.0               
[141] statmod_1.4.34            spatstat.utils_1.17-0     graphlayouts_0.7.0        plotly_4.9.2.1            xtable_1.8-4             
[146] jsonlite_1.7.1            futile.options_1.0.1      tidygraph_1.2.0           spatstat_1.64-1           testthat_2.3.2           
[151] R6_2.4.1                  pillar_1.4.6              htmltools_0.5.0           mime_0.9                  glue_1.4.2               
[156] fastmap_1.0.1             BiocParallel_1.23.2       BiocNeighbors_1.7.0       class_7.3-17              codetools_0.2-16         
[161] lattice_0.20-41           ggbeeswarm_0.6.0          leiden_0.3.3              gtools_3.8.2              survival_3.2-3           
[166] limma_3.45.13             dynutils_1.0.5            munsell_0.5.0             e1071_1.7-3               rhdf5_2.33.7             
[171] GenomeInfoDbData_1.2.3    iterators_1.0.12          HDF5Array_1.17.3          reshape2_1.4.4
semmrich commented 3 years ago

Update:

When I leave out the PCA/Neighbors/Clustering/UMAP step and use the query without dimreds,

S.mmuBM.integrated
An object of class Seurat 
58801 features across 19480 samples within 3 assays 
Active assay: integrated (3421 features, 3421 variable features)
 2 other assays present: RNA, ADT

it still fails to run through:

Transfer.anchors <- FindTransferAnchors(reference = S.mmuREF.integrated, query = S.mmuBM.integrated, dims = 1:50)
Performing PCA on the provided reference using 4656 features as input.
Projecting PCA
Error in Loadings(object = reference[[reduction]])[features, dims] : 
  subscript out of bounds
 toc()
131.93 sec elapsed

Transfer.anchors <- FindTransferAnchors(reference = S.mmuREF.integrated, query = S.mmuBM.integrated, dims = 1:50,
+                                         features = intersect(S.mmuREF.integrated@assays[["integrated"]]@var.features,S.mmuBM.integrated@assays[["integrated"]]@var.features))
Performing PCA on the provided reference using 1764 features as input.
Projecting PCA
Error in Loadings(object = reference[[reduction]])[features, dims] : 
  subscript out of bounds
toc()
58.95 sec elapsed

Interestingly, the Vignette example "panc8" runs smoothly while showing the same setup of the two Seurat objects:

pancreas.query
An object of class Seurat 
34363 features across 638 samples within 1 assay 
Active assay: RNA (34363 features, 2000 variable features)
pancreas.integrated
An object of class Seurat 
36363 features across 5683 samples within 2 assays 
Active assay: integrated (2000 features, 2000 variable features)
 1 other assay present: RNA
 2 dimensional reductions calculated: pca, umap

The query has no dimreds, while the ref has. I don't get it... Whats more:

>S.mmuREF.integrated <- ScaleData(object = S.mmuREF.integrated, assay = "integrated", , features = intersect(rownames(S.mmuREF.integrated),rownames(S.mmuBM.integrated)))
Centering and scaling data matrix
  |=================================================================================================================================================| 100%
>S.mmuREF.integrated <- RunPCA(object = S.mmuREF.integrated, assay = "integrated", npcs = 50, verbose = TRUE, features = intersect(rownames(S.mmuREF.integrated),rownames(S.mmuBM.integrated)))
>S.mmuREF.integrated <- FindNeighbors(object = S.mmuREF.integrated, reduction = "pca", dims = 1:50)
Computing nearest neighbor graph
Computing SNN
>S.mmuREF.integrated <- FindClusters(object = S.mmuREF.integrated, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 64650
Number of edges: 2539416

Running Louvain algorithm...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9242
Number of communities: 23
Elapsed time: 22 seconds
> S.mmuREF.integrated <- RunUMAP(object = S.mmuREF.integrated, dims = 1:50)
16:47:21 UMAP embedding parameters a = 0.9922 b = 1.112
16:47:21 Read 64650 rows and found 50 numeric columns
16:47:21 Using Annoy for neighbor search, n_neighbors = 30
16:47:21 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:47:37 Writing NN index file to temp file C:\Users\FUCHSD~1\AppData\Local\Temp\RtmpY9vD9W\file2af42e0b14bb
16:47:37 Searching Annoy index using 1 thread, search_k = 3000
16:47:58 Annoy recall = 100%
16:47:59 Commencing smooth kNN distance calibration using 1 thread
16:48:05 Initializing from normalized Laplacian + noise
16:48:13 Commencing optimization for 200 epochs, with 2920214 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:49:32 Optimization finished
> S.mmuREF.integrated <- ScaleData(object = S.mmuREF.integrated, assay = "integrated", features = intersect(rownames(S.mmuREF.integrated),rownames(S.mmuBM.integrated)))
Centering and scaling data matrix
  |=================================================================================================================================================| 100%
>dim(S.mmuREF.integrated@reductions[["pca"]]@feature.loadings)
[1] 1764   50
> Transfer.anchors <- FindTransferAnchors(reference = S.mmuREF.integrated, query = S.mmuBM.integrated, dims = 1:50, features = intersect(S.mmuREF.integrated@assays[["integrated"]]@var.features,S.mmuBM.integrated@assays[["integrated"]]@var.features))
Performing PCA on the provided reference using 1764 features as input.
Projecting PCA
Error in Loadings(object = reference[[reduction]])[features, dims] : 
  subscript out of bounds

This is very puzzling... All feature dims can be found in each ref and query, as in the vignette, still fails to retrieve anchors??? Any ideas that I could try are highly appreciated... Thx, Stephan

semmrich commented 3 years ago

UPDATE:

To work with datasets for reference and query that share the exact same features or rownames in the Seurat object I intersected them (although this is actually not ideal and maybe not intended by the developers as well, since FindIntegrationAnchors can handle different feature loadings in the Seurat objects to integrate):

>S.mmuBM.integrated
An object of class Seurat 
60380 features across 19480 samples within 3 assays 
Active assay: integrated (5000 features, 5000 variable features)
 2 other assays present: RNA, ADT

> S.mmuREF.integrated
An object of class Seurat 
24540 features across 64650 samples within 2 assays 
Active assay: integrated (4656 features, 4656 variable features)
 1 other assay present: RNA

> common.features <- unique(intersect(rownames(S.mmuREF.integrated),rownames(S.mmuBM.integrated)))
> length(common.features)
[1] 2533
> S.mmuREF.integrated.cf <- S.mmuREF.integrated[common.features,]
> S.mmuBM.integrated.cf <- S.mmuBM.integrated[common.features,]

> Transfer.anchors <- FindTransferAnchors(reference = S.mmuREF.integrated.cf, query = S.mmuBM.integrated.cf, dims = 1:50)
Performing PCA on the provided reference using 2533 features as input.
Projecting PCA
Error in Loadings(object = reference[[reduction]])[features, dims] : 
  subscript out of bounds

At this point I am out of ideas, and have to abandon the use of the TransferAnchorsSeurat feature. Although I am still a big Fan of the Seurat package, it feels a little disappointing that this particular function does not work on a "real life" example, and while the error msg seems to be informative, the problem cannot be solved, even if the datasets feature loadings became smaller and converging up to a point of impairing a biological interpretation of the results. Maybe so finds a solution for this... St

jaisonj708 commented 3 years ago

Perhaps try changing the dims parameter - do you have less than 50 PCs?

Also feel free to send a downsampled version of your object to seuratpackage@gmail.com so we can more quickly diagnose the issue.

semmrich commented 3 years ago

Hi @jaisonj708, Thank you for the input! Actually all of my objects generated with Seurat have the 50 PCs, but I will take a closer look at it indeed.

Here is sth interesting: The issue was vexing me such that I set out to create a reprex. I did, and in order to run it one would need to download and unpack three additional datasets from 10XGenomics support in addition to the ones provided with SeuratData:

library(Seurat)
library(SeuratData)
InstallData("cbmc")
InstallData("pbmc3k")

data("cbmc")
data("pbmc3k")
# https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.1.0/manual_5k_pbmc_NGSC3_ch1
pbmc5k.data <- Read10X(data.dir = ".../pbmc5k/filtered_feature_bc_matrix/")
pbmc5k <- CreateSeuratObject(counts = pbmc5k.data, project = "pbmc5k")
# https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/frozen_bmmc_healthy_donor1
bmmc1.data <- Read10X(data.dir = ".../bmmc1/filtered_matrices_mex/hg19/")
bmmc1 <- CreateSeuratObject(counts = bmmc1.data, project = "bmmc1")
# https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/frozen_bmmc_healthy_donor2
bmmc2.data <- Read10X(data.dir = ".../bmmc2/filtered_matrices_mex/hg19/")
bmmc2 <- CreateSeuratObject(counts = bmmc2.data, project = "bmmc2")

blood.list <- list(CB=cbmc,PB3k=pbmc3k,PB5k=pbmc5k,BM1=bmmc1,BM2=bmmc2)
for (i in 1:length(blood.list)) {
    blood.list[[i]] <- NormalizeData(blood.list[[i]], verbose = FALSE)
    blood.list[[i]] <- FindVariableFeatures(blood.list[[i]], selection.method = "vst", 
        nfeatures = 2000, verbose = FALSE)
}
reference.list <- blood.list[c("CB","PB3k","BM1")]
blood.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
blood.integrated <- IntegrateData(anchorset = blood.anchors, dims = 1:30)
library(ggplot2)
library(cowplot)
library(patchwork)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(blood.integrated) <- "integrated"

# Run the standard workflow for visualization and clustering
blood.integrated <- ScaleData(blood.integrated, verbose = FALSE)
blood.integrated <- RunPCA(blood.integrated, npcs = 30, verbose = FALSE)
blood.integrated <- RunUMAP(blood.integrated, reduction = "pca", dims = 1:30)
p1 <- DimPlot(blood.integrated, reduction = "umap", group.by = "orig.ident")
p2 <- DimPlot(blood.integrated, reduction = "umap", group.by = "rna_annotations", label = TRUE, repel = TRUE) + NoLegend()
p3 <- DimPlot(blood.integrated, reduction = "umap", group.by = "protein_annotations", label = TRUE, repel = TRUE) + NoLegend()
p1 + p2 + p3
blood.query <- blood.list[c("PB5k","BM2")]
blood.query.anchors <- FindIntegrationAnchors(object.list = blood.query, dims = 1:30)
blood.query.integrated <- IntegrateData(anchorset = blood.query.anchors, dims = 1:30)
blood.anchors <- FindTransferAnchors(reference = blood.integrated, query = blood.query.integrated, dims = 1:30)
predictions.rna <- TransferData(anchorset = blood.anchors, refdata = blood.integrated$rna_annotation, dims = 1:30)
blood.query.integrated <- AddMetaData(blood.query.integrated, metadata = predictions.rna)
blood.query.integrated <- ScaleData(blood.query.integrated, verbose = FALSE)
blood.query.integrated <- RunPCA(blood.query.integrated, npcs = 30, verbose = FALSE)
blood.query.integrated <- RunUMAP(blood.query.integrated, reduction = "pca", dims = 1:30)
DimPlot(blood.query.integrated, reduction = "umap", group.by = "predicted.id", label = TRUE, repel = TRUE) + NoLegend()

Long story short - it WORKS!! It's not Seurat, it's in my data somehow. I will go over the dims and give it a fresh start with a downsampled version, if that reproduces the error, I would send it to the email address you suggested. I will close this issue as soon as I solve it on my own or with some help.