satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

Data integration does not work for few cells when `l2.norm = FALSE` #5563

Closed FrickTobias closed 1 year ago

FrickTobias commented 2 years ago

Description

I've successfully integrated my two datasets using L2-normalization, but I get an error when l2.norm is set to FALSE. FindIntegrationAnchors() finishes without any (critical) errors, but its output somehow yields an error when passed into IntegrateData().

Error in idx[i, ] <- res[[i]][[1]] : 
  number of items to replace is not a multiple of replacement length

Some things I have noted

See the sections below sessionInfo() for details of my testing and output.

My original issue

l2.norm = TRUE (works)

> cells.anchors <- FindIntegrationAnchors(dataset_list)
Computing 2000 integration features
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 540 anchors
Filtering anchors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s  
Warning message:
In FilterAnchors(object = object.pair, assay = assay, slot = slot,  :
  Number of anchor cells is less than k.filter. Retaining all anchors.
> cells.combined <- IntegrateData(anchorset = cells.anchors)
Merging dataset 2 into 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data

l2.norm = FALSE (doesn't work)

> cells.anchors.nol2norm <- FindIntegrationAnchors(dataset_list, l2.norm = FALSE)
Computing 2000 integration features
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 132 anchors
Filtering anchors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s  
Warning message:
In FilterAnchors(object = object.pair, assay = assay, slot = slot,  :
  Number of anchor cells is less than k.filter. Retaining all anchors.
> cells.combined.nol2norm <- IntegrateData(anchorset = cells.anchors.nol2norm)
Merging dataset 2 into 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
Error in idx[i, ] <- res[[i]][[1]] : 
  number of items to replace is not a multiple of replacement length
> 

Error reproduced in pbmc3k

library(Seurat)
library(SeuratData)
library(patchwork)

InstallData("pbmc3k")

# load dataset
LoadData("pbmc3k")

# Modify data to add a subset group 
num_cells <- ncol(pbmc3k)
group_0 <- rep(0, 100)
group_1 <- rep(1, num_cells - 100)
subset_group <- c(group_0, group_1)
pbmc3k <- AddMetaData(pbmc3k, subset_group, "subset_group")

# split the dataset into a list of two seurat objects (0 and 1)
pbmc3k.list <- SplitObject(pbmc3k, split.by = "subset_group")

# normalize and identify variable features for each dataset independently
pbmc3k.list <- lapply(X = pbmc3k.list, FUN = function(x) {
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = pbmc3k.list)

# find integration anchors
pbmc.anchors <- FindIntegrationAnchors(object.list = pbmc3k.list, anchor.features = features, l2.norm = FALSE)

# this command creates an 'integrated' data assay
pbmc3k.combined <- IntegrateData(anchorset = pbmc.anchors)

Traceback

Error in idx[i, ] <- res[[i]][[1]] : 
number of items to replace is not a multiple of replacement length
8.
AnnoySearch(index = idx, query = query, k = k, search.k = search.k, 
include.distance = include.distance)
7.
AnnoyNN(data = structure(c(-0.78198620153428, 13.688986633151, 
7.82888656993388, 13.1645685032467, 20.3813868710973, -3.30730295896355, 
13.6005081304036, 16.8361601973049, -5.47081872018703, -4.87841076428184, 
5.3159342734401, -3.17862955374108, 9.88786165461944, 20.3737414147917, ...
6.
do.call(what = "AnnoyNN", args = args)
5.
NNHelper(data = data.use[anchors.cells2, ], query = data.use, 
k = k, method = nn.method, n.trees = n.trees, eps = eps)
4.
FindWeights(object = merged.obj, integration.name = integration.name, 
reduction = dr.weights, dims = dims, k = k.weight, sd.weight = sd.weight, 
eps = eps, verbose = verbose)
3.
RunIntegration(filtered.anchors = filtered.anchors, normalization.method = normalization.method, 
reference = object.1, query = object.2, cellnames.list = cellnames.list, 
new.assay.name = new.assay.name, features.to.integrate = features.to.integrate, 
features = features, dims = dims, weight.reduction = weight.reduction, ...
2.
PairwiseIntegrateReference(anchorset = anchorset, new.assay.name = new.assay.name, 
normalization.method = normalization.method, features = features, 
features.to.integrate = features.to.integrate, dims = dims, 
k.weight = k.weight, weight.reduction = weight.reduction, ...
1.
IntegrateData(anchorset = pbmc.anchors)

OS and sessionInfo()

System Software Overview:

System Version: macOS 11.6.2 (20G314) Kernel Version: Darwin 20.6.0

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin20.6.0 (64-bit)
Running under: macOS Big Sur 11.6.2

Matrix products: default
LAPACK: /usr/local/Cellar/r/4.1.2/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pbmc3k.SeuratData_3.1.4     SeuratData_0.2.1            devtools_2.4.3              usethis_2.1.5               Dict_0.1.0                  patchwork_1.1.1            
 [7] ggpubr_0.4.0                extrafont_0.17              plotrix_3.8-2               gridExtra_2.3               ggrepel_0.9.1               Matrix_1.4-0               
[13] SeuratObject_4.0.4          Seurat_4.0.6                scales_1.1.1                forcats_0.5.1               purrr_0.3.4                 readr_2.1.1                
[19] tibble_3.1.6                tidyverse_1.3.1             ggsci_2.9                   tidyr_1.1.4                 data.table_1.14.2           stringr_1.4.0              
[25] amap_0.8-18                 dplyr_1.0.7                 ggthemes_4.2.4              ggplot2_3.3.5               randomForest_4.6-14         cluster_2.1.2              
[31] Rtsne_0.15                  jcolors_0.0.4               RColorBrewer_1.1-2          DESeq2_1.34.0               SummarizedExperiment_1.24.0 Biobase_2.54.0             
[37] MatrixGenerics_1.6.0        matrixStats_0.61.0          GenomicRanges_1.46.1        GenomeInfoDb_1.30.0         IRanges_2.28.0              S4Vectors_0.32.3           
[43] BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
  [1] utf8_1.2.2             reticulate_1.22        tidyselect_1.1.1       RSQLite_2.2.9          AnnotationDbi_1.56.2   htmlwidgets_1.5.4      grid_4.1.2             BiocParallel_1.28.3   
  [9] munsell_0.5.0          codetools_0.2-18       ica_1.0-2              future_1.23.0          miniUI_0.1.1.1         withr_2.4.3            colorspace_2.0-2       knitr_1.37            
 [17] rstudioapi_0.13        ROCR_1.0-11            ggsignif_0.6.3         tensor_1.5             Rttf2pt1_1.3.9         listenv_0.8.0          GenomeInfoDbData_1.2.7 polyclip_1.10-0       
 [25] bit64_4.0.5            rprojroot_2.0.2        parallelly_1.30.0      vctrs_0.3.8            generics_0.1.1         xfun_0.29              R6_2.5.1               locfit_1.5-9.4        
 [33] bitops_1.0-7           spatstat.utils_2.3-0   cachem_1.0.6           DelayedArray_0.20.0    assertthat_0.2.1       promises_1.2.0.1       gtable_0.3.0           globals_0.14.0        
 [41] processx_3.5.2         goftest_1.2-3          rlang_0.4.12           genefilter_1.76.0      splines_4.1.2          extrafontdb_1.0        rstatix_0.7.0          lazyeval_0.2.2        
 [49] spatstat.geom_2.3-1    broom_0.7.10           modelr_0.1.8           yaml_2.2.1             reshape2_1.4.4         abind_1.4-5            backports_1.4.1        httpuv_1.6.4          
 [57] tools_4.1.2            ellipsis_0.3.2         spatstat.core_2.3-2    sessioninfo_1.2.2      ggridges_0.5.3         Rcpp_1.0.7             plyr_1.8.6             zlibbioc_1.40.0       
 [65] RCurl_1.98-1.5         ps_1.6.0               prettyunits_1.1.1      rpart_4.1-15           deldir_1.0-6           pbapply_1.5-0          cowplot_1.1.1          zoo_1.8-9             
 [73] haven_2.4.3            fs_1.5.2               magrittr_2.0.1         scattermore_0.7        reprex_2.0.1           lmtest_0.9-39          RANN_2.6.1             fitdistrplus_1.1-6    
 [81] pkgload_1.2.4          hms_1.1.1              mime_0.12              evaluate_0.14          xtable_1.8-4           XML_3.99-0.8           readxl_1.3.1           testthat_3.1.2        
 [89] compiler_4.1.2         KernSmooth_2.23-20     crayon_1.4.2           htmltools_0.5.2        tzdb_0.2.0             mgcv_1.8-38            later_1.3.0            geneplotter_1.72.0    
 [97] lubridate_1.8.0        DBI_1.1.2              dbplyr_2.1.1           rappdirs_0.3.3         MASS_7.3-55            car_3.0-12             brio_1.1.3             cli_3.1.1             
[105] parallel_4.1.2         igraph_1.2.10          pkgconfig_2.0.3        plotly_4.10.0          spatstat.sparse_2.1-0  xml2_1.3.3             annotate_1.72.0        XVector_0.34.0        
[113] rvest_1.0.2            callr_3.7.0            digest_0.6.29          sctransform_0.3.2      RcppAnnoy_0.0.19       spatstat.data_2.1-2    Biostrings_2.62.0      cellranger_1.1.0      
[121] rmarkdown_2.11         leiden_0.3.9           uwot_0.1.11            curl_4.3.2             shiny_1.7.1            lifecycle_1.0.1        nlme_3.1-153           jsonlite_1.7.2        
[129] carData_3.0-4          desc_1.4.0             viridisLite_0.4.0      fansi_0.5.0            pillar_1.6.4           lattice_0.20-45        KEGGREST_1.34.0        fastmap_1.1.0         
[137] httr_1.4.2             pkgbuild_1.3.1         survival_3.2-13        glue_1.6.1             remotes_2.4.2          png_0.1-7              bit_4.0.4              stringi_1.7.6         
[145] blob_1.2.2             memoise_2.0.1          irlba_2.3.5            future.apply_1.8.1  

Tests and outputs

Uneven split works for l2.norm=TRUE

> pbmc.anchors <- FindIntegrationAnchors(object.list = pbmc3k.list, anchor.features = features)
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 500 anchors
Filtering anchors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s  
Warning message:
In FilterAnchors(object = object.pair, assay = assay, slot = slot,  :
  Number of anchor cells is less than k.filter. Retaining all anchors.
> pbmc3k.combined <- IntegrateData(anchorset = pbmc.anchors)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data

k.filter = 99 doesn't help

> pbmc.anchors <- FindIntegrationAnchors(object.list = pbmc3k.list, anchor.features = features, l2.norm = FALSE, k.filter = 99)
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 114 anchors
Filtering anchors
    Retained 46 anchors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s  
> pbmc3k.combined <- IntegrateData(anchorset = pbmc.anchors)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
Error in idx[i, ] <- res[[i]][[1]] : 
  number of items to replace is not a multiple of replacement length
In addition: Warning message:
In RunIntegration(filtered.anchors = filtered.anchors, normalization.method = normalization.method,  :
  Number of anchors is less than k.weight. Lowering k.weight for sample pair.

An even split is no problem

This is the code I used to randomly split the data.

# Modify with a subset group: each cell is randomly assigned to group 0 or 1 with equal probability
num_cells <- length(pbmc3k$orig.ident)
set.seed(1)
subset_group <- floor(runif(num_cells, min = 0, max = 2))
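The remaining steps are assumed to mirror the reproduction code above (AddMetaData, SplitObject, per-object normalization, feature selection); a minimal sketch, not copied verbatim from the original script:

# Sketch of the assumed intermediate steps before the output below
pbmc3k <- AddMetaData(pbmc3k, subset_group, "subset_group")
pbmc3k.list <- SplitObject(pbmc3k, split.by = "subset_group")
pbmc3k.list <- lapply(pbmc3k.list, function(x) FindVariableFeatures(NormalizeData(x)))
features <- SelectIntegrationFeatures(object.list = pbmc3k.list)
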
> pbmc.anchors <- FindIntegrationAnchors(object.list = pbmc3k.list, anchor.features = features, l2.norm = FALSE)
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 2674 anchors
Filtering anchors
    Retained 2031 anchors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=08s  
> 
> pbmc3k.combined <- IntegrateData(anchorset = pbmc.anchors)
Merging dataset 2 into 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
FrickTobias commented 2 years ago

I have now also checked that the error does not have to do with the order of the cells in pbmc3k.

uneven random split (doesn't work)

# Modify data to add a subset group: ~5% of cells are randomly assigned to group 1
library(randomizr)
set.seed(1)
num_cells <- ncol(pbmc3k)
subset_group <- simple_ra(N = num_cells, prob = 0.05)
pbmc3k <- AddMetaData(pbmc3k, subset_group, "subset_group")
FrickTobias commented 2 years ago

No exact cell-count cutoff when there are too few cells

I ran the pbmc3k code above for different numbers of cells n (naively taking the first n cells to define group 0 or group 1, depending on the direction of integration); a sketch of the full sweep follows the success / fail definition below.

integrate(few_cells, many_cells)

for (n_group_0 in seq(290, 310, 1)) {...}

> fails
[1] 290 291 292 293 295 296 297 298 299 303 304
> successes
[1] 294 300 301 302 305 306 307 308 309 310 

integrate(many_cells, few_cells)

Switching the direction of integration:

for (n_group_1 in seq(290, 310, 1)) {...}

> fails
NULL
> successes
[1] 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310

Expanding the range of n for integrate(many_cells, few_cells)

for (n_group_1 in seq(31, 332, 10)) {...}

> fails
[1]  31  41  51  71  81  91 101 111 121 131 141 151 171 181 191 201 211 221
> successes
[1]  61 161 231 241 251 261 271 281 291 301 311 321 331

Success / fail definition

res <- try(pbmc3k.combined <- IntegrateData(anchorset = pbmc.anchors))
if(inherits(res, "try-error")){
  fails <- c(fails, n_group_0) # or group 1
} else {
  successes <- c(successes, n_group_0) # or group 1
}
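Putting the pieces together, a minimal sketch of how such a sweep could look (reconstructed from the fragments above rather than copied from the exact script; per-split normalization and feature selection are assumed to mirror the reproduction code):

# Sketch: sweep the size of group 0 and record which runs of IntegrateData() fail.
# Assumes pbmc3k is already loaded (see the reproduction code above).
fails <- c()
successes <- c()
for (n_group_0 in seq(290, 310, 1)) {
  # first n_group_0 cells form group 0, the remaining cells form group 1
  subset_group <- c(rep(0, n_group_0), rep(1, ncol(pbmc3k) - n_group_0))
  obj <- AddMetaData(pbmc3k, subset_group, "subset_group")
  obj.list <- SplitObject(obj, split.by = "subset_group")
  obj.list <- lapply(obj.list, function(x) FindVariableFeatures(NormalizeData(x)))
  features <- SelectIntegrationFeatures(object.list = obj.list)
  anchors <- FindIntegrationAnchors(object.list = obj.list,
                                    anchor.features = features, l2.norm = FALSE)
  res <- try(IntegrateData(anchorset = anchors))
  if (inherits(res, "try-error")) {
    fails <- c(fails, n_group_0)
  } else {
    successes <- c(successes, n_group_0)
  }
}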
KoichiHashikawa commented 2 years ago

Same here. I am performing a second round of integrative clustering on targeted cell types where each sample has ~50 cells, and I received the same error when running FindIntegrationAnchors: number of items to replace is not a multiple of replacement length

alice-y-wang commented 2 years ago

I am also experiencing the same issue. I am integrating datasets with 50-100 cells, and I get the same error when running IntegrateData().

FrickTobias commented 2 years ago

@KoichiHashikawa @alice-y-wang

Note that this issue is about the case when both of the following are true:

- `l2.norm = FALSE` is passed to FindIntegrationAnchors()
- one of the datasets being integrated contains only a few cells

If you are not using the `l2.norm = FALSE` option, you are experiencing a different issue (as my original post notes, when that option is set to TRUE (the default) integration works without any issues).

StanleyYang01 commented 2 years ago

I have the same issue using the IntegrateData() function and it returned the error below. In my case, I used the default setting of l2.norm = TRUE when running the FindIntegrationAnchors() function.

Error in idx[i, ] <- res[[i]][[1]] : number of items to replace is not a multiple of replacement length

FrickTobias commented 2 years ago

@StanleyYang01 If you are running this with l2.norm = TRUE, it is not the same issue. Please see the title of the issue.

sentisci commented 2 years ago

I have the same issue using the IntegrateData() function and it returned the error below. In my case, I used the default setting of l2.norm = TRUE when running the FindIntegrationAnchors() function.

Error in idx[i, ] <- res[[i]][[1]] : number of items to replace is not a multiple of replacement length

Were you able to resolve this issue?

FrickTobias commented 2 years ago

@sentisci Please open a new issue, or contact the person outside of this thread, if you wish to discuss something other than the issue described here.

rsatija commented 1 year ago

The issue here is not that l2.norm is somehow causing an error; the issue is that integration is failing because only a small number of anchors are being identified.

Seurat integration (as with all integration techniques) leverages shared information across similar cells (anchors) to increase robustness. When there are few anchors, the integration process can struggle to proceed. L2-normalization helps correct for scale differences across datasets, so setting it to FALSE will decrease the number of anchors returned.

You can attempt to address this by increasing k.anchor (to find more anchors) or by decreasing k.weight (which allows integration to proceed with fewer anchors). Both of these parameter settings can increase the risk of over-integration.
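For example, a minimal sketch of how these parameters could be passed (the object names follow the pbmc3k reproduction above, and the values 10 and 50 are illustrative assumptions, not recommendations from this thread):

# Illustrative only: raise k.anchor to find more anchors, and lower k.weight so
# that IntegrateData() can proceed when few anchors are available.
anchors <- FindIntegrationAnchors(object.list = pbmc3k.list,
                                  anchor.features = features,
                                  l2.norm = FALSE,
                                  k.anchor = 10)   # default is 5
combined <- IntegrateData(anchorset = anchors,
                          k.weight = 50)           # default is 100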

FrickTobias commented 1 year ago

@rsatija Thank you for taking the time to answer! I'll try that!