twoneu commented 1 year ago

Hi! Thank you for this great tool. I am encountering the error in the title when running scDblFinder on a large dataset (CellRanger estimated ~20,000 cells):

Assuming the input to be a matrix of counts or expected counts.

Aggregating features...

Warning message:
"Quick-TRANSfer stage steps exceeded maximum (= 1905250)"
Creating ~11084 artificial doublets...

Dimensional reduction

Evaluating kNN...

Training model...

Error in if (length(expected) > 1 && x > min(expected) && x < max(expected)) return(0): missing value where TRUE/FALSE needed

I have not encountered this error in several other (much smaller) samples I have tried, so is this related to the dataset being too large?

Traceback:
1. scDblFinder(peak_assay, aggregateFeatures = TRUE, nfeatures = 25, 
 .     processing = "normFeatures")
2. .scDblscore(d, scoreType = score, addVals = pca[, includePCs, 
 .     drop = FALSE], threshold = threshold, dbr = dbr, dbr.sd = dbr.sd, 
 .     nrounds = nrounds, max_depth = max_depth, iter = iter, BPPARAM = BPPARAM, 
 .     features = trainingFeatures, verbose = verbose, metric = metric, 
 .     filterUnidentifiable = removeUnidentifiable, unident.th = unident.th)
3. which((d$type == "real" & doubletThresholding(d, dbr = dbr, dbr.sd = dbr.sd, 
 .     stringency = 0.7, perSample = perSample, returnType = "call") == 
 .     "doublet") | (d$type == "doublet" & d$score < unident.th & 
 .     filterUnidentifiable) | !d$include.in.training)
4. doubletThresholding(d, dbr = dbr, dbr.sd = dbr.sd, stringency = 0.7, 
 .     perSample = perSample, returnType = "call")
5. .optimThreshold(d, dbr = .gdbr(d, dbr), dbr.sd = dbr.sd, stringency = stringency)
6. optimize(totfn, c(0, 1), maximum = FALSE)
7. (function (arg) 
 . f(arg, ...))(0.381966011250105)
8. f(arg, ...)
9. .prop.dev(d$type, d$score, expected, x)

Session info

`R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/envs/NET_R_env/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.66.3                  
 [3] Biostrings_2.66.0                 XVector_0.38.0                   
 [5] CopyscAT_0.40                     MASS_7.3-60                      
 [7] jsonlite_1.8.4                    sp_1.6-1                         
 [9] rtracklayer_1.58.0                gplots_3.1.3                     
[11] tibble_3.2.1                      tidyr_1.3.0                      
[13] edgeR_3.40.2                      limma_3.54.2                     
[15] stringr_1.5.0                     mclust_6.0.0                     
[17] changepoint_2.2.4                 zoo_1.8-12                       
[19] data.table_1.14.8                 igraph_1.4.3                     
[21] FNN_1.1.3.2                       Rtsne_0.16                       
[23] biomaRt_2.54.0                    fastcluster_1.2.3                
[25] NMF_0.26                          cluster_2.1.4                    
[27] rngtools_1.5.2                    registry_0.5-1                   
[29] viridis_0.6.3                     viridisLite_0.4.2                
[31] dplyr_1.1.2                       RColorBrewer_1.1-3               
[33] scDblFinder_1.13.13               SingleCellExperiment_1.20.1      
[35] SummarizedExperiment_1.28.0       MatrixGenerics_1.10.0            
[37] matrixStats_0.63.0                glue_1.6.2                       
[39] ggplot2_3.4.1                     EnsDb.Hsapiens.v86_2.99.0        
[41] ensembldb_2.22.0                  AnnotationFilter_1.22.0          
[43] GenomicFeatures_1.50.2            AnnotationDbi_1.60.0             
[45] Biobase_2.58.0                    GenomicRanges_1.50.2             
[47] GenomeInfoDb_1.34.9               IRanges_2.32.0                   
[49] S4Vectors_0.36.2                  BiocGenerics_0.44.0              
[51] Signac_1.10.0                     SeuratObject_4.1.3               
[53] Seurat_4.3.0                     

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3            pbdZMQ_0.3-9             
  [3] scattermore_1.0           bit64_4.0.5              
  [5] irlba_2.3.5.1             DelayedArray_0.24.0      
  [7] KEGGREST_1.38.0           RCurl_1.98-1.12          
  [9] doParallel_1.0.17         generics_0.1.3           
 [11] ScaledMatrix_1.6.0        cowplot_1.1.1            
 [13] RSQLite_2.2.20            RANN_2.6.1               
 [15] future_1.32.0             bit_4.0.5                
 [17] spatstat.data_3.0-1       xml2_1.3.4               
 [19] httpuv_1.6.11             hms_1.1.3                
 [21] evaluate_0.21             promises_1.2.0.1         
 [23] fansi_1.0.4               restfulr_0.0.15          
 [25] progress_1.2.2            caTools_1.18.2           
 [27] dbplyr_2.3.1              DBI_1.1.3                
 [29] htmlwidgets_1.6.2         spatstat.geom_3.2-1      
 [31] purrr_1.0.1               ellipsis_0.3.2           
 [33] gridBase_0.4-7            deldir_1.0-9             
 [35] sparseMatrixStats_1.10.0  vctrs_0.6.2              
 [37] ROCR_1.0-11               abind_1.4-5              
 [39] cachem_1.0.8              withr_2.5.0              
 [41] progressr_0.13.0          sctransform_0.3.5        
 [43] GenomicAlignments_1.34.1  prettyunits_1.1.1        
 [45] scran_1.26.2              goftest_1.2-3            
 [47] IRdisplay_1.1             lazyeval_0.2.2           
 [49] crayon_1.5.2              spatstat.explore_3.2-1   
 [51] pkgconfig_2.0.3           nlme_3.1-162             
 [53] vipor_0.4.5               ProtGenerics_1.30.0      
 [55] rlang_1.1.0               globals_0.16.2           
 [57] lifecycle_1.0.3           miniUI_0.1.1.1           
 [59] filelock_1.0.2            BiocFileCache_2.6.0      
 [61] rsvd_1.0.5                polyclip_1.10-4          
 [63] lmtest_0.9-40             Matrix_1.5-4             
 [65] IRkernel_1.3.2            base64enc_0.1-3          
 [67] beeswarm_0.4.0            ggridges_0.5.4           
 [69] png_0.1-8                 rjson_0.2.21             
 [71] bitops_1.0-7              KernSmooth_2.23-21       
 [73] blob_1.2.3                DelayedMatrixStats_1.20.0
 [75] parallelly_1.35.0         spatstat.random_3.1-5    
 [77] beachmat_2.14.2           scales_1.2.1             
 [79] memoise_2.0.1             magrittr_2.0.3           
 [81] plyr_1.8.8                ica_1.0-3                
 [83] zlibbioc_1.44.0           compiler_4.2.2           
 [85] dqrng_0.3.0               BiocIO_1.8.0             
 [87] fitdistrplus_1.1-11       Rsamtools_2.14.0         
 [89] cli_3.6.1                 listenv_0.9.0            
 [91] patchwork_1.1.2           pbapply_1.7-0            
 [93] tidyselect_1.2.0          stringi_1.7.12           
 [95] yaml_2.3.7                BiocSingular_1.14.0      
 [97] locfit_1.5-9.7            ggrepel_0.9.3            
 [99] grid_4.2.2                fastmatch_1.1-3          
[101] tools_4.2.2               future.apply_1.10.0      
[103] parallel_4.2.2            uuid_1.1-0               
[105] bluster_1.8.0             foreach_1.5.2            
[107] metapod_1.6.0             gridExtra_2.3            
[109] digest_0.6.31             BiocManager_1.30.20      
[111] shiny_1.7.4               Rcpp_1.0.10              
[113] scuttle_1.8.4             later_1.3.1              
[115] RcppAnnoy_0.0.20          httr_1.4.5               
[117] colorspace_2.1-0          XML_3.99-0.14            
[119] tensor_1.5                reticulate_1.28          
[121] splines_4.2.2             uwot_0.1.14              
[123] RcppRoll_0.3.0            statmod_1.5.0            
[125] spatstat.utils_3.0-3      scater_1.26.1            
[127] xgboost_1.7.5.1           plotly_4.10.1            
[129] xtable_1.8-4              R6_2.5.1                 
[131] pillar_1.9.0              htmltools_0.5.5          
[133] mime_0.12                 fastmap_1.1.1            
[135] BiocParallel_1.32.6       BiocNeighbors_1.16.0     
[137] codetools_0.2-19          utf8_1.2.3               
[139] lattice_0.21-8            spatstat.sparse_3.0-1    
[141] curl_4.3.3                ggbeeswarm_0.7.2         
[143] leiden_0.4.3              gtools_3.9.4             
[145] survival_3.5-5            repr_1.1.6               
[147] munsell_0.5.0             GenomeInfoDbData_1.2.9   
[149] iterators_1.0.14          reshape2_1.4.4           
[151] gtable_0.3.3             `

plger commented 1 year ago

Hi, 20k cells certainly shouldn't be a problem, how many peaks do you have? Do you have mbkmeans installed? Can you reproduce the error without the aggregation (i.e. just scDblFinder(peak_assay) ) ? I've never seen this error, so if it'd be possible to share the object (ideally smaller if you can still reproduce the error, genes/samples can obviously be scrambled) it'll make debugging easier. Pierre-Luc

plger commented 1 year ago

Hi @twoneu , please respond or I'll close the issue.

nhj4064 commented 1 year ago

I also have same error. Did you solve it? I have also just 20000 cell( in cellranger websummary). but My another data have also 20000cell. but it ran fine without error.

plger commented 1 year ago

Then please provide the extra info requested above.

nhj4064 commented 1 year ago

Owner counts <- Read10X_h5("/data/jrgong/AD_multiome/fastq/aggr_2023_0616/aggr_2023_0616/outs/filtered_feature_bc_matrix.h5") fragpath <- "/data/jrgong/AD_multiome/fastq//aggr_2023_0616/aggr_2023_0616/outs/atac_fragments.tsv.gz"

get gene annotations for hg 38

annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)

seqlevelsStyle(annotation) <- "UCSC"

create a Seurat object containing the RNA data

AH_Mul <- CreateSeuratObject(counts = counts$Gene Expression, assay = "RNA")

create a Seurat object containing the ATAC data

AH_Mul[["ATAC"]] <- CreateChromatinAssay(counts = counts$Peaks, sep = c(":","-"), fragments = fragpath, annotation = annotation)

AH.PBMC.7 <- subset(AH_Mul, subset=aggr_number=="7")

ce <- scDblFinder(SingleCellExperiment(list(counts=AH.PBMC.7@assays$RNA@counts))) AH.PBMC.7$doublet_scores <- sce$scDblFinder.score AH.PBMC.7$doublet_class <- sce$scDblFinder.class

and then I saw the error like this :

Creating ~16000 artificial doublets...

Dimensional reduction

Evaluating kNN...

Training model...

Error in if (length(expected) > 1 && x > min(expected) && x < max(expected)) return(0) :

missing value where TRUE/FALSE needed

could you let me know how to solve it?

plger commented 1 year ago

Can you please report your session info (as one should always do)?

twoneu commented 1 year ago

Hi @plger, sorry for the delay! What is the best way to share the data with you?

plger commented 1 year ago

If you don't have a drive/platform where you can put it, send me an email at pierre-luc.germain [at ] hest.ethz.ch and I'll give you a link.

twoneu commented 1 year ago

Thank you @plger for helping me solve this issue! I was able to successfully run scDblFinder by:

installing the mbkmeans package
Rerunning scDblFinder once the package was installed.

alexlenail commented 1 week ago

installing mbkmeans did not solve the issue for me. I'm still getting:

Creating ~25000 artificial doublets...
Dimensional reduction
Evaluating kNN...
Training model...
Error in if (length(expected) > 1 && x > min(expected) && x < max(expected)) return(0) :
  missing value where TRUE/FALSE needed
Calls: scDblFinder ... .optimThreshold -> optimize -> <Anonymous> -> f -> .prop.dev
In addition: Warning message:
In scDblFinder(sce) :
  You are trying to run scDblFinder on a very large number of cells. If these are from different captures, please specify this using the `samples` argument.TRUE
Execution halted

plger commented 1 week ago

Please report session info.

alexlenail commented 1 week ago

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: GridOS 22.04.5

Matrix products: default
BLAS/LAPACK: /home/gridsan/lenail/.conda/envs/SoupX/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
[1] C

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] Matrix_1.6-5                scDblFinder_1.16.0
 [3] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
 [5] Biobase_2.62.0              GenomicRanges_1.54.1
 [7] GenomeInfoDb_1.38.1         IRanges_2.36.0
 [9] S4Vectors_0.40.2            BiocGenerics_0.48.1
[11] MatrixGenerics_1.14.0       matrixStats_1.4.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1          viridisLite_0.4.2
 [3] dplyr_1.1.4               vipor_0.4.7
 [5] viridis_0.6.5             Biostrings_2.70.1
 [7] bitops_1.0-9              RCurl_1.98-1.16
 [9] bluster_1.12.0            GenomicAlignments_1.38.0
[11] XML_3.99-0.17             rsvd_1.0.5
[13] lifecycle_1.0.4           cluster_2.1.6
[15] statmod_1.5.0             magrittr_2.0.3
[17] compiler_4.3.3            rlang_1.1.4
[19] tools_4.3.3               igraph_2.0.3
[21] utf8_1.2.4                yaml_2.3.10
[23] data.table_1.15.4         rtracklayer_1.62.0
[25] S4Arrays_1.2.0            dqrng_0.3.2
[27] xgboost_2.1.1.1           DelayedArray_0.28.0
[29] abind_1.4-5               BiocParallel_1.36.0
[31] grid_4.3.3                fansi_1.0.6
[33] beachmat_2.18.0           colorspace_2.1-1
[35] edgeR_4.0.16              ggplot2_3.5.1
[37] scales_1.3.0              MASS_7.3-60.0.1
[39] cli_3.6.3                 crayon_1.5.3
[41] generics_0.1.3            metapod_1.10.0
[43] rjson_0.2.23              DelayedMatrixStats_1.24.0
[45] scuttle_1.12.0            ggbeeswarm_0.7.2
[47] zlibbioc_1.48.0           parallel_4.3.3
[49] XVector_0.42.0            restfulr_0.0.15
[51] vctrs_0.6.5               jsonlite_1.8.9
[53] BiocSingular_1.18.0       BiocNeighbors_1.20.0
[55] ggrepel_0.9.6             irlba_2.3.5.1
[57] beeswarm_0.4.0            scater_1.30.1
[59] locfit_1.5-9.9            limma_3.58.1
[61] glue_1.8.0                codetools_0.2-20
[63] gtable_0.3.5              BiocIO_1.12.0
[65] ScaledMatrix_1.10.0       munsell_0.5.1
[67] tibble_3.2.1              pillar_1.9.0
[69] GenomeInfoDbData_1.2.11   R6_2.5.1
[71] sparseMatrixStats_1.14.0  lattice_0.22-6
[73] Rsamtools_2.18.0          scran_1.30.0
[75] Rcpp_1.0.13               gridExtra_2.3
[77] SparseArray_1.2.2         pkgconfig_2.0.3

plger commented 1 week ago

If possible try using the latest version (e.g. installing from github). I made some changes that fix thresholding issues in some circumstances which could have lead to such an error, although I can't say whether it's the case here. If you still encounter the issue, please indicate the exact command you used.

alexlenail commented 1 week ago

I installed the github version. I'm still getting the same error.

Creating ~25000 artificial doublets...
Dimensional reduction
Evaluating kNN...
Training model...
Error in if (length(expected) > 1 && x > min(expected) && x < max(expected)) return(0) :
  missing value where TRUE/FALSE needed
Calls: scDblFinder ... .optimThreshold -> optimize -> <Anonymous> -> f -> .prop.dev
In addition: Warning message:
In scDblFinder(sce) :
  You are trying to run scDblFinder on a very large number of cells. If these are from different captures, please specify this using the `samples` argument.TRUE
Execution halted

My script looks like

mat <- readMM(gzfile(matrix_file))
barcodes <- read.delim(gzfile(barcodes_file), header = FALSE)
features <- read.delim(gzfile(features_file), header = FALSE)

rownames(mat) <- features[, 1]
colnames(mat) <- barcodes[, 1]

sce <- SingleCellExperiment(assays = list(counts = mat))

sce <- scDblFinder(sce)

scDblFinder_out_dir <- file.path(sample_dir, "scDblFinder_soupx_results")
dir.create(scDblFinder_out_dir, showWarnings = FALSE)

doublet_results_file <- file.path(scDblFinder_out_dir, "doublet_results.tsv")
write.table(colData(sce), doublet_results_file, sep = "\t", quote = FALSE)

This script worked for 94/96 samples, but failed for samples F10 and G10. I suspect there's something about the data in those samples which is making scDblFinder crash. In what circumstances would scDblFinder crash this way?

plger commented 1 week ago

Thanks for trying it.

The known reasons so far were: 1) some issues related to feature aggregation with ATAC-seq data (solved by installing mbkmeans, above) - not your problem. 2) no significant anti-correlation between genes leading to flat cxds scores, which are normally use to bootstrap the iterative procedure. I've only seen this happen once in data where all cells looked the same, but anyway now there's a failsafe for that. 3) a too large fraction of doublets detected in the first training iteration, which was also solved in recent versions.

If you can share with me the matrix of one of the two samples (can be without the cells/feature IDs), I'll try to figure out what's happening in your case.

alexlenail commented 1 week ago

@plger I emailed my matrix to the address you provided above. Let me know if you do not receive it.

plger commented 5 days ago

Hi,

I got your data, thanks, and finally got to try it: I could run scDblFinder on it without problem. The scores have a nice bimodal distribution and all.

One thing occurred to me: are you sure that you did in fact install the latest scDblFinder? I.e. does packageVersion("scDblFinder") give you 1.19.7? I realized that if you only tried to install the github latest version without updating to bioc devel, it would fail because of some changes that were introduced in BiocNeighbors and which you don't have. So to get the bug fixes without getting the changes that require the upcoming Bioc version, you need to pull a specific commit, e.g. doing this: remotes::install_github("plger/scDblFinder", ref="6f82238ea97f393f6ef4c475f3a19ccb3a88898f")

Let me know if that fixes your issue,

Pierre-Luc

alexlenail commented 5 days ago

Ah!

> packageVersion("scDblFinder")
[1] '1.16.0'

Let me fix this and see what happens.

plger / scDblFinder

Error in if (length(expected) > 1 && x > min(expected) && x < max(expected)) return(0): missing value where TRUE/FALSE needed #78

get gene annotations for hg 38

create a Seurat object containing the RNA data

create a Seurat object containing the ATAC data

Creating ~16000 artificial doublets...

Dimensional reduction

Evaluating kNN...

Training model...

Error in if (length(expected) > 1 && x > min(expected) && x < max(expected)) return(0) :

missing value where TRUE/FALSE needed