plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0

Error running scDblFinder #99

Closed dimitrisokolowskei closed 5 months ago

dimitrisokolowskei commented 6 months ago

Hi @plger,

While running the scDblFinder function, I got the following error:

seurat <- readRDS("data.rds") 
dim(seurat)
[1]  30791 277065

# Doublet Identification
set.seed(12345)
sce <- as.SingleCellExperiment(seurat)
sce <- scDblFinder(sce, sample="sample", BPPARAM=MulticoreParam(8))

Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth = TRUE,  :
  convergence criterion below machine epsilon
Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth = TRUE,  :
  did not converge--results might be invalid!; try increasing work or maxit

Error: BiocParallel errors
  1 remote errors, element index: 3
  0 unevaluated and other errors
  first remote error:
Error in value[[3L]](cond): An error occured while processing sample 'Stephenson et al., 2021':
Error in if (any(w <- knn$distance == 0)) knn$distance[w] <- min(knn$distance[knn$distance[, : missing value where TRUE/FALSE needed

I suspect that my dataset might be a little too large, since this error doesn't happen on smaller ones (e.g. < 100k cells). I tried increasing the number of CPU cores to 16, but then I just ran out of RAM (192 GB). Also, this dataset was assembled from several different datasets of different sizes. I don't know whether that is influencing this error, but it seemed worth pointing out.

I would appreciate any support or ideas on this issue. Thanks.

plger commented 6 months ago

Hi,

thanks for reporting. I doubt that's really related to the dataset size, as it's been run on much larger datasets, but could you report the sizes of each sample, i.e. table(sce$sample) ?

In addition it would be really helpful if you could run the following

sce <- scDblFinder(sce[,which(sce$sample=="Stephenson et al., 2021")])

and then, when the error occurs, run traceback() and report the output.

Thanks, plger

dimitrisokolowskei commented 6 months ago

Hi @plger,

Just a minor correction. In my previous post, I said I was using the sample column, but I'm actually using the study one. Here are the sizes:

table(sce$study)

      name et al., 2019       name et al., 2019 Stephenson et al., 2021 
                 159138                   13248                  104679 

Regardless, both runs hit a similar error again:

> sce <- scDblFinder(sce[,which(sce$study=="Stephenson et al., 2021")])

Error in serialize(data, node$con, xdr = FALSE) : 
  error writing to connection
In addition: Warning message:
In scDblFinder(sce[, which(sce$study == "Stephenson et al., 2021")]) :
  You are trying to run scDblFinder on a very large number of cells. If these are from different captures, please specify this using the `samples` argument.TRUE
Creating ~25000 artificial doublets...
Dimensional reduction
Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth = TRUE,  :
  convergence criterion below machine epsilon
Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth = TRUE,  :
  did not converge--results might be invalid!; try increasing work or maxit
Evaluating kNN...
Error in if (any(w <- knn$distance == 0)) knn$distance[w] <- min(knn$distance[knn$distance[,  : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In rpois(nrow(x) * length(wAd), as.numeric(as.matrix(x[, wAd]))) :
  NAs produced

Using traceback():

2: .evaluateKNN(pca, ctype, ado2, expected = ex, k = k)
1: scDblFinder(sce[, which(sce$study == "Stephenson et al., 2021")])

I'm starting to think that there's just something wrong with this Stephenson et al., 2021 dataset, since downstream steps such as sctransform() also fail on it with errors I've never seen before. If you have any further insights into this issue, I would appreciate them. Otherwise, I'm inclined to remove this dataset and move on. Thanks again for your help.

Dimitri

plger commented 6 months ago

Hi,

thanks for the extra info.

First, the sample argument should be given the individual captures, rather than the entire study (I'm assuming that the whole study, i.e. ~100k cells, was not done in a single capture). That in itself will already massively reduce the load. (While scDblFinder has been used with several hundreds of thousands of cells, it was typically 10x data, so individual captures were somewhere between 1k and 20k cells.) That being said, the error doesn't look like a memory issue.

Could you report the quantiles of library sizes for this Stephenson study? (e.g. quantile(colSums(counts(sce)[,which(sce$study == "Stephenson et al., 2021")])) if they're not already stored somewhere...)
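When several studies are involved, the same check can be done for all of them at once. The base R sketch below is only an illustration: the toy matrix and the `study` labels are made up and stand in for counts(sce) and sce$study, and nothing in it is scDblFinder API.

```r
# Toy stand-ins (hypothetical data): 6 genes x 10 cells, two studies.
set.seed(1)
toy_counts <- matrix(rpois(60, lambda = 5), nrow = 6,
                     dimnames = list(paste0("gene", 1:6), paste0("cell", 1:10)))
study <- rep(c("studyA", "studyB"), each = 5)

# Library size = total counts per cell (column sums).
lib_size <- colSums(toy_counts)

# Quantiles of library size within each study: one column per study.
per_study <- sapply(split(lib_size, study), quantile)
print(per_study)

# Raw counts are whole numbers, so their column sums should be too.
whole <- all(lib_size == round(lib_size))
print(whole)
```

If `whole` comes out FALSE on real data, the matrix probably does not hold raw counts (for example, it may have been normalized or otherwise transformed), which is worth knowing before running doublet detection.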

Thanks,

Pierre-Luc

dimitrisokolowskei commented 6 months ago

Hi @plger,

I'm sorry for the late response. Here's what I got:

quantile(colSums(counts(sce)[,which(sce$study == "Stephenson et al., 2021")]))

       0%       25%       50%       75%      100% 
 305.6683 1915.5127 2190.4214 2477.9051 3979.5989 

Thanks once again

plger commented 6 months ago

Hi, it's worth checking issue 97, which had a similar error message. Specifically, check whether any(counts(sce) < 0).
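A slightly broader version of that check can be sketched in base R. This is a minimal sketch under stated assumptions: `check_counts` is a hypothetical helper (not part of scDblFinder), and the small dense matrix stands in for counts(sce). Besides negative values, it also flags NAs and fractional values, since the library-size quantiles reported earlier in this thread are not whole numbers.

```r
# Toy stand-in for counts(sce): genes x cells, with deliberately bad values.
toy_counts <- matrix(c(3,  0,   1.5, 2,
                       0, -0.2, 4,   1),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))

# Hypothetical helper: flag properties that raw count data should not have.
check_counts <- function(m) {
  c(any_na       = anyNA(m),
    any_negative = any(m < 0, na.rm = TRUE),
    any_fraction = any(m != round(m), na.rm = TRUE))
}

res <- check_counts(toy_counts)
print(res)

# rpois() returns NA (with a warning) when given a negative rate, which is
# consistent with the rpois() warning in the log earlier in this thread.
x <- suppressWarnings(rpois(1, lambda = -0.2))
print(is.na(x))
```

On a real object, the call would be `check_counts(counts(sce))`. Any TRUE in the result suggests the assay is not raw counts.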

dimitrisokolowskei commented 6 months ago

Thanks for pointing that out, I'm going to check it. I appreciate your assistance @plger

Best, Dimitri

plger commented 6 months ago

Please let me know whether you had the same issue.