plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0

doublets mostly differ between scDblFinder and doubletFinder #67

Closed gianfilippo closed 1 year ago

gianfilippo commented 1 year ago

Hi,

I tested your algorithm and doubletFinder on a single 10X PBMC sample of about 7600 cells (after some basic filtering).

I do not see a lot of overlap between the two. See below the output from table(doubletFinder,scDblFinder):

                 scDblFinder
    doubletFinder doublet singlet
          Doublet     124     317
          Singlet     340    6983

scDblFinder call: pbmc.sce <- scDblFinder(pbmc.sce, clusters="res.1.2",dbr=0.06, dims=50)

doubletFinder_v3 calls:
pbmc <- doubletFinder_v3(pbmc, PCs = 1:50, pN = 0.25, pK = 0.02, nExp = nExp_poi.adj, reuse.pANN = F, sct = T)
pbmc <- doubletFinder_v3(pbmc, PCs = 1:50, pN = 0.25, pK = 0.02, nExp = nExp_poi.adj, reuse.pANN = "pANN_0.25_0.02_466", sct = T)

Is this expected? Did I make a mistake in the calls?

Thanks

plger commented 1 year ago

Hi,

there wouldn't be much of an incentive to develop a new method if it gave the same results as existing ones :)

That's still a very significant overlap, but if you wanted to increase it you could first run scDblFinder without the clusters (random mode), since DoubletFinder also generates artificial doublets randomly. In the benchmarks it's not clear whether the cluster-based or the random mode is best (both are generally superior to DoubletFinder), but random seems to perform better for more complex datasets.

Another important difference, not strictly within the method itself, is the processing (e.g. you're using sctransform with DoubletFinder). To do something similar with scDblFinder, see the "Can I use this in combination with Seurat or other tools?" section of the vignette.

But besides these differences, the algorithms differ fundamentally (the ratio of artificial doublets in the neighborhood, which DoubletFinder uses, is only one of the features used by scDblFinder) and hence will only converge on the easy doublets.
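To put a number on the overlap, one option is to rebuild the 2x2 table reported above and compare the shared calls with what chance alone would give. The sketch below is not part of either package; it uses only base R, and the choice of the Jaccard index and a one-sided Fisher's exact test as summaries is an assumption about what is useful here.

```r
# Counts copied from the table(doubletFinder, scDblFinder) output above.
tab <- matrix(c(124, 340, 317, 6983), nrow = 2,
              dimnames = list(doubletFinder = c("Doublet", "Singlet"),
                              scDblFinder   = c("doublet", "singlet")))

both    <- tab["Doublet", "doublet"]             # called by both methods
either  <- sum(tab) - tab["Singlet", "singlet"]  # called by at least one
jaccard <- both / either                         # shared fraction of all calls

# Number of shared calls expected if the two sets of calls were independent.
expected <- sum(tab["Doublet", ]) * sum(tab[, "doublet"]) / sum(tab)

# One-sided Fisher's exact test for enrichment of shared calls.
ft <- fisher.test(tab, alternative = "greater")

round(c(jaccard = jaccard, expected = expected), 3)
ft$p.value
```

Under independence about 26 shared calls would be expected, against the 124 observed, so the agreement is far above chance even though most calls are made by only one of the two methods.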

gianfilippo commented 1 year ago

Hi,

definitely :)

I did a rerun with random mode and ended up with 5-10% fewer doublets. I will probably stick with this, as I can use it well before any clustering.

I am mostly interested in how much overlap you are expecting, as I was wondering whether it makes sense to use a combination of the two predictions.

I looked at the SCT-related info, thanks for pointing it out. How can I implement the SCT transform as a custom function? Would it really make a difference?

Would you expect an issue in datasets with cell overloading (i.e. larger than recommended input number of cells) ?

Thanks

plger commented 1 year ago

I've seen some attempts at combining the methods, but there wasn't a clear improvement, although I could still try it on the 16 benchmark datasets to be sure.

I doubt that SCT makes a very big difference for this purpose, but to be honest I haven't seriously tested it (will also add it to my to-do). You could implement it like this (code off the top of my head, haven't tested):

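# custom processing: takes a count matrix and the number of dimensions, returns a PCA matrix
myfun <- function(e, dims){
  # variance-stabilizing transformation; the residuals are in vst$y
  vst <- sctransform::vst(e)
  # top 3000 genes by residual variance (head() avoids NAs if fewer genes are available)
  hvg <- head(order(vst$gene_attr$residual_variance, decreasing=TRUE), 3000)
  scater::calculatePCA(vst$y, subset_row=hvg, ncomponents=dims)
}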

sce <- scDblFinder(sce, ..., processing=myfun)

Overloading shouldn't be a problem, you'll just have more doublets - just make sure that the dbr is set accordingly or left null (so that it's estimated from the cell number).
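As a rough guide to what "set accordingly" can mean, the rule of thumb mentioned later in this thread (about 1% doublets per 1000 cells captured) can be written out as below. This is only the back-of-the-envelope heuristic, not scDblFinder's internal estimate, and `expected_dbr` is a hypothetical helper name.

```r
# Heuristic: the doublet rate grows by about 1% for every 1000 cells captured.
expected_dbr <- function(n_cells, per_1k = 0.01) {
  per_1k * n_cells / 1000
}

n_cells <- 124 + 317 + 340 + 6983   # cells in the table earlier in the thread
dbr     <- expected_dbr(n_cells)    # expected fraction of doublets
n_dbl   <- round(dbr * n_cells)     # expected number of doublets

c(n_cells = n_cells, dbr = dbr, n_doublets = n_dbl)
```

Because the rate scales with the number of cells, an overloaded run moves both the rate and the absolute number of expected doublets upward, which is why a fixed dbr carried over from a smaller run can be too low.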

gianfilippo commented 1 year ago

Hi,

I look forward to seeing the benchmark on a combo approach.

Thanks for the example.

I missed the capability of the function to be able to estimate dbr. I will give it a try.

Thanks for all your help!

plger commented 1 year ago

Hi, I just ran the benchmark on two simple combinations of DoubletFinder and scDblFinder (random mode): the mean of the scores, and the Fisher p-value aggregation of 1-score (both give similar results). As you can see in the attached figure, combining does provide (generally mild) improvements on most datasets (11/16 improved, vs only 4 worse) and overall (.01 increase in mean AUPRC). So it does sound like it can be worth it if the datasets are not excessively large (since DoubletFinder is sort of slow).

[image: scDblFinder_DoubletFinder_combinations] https://user-images.githubusercontent.com/9786697/207947829-a3c3209c-adee-49f2-baa8-332e119fde9d.png

(The values printed are the AUPRC, while the colors are per-dataset z-scores of those values. The DoubletFinder scores are the ones from the version used in my paper; I didn't re-run the analyses.)

Best,
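For readers who want to try the two combinations described here, a minimal sketch follows. It assumes both scores are already on a 0 to 1 scale (higher meaning more doublet-like) and in the same cell order. `combine_scores` is a hypothetical helper, not a function from either package or from the benchmark code, and treating 1-score as a p-value is only a convenient device for aggregation, not a calibrated test.

```r
combine_scores <- function(s1, s2) {
  stopifnot(length(s1) == length(s2),
            all(s1 >= 0 & s1 <= 1), all(s2 >= 0 & s2 <= 1))
  # Combination 1: plain mean of the two scores.
  mean_score <- (s1 + s2) / 2
  # Combination 2: Fisher's method on the pseudo p-values (1 - score).
  # The statistic -2 * sum(log(p)) is chi-squared with 2k df (here k = 2).
  stat     <- -2 * (log(1 - s1) + log(1 - s2))
  fisher_p <- pchisq(stat, df = 4, lower.tail = FALSE)
  # Return both on a "higher means more doublet-like" scale.
  data.frame(mean = mean_score, fisher = 1 - fisher_p)
}

# Toy scores for three cells (illustrative values, not real data).
scdbl <- c(0.95, 0.10, 0.60)
dblf  <- c(0.80, 0.20, 0.30)
combine_scores(scdbl, dblf)
```

Both outputs are scores rather than calls, so a cutoff still has to be chosen separately.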

gianfilippo commented 1 year ago

Hi,

thanks for looking into this!!

Did you have a chance to see how your algorithm handles intermediate states, i.e. whether it scores them as doublets?

Thanks

plger commented 1 year ago

I usually work with adult brain tissues, so it's not the usual scenario in which this arises and I only have anecdotal evidence, but I remember a case where we had two small populations between oligodendrocytes and oligodendrocyte progenitors, one of which had high doublet scores and the other not. In general, the genuine intermediate states I've seen were not truly a linear combination of the start and end states (i.e. they were so only for a subset of the genes), so I'd tend to think that they're not flagged as doublets, but I can't exclude that it could happen, especially if they're very rare.

gianfilippo commented 1 year ago

Ok,

Thanks a lot for your help!

plger commented 1 year ago

Hi again, so I've tested the sctransform variant, it improves the calls in only a minority of datasets:

[image]

scDblFinder.HVGsctransform refers to the normal scDblFinder but using sctransform's HVGs. scDblFinder.sctransform additionally uses sctransform's variance-stabilized expression values for the reduced dimension, kNN & so on (code here, I'll be adding a note in the vignette)

Even if I myself praised sctransform, this isn't completely a surprise. As @const-ae showed in his most excellent discussion of scRNA transformation methods, sctransform tends to smudge the bimodality of marker genes, so I'd be tempted to speculate that this might be the reason for the effect we're seeing here.

(Note that the normal scDblFinder performance results are different from the previous due to alterations in the default parameters in the most recent versions)

gianfilippo commented 1 year ago

Thanks for looking into it. Definitely useful!

plger commented 1 year ago

A note on this has been added to the vignette, so I'll close the issue now.

ccruizm commented 1 year ago

Hello @plger! Loved this comparison you made with your tool and DoubletFinder (and the improvement in some instances when combining both). Would it be possible to share the code on how you integrate the scores of both tools and then filter the cells based on that, please? I would like to see if it enhances the detection of doublets in my dataset.

Thanks in advance!

plger commented 1 year ago

The benchmark which runs the two methods is here (from the paper), and the combinations are made and compared here.

ccruizm commented 1 year ago

Thanks for pointing out where the code is! I will have a look.

ccruizm commented 1 year ago

Hello @plger,

I have run both tools on my dataset. I have the F1 score and mean as you performed in your comparison. However, I am lost on how to use the scores yielded by each to call doublets. I don't know which cells are truly doublets, so I cannot get an accurate calculation of the AUC. How would you recommend setting a threshold to call doublets using this approach?

This is what the scoring looks like for my dataset:

DoubletFinder: [image: Screenshot 2023-03-21 at 22 02 09]

scDblFinder: [image: Screenshot 2023-03-21 at 22 02 21]

Comparison: [image: Screenshot 2023-03-21 at 22 02 34]

Thanks in advance and appreciate any feedback you can provide.

plger commented 1 year ago

While the scDblFinder scores have an easy interpretation and scDblFinder has its own thresholding procedure, that's not the case for DoubletFinder, let alone for the combination of the two. If you really want to use the combination (given the marginal improvement it brings) you'll have to rely on an arbitrary cutoff or the expected number of doublets (e.g. 1% per 1000 cells captured).
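One way to apply the "expected number of doublets" option is to flag the top-ranked cells by combined score. The sketch below is base R only and assumes `score` is any combined score where higher means more doublet-like. `call_top_n` is a hypothetical helper, and the 1% per 1000 cells rate is the rule of thumb quoted above, not a calibrated threshold.

```r
call_top_n <- function(score, per_1k = 0.01) {
  n     <- length(score)
  # Expected number of doublets: rate (per_1k * n / 1000) times n cells.
  n_dbl <- min(round(n * per_1k * n / 1000), n)
  # Rank cells from highest to lowest score and flag the first n_dbl.
  top   <- order(score, decreasing = TRUE)[seq_len(n_dbl)]
  calls <- rep("singlet", n)
  calls[top] <- "doublet"
  factor(calls, levels = c("singlet", "doublet"))
}

set.seed(1)
score <- runif(2000)   # simulated scores, for illustration only
calls <- call_top_n(score)
table(calls)
```

With 2000 cells the heuristic rate is 2%, so 40 cells are flagged. The cutoff is as arbitrary as the assumed rate, so it is worth checking how the calls change when `per_1k` is varied.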

ccruizm commented 1 year ago

Thanks for the advice! I will play around with it.

uqnsarke commented 1 year ago

How do I keep only the "singlets" from scDblFinder?
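scDblFinder stores its calls in the `scDblFinder.class` column of the object's colData, so keeping singlets comes down to subsetting on that column. The sketch below uses a plain data frame as a stand-in for `colData(sce)` so that it runs without Bioconductor installed; the cell names and scores are made up.

```r
# With a real SingleCellExperiment returned by scDblFinder(), the subsetting
# would be:
#   sce <- sce[, sce$scDblFinder.class == "singlet"]

# Stand-in for colData(sce), with made-up values.
cd <- data.frame(
  row.names         = paste0("cell", 1:5),
  scDblFinder.score = c(0.02, 0.91, 0.10, 0.75, 0.05),
  scDblFinder.class = factor(c("singlet", "doublet", "singlet",
                               "doublet", "singlet"),
                             levels = c("singlet", "doublet"))
)

keep     <- cd$scDblFinder.class == "singlet"  # logical vector, one per cell
singlets <- cd[keep, ]
rownames(singlets)
```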