Finding the right threshold for `call_doublets`

paulbrodersen commented 4 years ago

Hi @swolock, beautiful package: everything just runs out of the box, the paper and code are very readable, and, crucially, the doublet scores look pretty good (samples with high scores cluster in distinct regions of the UMAP manifold). However, your automated way of setting the threshold based on the scores of simulated doublets seems to be fairly permissive, both in my own data and in others'. Furthermore, in the paper (at least the bioRxiv preprint), you seem to choose the threshold by eye yourself.

Personally, I suspect that scikit-image's `threshold_minimum` isn't doing you any favours, and I wonder whether other approaches might work better. Before I start trying a bunch of stuff, I wanted to ask:

1. Would you be willing to share what approaches you have tried so far?
2. Do you have any test data sets in a readily available format that you find particularly useful for evaluating other approaches? In particular, are there any data sets with independent confirmation of doublets on which scikit-image's function fails severely?
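
To make "permissive" concrete, one way to compare candidate thresholds is to look at two fractions for each: the share of observed cells called doublets, and the share of simulated doublets that score above the threshold. The snippet below is only a sketch in plain Python. The function name `threshold_summary` and the toy scores are made up for illustration and are not part of Scrublet.

```python
from typing import Sequence, Tuple


def threshold_summary(
    obs_scores: Sequence[float],
    sim_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (detected_rate, detectable_fraction) for one candidate threshold.

    detected_rate: fraction of observed cells scoring above the threshold.
    detectable_fraction: fraction of simulated doublets scoring above it.
    Hypothetical helper for illustration; not a Scrublet function.
    """
    if not obs_scores or not sim_scores:
        raise ValueError("both score lists must be non-empty")
    detected_rate = sum(s > threshold for s in obs_scores) / len(obs_scores)
    detectable_fraction = sum(s > threshold for s in sim_scores) / len(sim_scores)
    return detected_rate, detectable_fraction


# Toy scores (invented): ten observed cells and ten simulated doublets.
obs = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.35, 0.60, 0.70, 0.80]
sim = [0.10, 0.30, 0.40, 0.55, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]

for t in (0.25, 0.5):
    detected, detectable = threshold_summary(obs, sim, t)
    print(f"threshold={t}: detected rate={detected}, detectable fraction={detectable}")
```

If a threshold picked this way looks better than the automatic one, it can be applied by passing it explicitly, e.g. `scrub.call_doublets(threshold=0.25)`.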

If I do come up with anything useful, I will make a PR, scout's honour.

swolock commented 4 years ago

Hi @paulbrodersen, thanks for the kind words and for your interest in improving the automated doublet calling (and my apologies for the severely delayed response – life has moved on a bit, but I would like to maintain and potentially improve Scrublet as long as it's useful).

I agree that Scrublet would greatly benefit from better automated thresholding, especially when it comes to projects involving many samples. I did try other thresholding methods from scikit-image and didn't have better luck with any of them. I also briefly thought about trying to incorporate the expected doublet rate more directly into the threshold setting (in theory this is an upper bound on the detected doublet fraction) but never implemented anything. If you have other ideas, I would be excited to chat about them and possibly help with trying them out.

I haven't played around with additional data sets with independent doublet identification, but it would definitely be worth doing now that there are more of them (demuxlet and related methods, the many flavors of cell hashing). If I get back to working on Scrublet and find any particularly useful data sets, I'll be sure to let you know.
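
The idea of using the expected doublet rate as an upper bound on the detected doublet fraction was never implemented, so the following is only a sketch of one way it could look, not anything Scrublet does. The function name `rate_capped_threshold` and the toy scores are hypothetical. The sketch assumes that no more than `expected_doublet_rate` of the observed cells should be called doublets, and it only ever raises an automatic threshold, never lowers it.

```python
import math
from typing import Optional, Sequence


def rate_capped_threshold(
    obs_scores: Sequence[float],
    expected_doublet_rate: float,
    auto_threshold: Optional[float] = None,
) -> float:
    """Threshold that calls at most `expected_doublet_rate` of cells doublets.

    Hypothetical helper for illustration; not part of Scrublet.
    Cells are called doublets when score > threshold (strictly greater).
    """
    if not obs_scores:
        raise ValueError("obs_scores must be non-empty")
    if not 0 <= expected_doublet_rate < 1:
        raise ValueError("expected_doublet_rate must be in [0, 1)")

    ranked = sorted(obs_scores, reverse=True)
    # Largest number of doublet calls the assumed upper bound allows.
    max_calls = math.floor(expected_doublet_rate * len(ranked))
    # Using the (max_calls + 1)-th highest score as the threshold leaves
    # at most max_calls scores strictly above it, even with ties.
    cap = ranked[max_calls]

    # Never go below an automatic threshold if one was supplied.
    return cap if auto_threshold is None else max(cap, auto_threshold)


# Toy observed scores (invented for illustration).
obs_scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.35, 0.60, 0.70, 0.80]

capped = rate_capped_threshold(obs_scores, expected_doublet_rate=0.2, auto_threshold=0.25)
n_calls = sum(s > capped for s in obs_scores)
print(f"capped threshold={capped}, doublet calls={n_calls}")
```

Taking the maximum of the two thresholds is a design choice: a permissive automatic threshold gets tightened, while a stricter one is left alone. Whether the expected rate is a trustworthy bound for a given sample is a separate question.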

paulbrodersen commented 4 years ago

Hi @swolock

No apologies needed. I also struggle to maintain all the code that I have released into the wild.

Re 1: I have another data set coming up shortly. I will revisit the doublet threshold calling then and if I come up with something, I will let you know. Might be a little while though.

Re 2: Thanks for the pointer to demuxlet. I had not come across that paper yet (I am still pretty new to RNA-seq).

I will close the issue for now to keep your issue tracker clean, and reference it if I make a PR.
