plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0

Request for clarification of dbr and dbr.sd in scDblFinder 1.15.1 #80

Closed aghr closed 1 year ago

aghr commented 1 year ago

Dear scDblFinder Team,

Could you please help me clarify the usage of the parameters dbr and dbr.sd of the function scDblFinder()?

  1. dbr: Setting dbr=0.01 reflects the assumption that 1% of cells are doublets in a data set of 1k cells. The user should set dbr according to this scheme, i.e. relative to 1k cells. scDblFinder would then increase dbr internally if the data set at hand consists of many more cells. This adaptation of dbr happens automatically. Is that right?
  2. dbr.sd: In the help message of scDblFinder() I find: "Set to dbr.sd=0 to disable." The GitHub README.md reads: "If you are unsure about the doublet rate, set dbr.sd=1 and the thresholding will be entirely based on the misclassification rates." Both statements seem to describe disabling dbr.sd. Can the user disable dbr.sd by setting dbr.sd=0, by setting dbr.sd=1, or either way?

Many thanks. Andre

plger commented 1 year ago

Hi,

  1. If you don't set dbr, then internally it will be set based on the number of cells and the 1%/1k cells rule. However, if you set dbr manually, then this rate will be used as is, i.e. it won't be scaled with the number of cells.
  2. You're right, that was ambiguous; I've now updated the help to clarify this. Setting dbr.sd=0 will disable the uncertainty around the doublet rate, while setting dbr.sd=1 will increase the uncertainty to the point of disabling the doublet rate altogether (thus letting the thresholding be entirely driven by the misclassification of artificial doublets).
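
To make point 1 concrete, here is a minimal arithmetic sketch of the "1% per 1k cells" rule as described in this reply. It is not the package's internal code: the function names `default_dbr` and `effective_dbr` are hypothetical, and the sketch assumes the rule scales linearly with the number of cells.

```python
from typing import Optional


def default_dbr(n_cells: int, rate_per_1k: float = 0.01) -> float:
    """Expected doublet rate under the 1%-per-1000-cells rule (assumed linear)."""
    return rate_per_1k * n_cells / 1000


def effective_dbr(n_cells: int, dbr: Optional[float] = None) -> float:
    """Rate that would be used: the rule-based default when dbr is unset,
    otherwise the user-supplied value taken as is (no scaling by n_cells)."""
    if dbr is None:
        return default_dbr(n_cells)
    return dbr


if __name__ == "__main__":
    n = 5000  # illustrative cell count, not taken from any real data set
    print("dbr unset:", effective_dbr(n))
    print("dbr=0.01 :", effective_dbr(n, dbr=0.01))
```

Under these assumptions, leaving dbr unset for 5,000 cells gives a rate of about 5%, whereas passing dbr=0.01 keeps the rate at 1% regardless of the number of cells.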

Hope this helps, plger

aghr commented 1 year ago

Thank you very much. I have another related question regarding your point 1. Wouldn't that algorithm run into problems with very large data sets, say more than 100k cells, leading to dbr values greater than 1 (100%)? I expect such data sets to become common at some point. 10X announced a 1.3-million-cell data set in 2017.

Thanks a lot again. Andre

plger commented 1 year ago

Such large datasets are produced in multiple captures, so that each capture has only 12k cells or so. As indicated in the documentation, different captures should be processed separately in scDblFinder, for example using the samples argument, because the number of cells loaded into the machine in a given capture is the actual determinant of the expected number of doublets.
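
The difference between per-capture and pooled processing can be illustrated with a short arithmetic sketch. It assumes the 1%-per-1k-cells rule scales linearly and uses made-up capture sizes of 12k cells each; the helper `rule_based_rate` and the capture names are hypothetical and not part of scDblFinder.

```python
def rule_based_rate(n_cells: int, rate_per_1k: float = 0.01) -> float:
    """Doublet rate implied by the 1%-per-1000-cells rule (assumed linear)."""
    return rate_per_1k * n_cells / 1000


# Hypothetical experiment: ten captures of roughly 12k cells each.
captures = {f"capture_{i}": 12_000 for i in range(1, 11)}

# Processing each capture separately: the rate depends on that capture's size.
per_capture = {name: rule_based_rate(n) for name, n in captures.items()}

# Pooling all cells as if they came from one capture inflates the rate.
pooled = rule_based_rate(sum(captures.values()))

if __name__ == "__main__":
    for name, rate in per_capture.items():
        print(f"{name}: {rate:.2f}")
    print(f"pooled (incorrect): {pooled:.2f}")
```

With these illustrative numbers each capture gets a rate of about 12%, while treating the 120k cells as a single capture would imply a rate above 100%, which is why captures should be processed separately (in scDblFinder, for example via the samples argument mentioned above).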

plger commented 1 year ago

If this answered your question, please close the issue. Best,