hms-dbmi-cellenics / issues

This repository is used to report and track issues
1 stars 0 forks source link

[BUG] scDblFinder incorrectly specifies artificial doublets #2

Closed alexvpickering closed 2 years ago

alexvpickering commented 2 years ago

Reported issue has been fixed: https://github.com/plger/scDblFinder/issues/57

This fixes the error in scDblFinder for the sample that you provided me @ogibson-biomage. Please let me know if you know of additional samples that failed so I can check them as well.

Not released yet AFAIK but let's wait a couple days. @plger mentioned he would push to BioC release after checks pass (thanks! :tada:) so should build on Bioconductor end within 48 hours ...

plger commented 2 years ago

It's been pushed to BioC devel, so it should normally be available from there in a couple of days. (Unclear to me why the discussion is of 'failed' samples, this shouldn't prevent the function for working and only has a very effect on classification)

ogibson commented 2 years ago

That's great, so this means we shouldn't need to iterate through the UMI thresholds in order to get these samples to work? @plger, just to clarify, we have a sample that causes scDblFinder to fail, and Alex has tracked the bug down to this issue. Hopefully this will be fixed once it's released!

alexvpickering commented 2 years ago

It's been pushed to BioC devel, so it should normally be available from there in a couple of days. (Unclear to me why the discussion is of 'failed' samples, this shouldn't prevent the function for working and only has a very effect on classification)

@plger The sample in question failed scDblFinder due to the call to cxds2 returning NaN's (from 0 divided by 0; see below):

https://github.com/plger/scDblFinder/blob/8a461d3e8c6960786414e4c3642fe0991ed6f090/R/misc.R#L316

...
s <- -Matrix::colSums(x * (S %*% x))
s <- s - min(s)
s/max(s)

It happens that correctly specifying whichDbls prevents this for the sample in question.

@ogibson had found another workaround, which was to increase our minimum UMI threshold for excluding droplets from 200 (based on warning threshold from scDblFinder:::.checkSCE) to 500 (relevant line here).

Failures for other samples have similarly been resolved by increasing the UMI threshold ... hopefully this will resolve the issue for those samples as well - if you suspect that to be the case or not @plger would love to hear

plger commented 2 years ago

I think the whichDbls change fixes your problem only accidentally. cxds2 is perfectly fine running with an empty whichDbls (that's what happens by default).

Your problem appears to be that, by default, cxds2 uses the top 500 features, and with the >200 UMIs filtering you have cells that don't have any reads in any of those 500 features. If the whichDbls fixes it, then I assume it's because it happens to exclude these cells (I guess you're providing knownDoublets?). But the bigger problem, I think, is droplets that are near-empty.

I'll change cxds2 so that at least that part is tolerant to empty cells, although personally I wouldn't trust droplets with 200 UMIs. If you continue working with cells with such low library sizes, however, it's likely that you'll encounter similar problems elsewhere in the scDblFinder method. If this is the case, one solution would be to select the features yourself (e.g. by subsetting the object), ensuring that all cells have some reads in some features. This being said, I still think you'd be better off filtering these cells out, their normalization is bound to be very inaccurate and they're not bringing much in...

plger commented 2 years ago

actually the only way I'm able to get cxds2 to produce NAN values is if there are NA values in the counts... @alexvpickering if you have a minimal example let me know!