plger / scDblFinder

Methods for detecting doublets in single-cell sequencing data
https://plger.github.io/scDblFinder/
GNU General Public License v3.0

Miscellaneous clustering observations #26

Closed LTLA closed 4 years ago

LTLA commented 4 years ago

It seems that you could replace

https://github.com/plger/scDblFinder/blob/6bfd3b0ca74060c3af732e0abede25a77df3cc20/R/clustering.R#L45-L46

with something like:

x <- t(assay(sumCountsAcrossCells(x, k, average=TRUE)))

using scuttle's sumCountsAcrossCells function. This implements parallelization if required, and is also safer when x does not have any names, in which case names(k) is probably not going to do the right thing.
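A minimal sketch of that call, assuming scuttle is installed; mockSCE() is just a stand-in dataset, and the cluster labels here are invented for illustration:

```r
library(scuttle)

sce <- mockSCE()   # toy SingleCellExperiment shipped with scuttle, stand-in for real data
k <- sample(paste0("cluster", 1:3), ncol(sce), replace=TRUE)

# Average the counts within each cluster; pass BPPARAM= for parallelization.
agg <- sumCountsAcrossCells(sce, k, average=TRUE)

# Transpose so rows are clusters and columns are genes, matching the layout of x above.
profiles <- t(assay(agg))
```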

Also, scran::buildKNNGraph can be swapped for the lower-level bluster::makeKNNGraph, which does the same thing but doesn't need the d= and transposed= arguments because your input should directly match up to what it wants. Again, this function can be augmented with parallelization via BPPARAM=, if one so pleases.
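A sketch of the bluster swap on made-up PCA coordinates; the exact argument defaults are taken from bluster's documented interface:

```r
library(bluster)
library(BiocParallel)

# Made-up input: 50 cells x 10 dimensions, already in the orientation makeKNNGraph expects.
pcs <- matrix(rnorm(500), ncol=10)

# Returns an igraph object; BPPARAM= enables parallel neighbor detection.
g <- makeKNNGraph(pcs, k=10, BPPARAM=SerialParam())
```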

plger commented 4 years ago

good idea, thanks, will do that in the next days!

plger commented 4 years ago

So the lines you suggested replacing with sumCountsAcrossCells were not summing counts but averaging PCA values. For the moment I've just replaced names with indices. (I changed the other point to a direct bluster call.)

LTLA commented 4 years ago

Didn't notice they were PCA values - in that case, you could shorten it down to:

#' @importFrom DelayedArray rowsum DelayedArray
y <- rowsum(DelayedArray(x), k) # DelayedArray() is only necessary if the reddims might not be ordinary matrices.
y / as.integer(table(k)[rownames(y)]) # subsetting is probably not necessary, but just to be safe.

This should also be more efficient, based on the comments in the ?rowsum examples.
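To see why this gives per-cluster averages, here is the same trick on a toy matrix (hypothetical data, base R only):

```r
# Toy PCA-like matrix: 6 cells x 3 components
x <- matrix(seq_len(18), nrow=6)
rownames(x) <- paste0("cell", 1:6)
k <- c(1, 1, 2, 2, 3, 3)   # cluster assignment per cell

y <- rowsum(x, k)                              # per-cluster column sums
avg <- y / as.integer(table(k)[rownames(y)])   # divide by cluster sizes -> per-cluster means

# Matches a manual per-cluster column mean:
stopifnot(all.equal(avg["2", ], colMeans(x[k == 2, ])))
```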

plger commented 4 years ago

will do. Why the DelayedArray(x)? If we import it, shouldn't R use the DelayedArray rowsum method anyway iff that's what x is?

LTLA commented 4 years ago

If you believe that your reduced dimensions will only be ordinary matrices, then you can get rid of the DelayedArray() (plus the import statements). But if you think that your reduced dimensions might be other matrix-like objects, like sparse matrices or whatever, then the wrapping in a DelayedArray() (plus the imports) will ensure that those are handled correctly. base::rowsum won't handle such weird objects correctly, and DelayedArray::rowsum will only do so if the object is wrapped in a DA.

plger commented 4 years ago

will only do so if the object is wrapped in a DA.

That's the bit I don't get and would like to understand. This all works fine:

library(Matrix)        # for dgCMatrix
library(DelayedArray)

m <- matrix(1:12, nrow=4)
rowsum(m, c(1,1,2,2))
m2 <- as(m, "dgCMatrix")
rowsum(m2, c(1,1,2,2))
m3 <- DelayedArray(m)
rowsum(m3, c(1,1,2,2))

(Also works fine with DelayedArray::rowsum...)

I'll have to make the whole scDblFinder DelayedArray-friendly...

LTLA commented 4 years ago

Ah. Looks like DA added a rowsum.dgCMatrix, for some reason. Probably because I nagged about it for so long.

Note that other matrices won't work, though:

m2.1 <- as(m, 'dgTMatrix')
rowsum(m2.1, c(1,1,2,2))
## Error in rowsum.default(m2.1, c(1, 1, 2, 2)) : 'x' must be numeric

Of course, none of this is relevant if you think that the reduced dims are going to be ordinary matrices, which they usually are. The DA stuff is primarily important for the assays. A good test is to see whether you can run scDblFinder on one of the small HDF5-backed PBMC datasets in TENxPBMCData without loading everything into memory.

plger commented 4 years ago

I'm pretty sure it won't run without loading everything into memory at the moment; I haven't done anything yet to make it DelayedArray-friendly. Now I'm optimizing a few details, and DA will be next...