Closed: LTLA closed this issue 4 years ago.
Good idea, thanks, will do that in the next few days!
So the lines you suggested replacing with `sumCountsAcrossCells` were not summing counts but averaging PCA values. For the moment I just replaced names with indexes. (I changed the other point to a direct `bluster` call.)
Didn't notice they were PCA values - in that case, you could shorten it down to:

```r
#' @importFrom DelayedArray rowsum DelayedArray
y <- rowsum(DelayedArray(x), k) # DelayedArray is only necessary if the reddims might not be ordinary matrices.
y / as.integer(table(k)[rownames(y)]) # subsetting is probably not necessary, but just to be safe.
```

This should be more efficient too, based on the comments in the `?rowsum` examples.
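A minimal base-R sketch of that averaging idiom, using a made-up toy matrix in place of the reduced dimensions (all names here are illustrative, not from the package):

```r
# Toy data: 4 cells x 3 dimensions, with a cluster assignment per cell.
x <- matrix(1:12, nrow = 4)
k <- c("a", "a", "b", "b")

y <- rowsum(x, k)                               # per-cluster column sums
means <- y / as.integer(table(k)[rownames(y)])  # divide by cluster sizes

# Same result as averaging each cluster's rows directly:
means["a", ]  # equals colMeans(x[k == "a", , drop = FALSE])
```

The `table(k)[rownames(y)]` subsetting just guarantees that the cluster sizes are divided in the same row order that `rowsum` produced.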
Will do. Why the `DelayedArray(x)`? If we import it, shouldn't R use the DelayedArray `rowsum` method anyway iff that's what `x` is?
If you believe that your reduced dimensions will only be ordinary matrices, then you can get rid of the `DelayedArray()` (plus the import statements). But if you think that your reduced dimensions might be other matrix-like objects, like sparse matrices or whatever, then wrapping them in a `DelayedArray()` (plus the imports) will ensure that they are handled correctly. `base::rowsum` won't handle such weird objects correctly, and `DelayedArray::rowsum` will only do so if the object is wrapped in a DA.
That's the bit I don't get and would like to understand. This all works fine:

```r
m <- matrix(1:12, nrow=4)
rowsum(m, c(1,1,2,2))

m2 <- as(m, "dgCMatrix")
rowsum(m2, c(1,1,2,2))

m3 <- DelayedArray(m)
rowsum(m3, c(1,1,2,2))
```

(Also works fine with `DelayedArray::rowsum(...)`)
I'll have to make the whole scDblFinder DelayedArray-friendly...
Ah. Looks like DA added a `rowsum.dgCMatrix` method, for some reason. Probably because I nagged about it for so long.
Note that other matrices won't work, though:

```r
m2.1 <- as(m, 'dgTMatrix')
rowsum(m2.1, c(1,1,2,2))
## Error in rowsum.default(m2.1, c(1, 1, 2, 2)) : 'x' must be numeric
```
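As an aside, one possible workaround for such unsupported sparse classes (a sketch using only base R and the Matrix package; `TsparseMatrix` is the modern spelling of the triplet class) is to densify before calling `rowsum`, which is fine for small inputs but materializes the whole matrix in memory:

```r
library(Matrix)

m <- matrix(1:12, nrow = 4)
m.t <- as(m, "TsparseMatrix")  # triplet-form sparse matrix (dgTMatrix)

# base::rowsum() rejects m.t directly, but densifying first works,
# at the cost of holding the full dense matrix in memory:
rowsum(as.matrix(m.t), c(1, 1, 2, 2))
```

For genuinely large or out-of-memory data, wrapping in a `DelayedArray` (as above) is the better option, since it avoids the densification entirely.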
Of course, none of this is relevant if you think that the reduced dims are going to be ordinary matrices, which they usually are. The DA stuff is primarily important for the assays. A good test is to see whether you can run scDblFinder on one of the small HDF5-backed PBMC datasets in TENxPBMCData without loading everything into memory.
I'm pretty sure it won't run without loading everything into memory at the moment, since I haven't done anything yet to make it DelayedArray-friendly. Right now I'm optimizing a few details, and DA will be next...
It seems that you could replace
https://github.com/plger/scDblFinder/blob/6bfd3b0ca74060c3af732e0abede25a77df3cc20/R/clustering.R#L45-L46
with something like a call to scuttle's `sumCountsAcrossCells` function. This implements parallelization if required, and is also safer when `x` does not have any names, in which case `names(k)` is probably not going to do the right thing.

Also, `scran::buildKNNGraph` can be swapped for the lower-level `bluster::makeKNNGraph`, which does the same thing but doesn't need the `d=` and `transposed=` arguments, because your input should directly match up to what it wants. Again, this function can be augmented with parallelization via `BPPARAM=`, if one so pleases.
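To illustrate the naming pitfall mentioned above, here is a base-R sketch with made-up toy data: when `x` has no rownames, subsetting by `names(k)` errors out, whereas integer indices always work.

```r
# Toy data: a matrix with NO rownames, and a named cluster vector
# (all object names here are illustrative).
x <- matrix(1:12, nrow = 4)
k <- setNames(c(1, 1, 2, 2), c("c1", "c2", "c3", "c4"))

# Index-based subsetting always works:
x[1:2, ]

# Name-based subsetting does not: with no rownames to match against,
# character indices raise a "subscript out of bounds" error.
res <- try(x[names(k)[1:2], ], silent = TRUE)
inherits(res, "try-error")  # TRUE
```

This is why falling back to integer indexes (or a function like `sumCountsAcrossCells` that handles the matching internally) is the safer route when rownames aren't guaranteed.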