theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/
Other
145 stars 27 forks source link

Convert dgRMatrices to dgCMatrices on the R side #34

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

Noticed that readH5AD will sometimes (maybe always?) emit dgRMatrix objects. These are pretty unusual (and useless), we should coerce them into dgCMatrix objects instead.

I notice that the sparse format is already column sparse compressed, so it may be more efficient to avoid a round-trip through Python (and converting to/from the row-format) for the assay data (a la the proposed solution to #32).

LTLA commented 3 years ago

My second point is moot, the H5AD sparse format is stored as row-based, so a conversion would be required either way.

ivirshup commented 3 years ago

anndata can write either, it will be noted which one it is in the group attributes, and they do have a use. If you're interested in reading a single gene for all cells (say, to display them) then it's a useful way to store them. Other cases include storing multi label classes, where it's often efficient to use each label as a mask over the dataset.

LTLA commented 3 years ago

Yes, but dgRMatrix objects are particularly useless, because Matrix will happily convert them into dgCMatrix objects before applying various operations like rowSums and colSums, etc.

ivirshup commented 3 years ago

It will convert them before rowSums? That seems pull request worthy.

I think they can still be worth it for the speed of indexing operations (assuming those are implemented efficiently).

LTLA commented 3 years ago

To be precise: for rowSums, the conversion is done in a light manner by simply changing the class and flipping the dimensions; then you can compute the row sums of the original as the column sums of the converted object, without any re-indexing involved. Not that it really matters, because you could compute the row sums of a dgCMatrix in a single pass without any binary searches, so here it's just a matter of convenience to avoid a redundant implementation.

However, in other cases (e.g., matrix multiplication), dgRMatrix objects are explicitly converted to dgCMatrix objects. I'd imagine there to be a lot more operations that do this in Matrix, and ?"dgRMatrix-class" does strongly recommend use of dgCMatrix objects, so it seems best to not rock the boat here.

I think they can still be worth it for the speed of indexing operations (assuming those are implemented efficiently).

Perhaps, if you know that your subsequent access pattern is on a row-by-row basis. This can be guaranteed in controlled settings like interactive applications that pull out one or two genes at a time (e.g., to color a t-SNE), but not in a typical analysis where the best access pattern changes across steps and/or depending on which side is used for matrix multiplication (e.g., in rSVD).

Mind you, the same problem applies for dgCMatrix with respect to guaranteed column access, but at least we don't have to worry about proper support for operations.

Incidentally, optimizing the read access is the motivation behind https://github.com/LTLA/DualIndexedMatrix.

ivirshup commented 3 years ago

you could compute the row sums of a dgCMatrix in a single pass

Single pass, but poor memory locality/ SIMD friendliness. Good to know that there is an efficient transpose though, I wasn't sure if this would be the case given R's memory semantics.

However, in other cases (e.g., matrix multiplication), dgRMatrix objects are explicitly converted to dgCMatrix

I wonder how common this is. I would hope dgRMatrix ⋅ dgRMatrix or dgRMatrix + dgRMatrix would still be efficient. Unfortunately I have completley forgotten how to look up S4 method definitions and advanced R has been rewritten with fewer s4 examples, so I'm not even sure where to look anymore.

Is there a way to look up what method would be called for a given signature? I'm trying to find which method is called for rowSums on a dgR matrix. In julia, this would be something like methods(sum, Tuple{SparseMatrixCSC})) or which(sum, Tuple{SparseMatrixCSC}).

Perhaps, if you know that your subsequent access pattern is on a row-by-row basis.

In scanpy, I would only put a compressed-variable-axis matrix in an object if I expected that to be the access pattern.

Incidentally, optimizing the read access is the motivation behind https://github.com/LTLA/DualIndexedMatrix.

Ahh, cool. This is essentially keeping a CSC and CSR copy of the matrix, and using whichever is faster based on the index passed?

I've discussed something similar with cellxgene people in the past.

LTLA commented 3 years ago

Is there a way to look up what method would be called for a given signature?

library(Matrix)
selectMethod("%*%", c("dgRMatrix", "matrix"))

Gives me:

Method Definition:

function (x, y) 
.Call(Csparse_dense_prod, as(x, "CsparseMatrix"), y, " ")
<bytecode: 0x560053290798>
<environment: namespace:Matrix>

Signatures:
        x              y       
target  "dgRMatrix"    "matrix"
defined "sparseMatrix" "matrix"

So basically, all sparse matrices (including triplet-form) are converted into dgCMatrix objects before multiplication proceeds.

In scanpy, I would only put a compressed-variable-axis matrix in an object if I expected that to be the access pattern.

Personally, I find it quite difficult to predict the access pattern. I used to make similarly bold claims for pure row- (or column) chunks with dense HDF5 datasets, which turned out to be a mistake when those files encountered users who were not myself. Or indeed, a week later, I would be wondering why past me didn't have the foresight to see I needed the other access pattern.

This is essentially keeping a CSC and CSR copy of the matrix, and using whichever is faster based on the index passed?

Yep. Or any "fast row" and "fast column" implementation, e.g., row and column HDF5 chunks.

ivirshup commented 3 years ago

Ah, selectMethod, thanks!

Personally, I find it quite difficult to predict the access pattern.

Also true. But isn't that what this issue is about? To me, the user passed a matrix of a particular format. I assume if they cared about performance it's in the format they want. Why not be as faithful about the conversion as possible?

I would say, for cases like labelings, I know downstream functions will convert it back to variable-axis-sparse (so, CSC for us) if it isn't. So I'm fine with methods adding these to the object.

LTLA commented 3 years ago

The most obvious reason is that the dgRMatrix is clearly a second-class citizen for downstream operations. Sure, if all you care about is extracting rows, it may make sense to preserve the row format. But users will be expecting to use the SCE generated from, e.g., readH5AD for further analyses, and such cases, dgCMatrixes are better supported. If we kept dgRMatrixes, there is a good chance that they would get implicitly converted into dgCMatrixes in each operation, over and over. For example, all my sparse-optimized C++ code only considers dgCMatrix objects because they are the dominant sparse species.

If we must retain the original class, the solution is to wrap it in a DelayedArray, which uses block processing to avoid the conversion to a dgCMatrix. This adds another layer of indirection, though, and even then, I'm not sure how well DelayedArray recognizes dgRMatrix objects as being sparse. (Just checked, actually, and it doesn't.)

Simply put, dgRMatrix objects are just not commonly used, and as much as I would prefer to defer responsibility for the format choice to the user, this is a case where we really need to align to the expectations of the rest of the ecosystem that consumes the SCEs that zellkonverter will produce.

hpages commented 3 years ago

I'm not sure how well DelayedArray recognizes dgRMatrix

Now it does :-) see https://github.com/Bioconductor/DelayedArray/commit/6d02fbc1492bf64d73a361a8e40f6e42b058d226

LTLA commented 3 years ago

Yes, I saw your commit on my feed. I was wondering whether you were psychic or had just stumbled onto this thread.

lazappi commented 3 years ago

I actually got bitten by this yesterday. zellkonverter gave me back a dgRMatrix which I transposed and got a dgTMatrix which was converted back to some scipy Python format which the AnnData write_h5ad() function didn't like. Converting to dgCMatrix at the start fixed it.

This is mostly my fault for doing kinda dodgy stuff but it does show there can be unexpected issues with dgRMatrix I guess.

ivirshup commented 3 years ago

@LTLA, perhaps a keyword argument for conversion? And a warning if no argument was passed but you think conversion should happen?

Simply put, dgRMatrix objects are just not commonly used, and as much as I would prefer to defer responsibility for the format choice to the user

I think this is the case in AnnData as well. It's largely because of how rare these matrices are that I would assume their presence was intentional.

lazappi commented 3 years ago

Yeah I think the problem here is that something to do with how reticulate handles the matrix conversion changes the format (possibly because they are transposed). So even if you have the standard format in the original AnnData you end up with something non-standard after conversion.

LTLA commented 3 years ago

@LTLA, perhaps a keyword argument for conversion? And a warning if no argument was passed but you think conversion should happen?

Hm. That's one possibility, and we already provide backed=TRUE as an option, so I suppose we could also do something like to_csparse= that can be turned off if the user knows better. I would prefer to avoid too much chatter in the output, though.

That said, the latest DelayedArray support for dgRMatrix objects may provide a solution to all our problems. This will:

This will require some testing. I think we should set up some longtests with as many exotic H5AD files as we can get our grubby mitts on, and see what explodes and what doesn't. Would also be a good way to test the native reader as well.

lazappi commented 3 years ago

This will require some testing. I think we should set up some longtests with as many exotic H5AD files as we can get our grubby mitts on, and see what explodes and what doesn't. Would also be a good way to test the native reader as well.

Yes the long tests would be ideal. I was supposed to set up an ExperimentHub package to store the data but I didn't get around to it for this release.

LTLA commented 3 years ago

No need to wait for the EHub package, at least not for the time being. Just have the longtests pull from public sources with:

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
path <- bfcrpath(bfc, "https://some_url_to_h5ad_file")

# testing starts here.
zellkonverter::readH5AD(path) # etc.

This will cache the file locally so that future tests will automatically skip the download.

mathewchamberlain commented 1 year ago

Just wanted to leave a pretty simple solution. Lets say D is the h5ad object you just read in, and D@assays@data$X is your dgRMatrix. To convert it to dgCMatrix, just multiply the matrix by the identity matrix:

x = Matrix(data = 0, ncol = ncol(D), nrow = ncol(D)) diag(x) <- 1 D@assays@data$X <- D@assays@data$X %*% x

class(D@assays@data$X) will now be dgCMatrix

Marwansha commented 6 months ago

i am having an issue after readH5AD where my matrix is Delayed Matrix ,what is the most efficient way to convert it into a dgCMatrix as when i try using the code below it give an error

> sce@assays@data@listData$counts%>%str()
Formal class 'DelayedMatrix' [package "DelayedArray"] with 1 slot
  ..@ seed:Formal class 'DelayedSetDimnames' [package "DelayedArray"] with 2 slots
  .. .. ..@ dimnames:List of 2
  .. .. .. ..$ : int -1

> assays(sce)$counts_dgC <- as(assays(sce)$counts, "dgCMatrix")
Error in rbind(...) : negative extents to matrix
hpages commented 6 months ago

@mathewchamberlain or you can simply coerce the dgRMatrix to dgCMatrix with as( , "CsparseMatrix").

@Marwansha Yes, as( , "dgCMatrix") is the way to convert a DelayedMatrix object to dgCMatrix. Please realize that you are reporting a problem in an issue that was closed 3 years ago and has not much to do with your problem, so the chances that this will be ignored are pretty high. My advice is that you open a new issue with a full reproducible example and the output of your sessionInfo(). Also if you're going to show the output of str() on your DelayedMatrix object, please don't truncate it.