Add support for sparse matrices

Thanks for this amazing package! I have used it in so many projects already!

I just wanted to check if there are plans to support sparse matrices as well any time soon?

Hello, thank you for raising the issue.

I am not very familiar with sparse matrices. Could you provide a bit more context - what kind of support is needed for them? Are the current functions not working when used on sparse matrices?

Thanks for your quick reply!

Sparse matrices are often used in R to represent large amounts of data that may contain a fair number of zeros, such as those derived from Poisson, negative binomial and related distributions.

The Matrix package (link) allows for several types of sparse matrices, and the most common ones in my field (that is, modeling single cells in biological systems) are dgCMatrix and lgCMatrix.

The main advantage of using sparse matrices is the reduction in memory usage. This is a good introduction to help make oneself more familiar with the topic: https://slowkow.com/notes/sparse-matrix/

Here is some code to create sparse matrices:

library(Matrix)

(mat <- base::matrix(data = 0:9, nrow = 2))

(sparse_mat <- as(object = mat, Class = "dgCMatrix"))

(sparse_lgl_mat <- sparse_mat > 5)

Here is some code to test matrixTests on dense and sparse matrices:

library(matrixTests)

X <- base::matrix(data = rnbinom(n = 1e4,
                                 size = 1,
                                 mu = 5),
                  nrow = 10)
hist(x = X[1,], breaks = 100)

Y <- base::matrix(data = rnbinom(n = 1e4,
                                 size = 1,
                                 mu = 10),
                  nrow = 10)
hist(x = Y[1,], breaks = 100)

matrixTests::row_wilcoxon_twosample(x = X,
                                    y = Y)

sparse_X <- as(object = X, Class = "dgCMatrix")
sparse_Y <- as(object = Y, Class = "dgCMatrix")

matrixTests::row_wilcoxon_twosample(x = sparse_X,
                                    y = sparse_Y)

In terms of which kind of support is needed for sparse matrices, I would refer to the great BioConductor sparseMatrixStats package (link), which expands the also great matrixStats package (link), that in turn expands the Matrix package mentioned above.

EDIT: I am mentioning these packages just because they may provide useful functions for matrix operations that could be used to support sparse matrices as well.

I will tag the authors of those other two packages, but I can obviously only hope that they might one day find time in their busy schedules to comment if they find it appropriate: @HenrikBengtsson @const-ae

Thanks for an elaborate reply. Here are a few quick thoughts and questions from my side:

Currently users can call row_wilcox_twosample(as.matrix(sparse_X), as.matrix(sparse_Y)), as you will already know.
In addition, maybe matrixTests should convert the inputs to a simple R matrix, where appropriate. I am hesitant to just do as.matrix() on any input. But maybe having a set of classes that will be turned into a simple matrix inside the function is appropriate as a first step. In that case sparse matrices would be converted to simple matrices. No gain in terms of memory usage, but the functions would work.
It would also be possible to handle sparse matrices by having dedicated functions that work on this class. However then matrixTests gains dependencies on all those packages implementing object like that.
More importantly - are the statistical tests implemented here appropriate for sparse matrices? For example, take Andersdon-Darling test for normality. We can spend time and effort for making this test work on sparse matrices, but is there a point in doing that if this test will never be (or should never be) executed on a sparse matrix object.
Finally, and related to the above, I noticed that turning a sparse matrix to a matrix with as.matrix fills it up with zero values. But should the missing values in a sparse matrix be treated as zero or as unknown (NA)? In single cell, as far as I am aware, 0 is there because of technical reasons (insufficient sequencing depth, or failure to capture every sequence from a cell). In that case, following the Anderson-Darling test example, it would be better to remove them first.

Thanks for your elaborate reply as well.

You already know as.matrix on large sparse matrices is slow and very memory hungry.
See above (1).
Yup.
Well, in my opinion, and I don't mean it in an snarky way, that is up to the user to decide. If all statistician would stop building statistical models and tools bc they are afraid of how they one day they might potentially be misused... Do you know how base R tests deal with that? A potential solution in case of an gross misuse (let's say, more than 50% of values are zero in a sparse matrix and the test expects a continuous distribution) maybe a warning could be generated that the results might be unreliable? Does base R have such warnings that could potentially be re-used?
Well, missing values in a sparse matrix are zero, not NA. Many of such zeros are due to under-sampling (EDIT: referring to single cell RNA-seq data), but many also due to what people call "bursts of transcription" from a locus (which I understand to mean that even a "open" locus poised for transcription is probably not continuously producing mRNA molecules and so the pool of mRNA molecules from that locus might be empty at a given time depending on the mRNA degradation rate and its protein half-life). However, my personal opinion is that measured zeros (again, zeros in sparse matrices are zeros and not NA) better be treated as what they are (that is, measured zeros) using the imperfect models of reality that we have at our disposal. If higher sensitivity is needed to fill up the zeros, it is up to the experimentalist to increase sequencing depth and/or apply a different method to increase coverage, as well as perform other assays to test their hypothesis.

PS: You obviously worked sparse matrix in the past. I just don't understand why you would say you "not very familiar" with them. My reply is completely out of place now and I spent more time on it than I would want to admit... PS2: The answer to my first question is also obviously a no, right? It seems obvious that you considered sparse matrices in the past and there are no plans to support them any time soon?

Hello, maintainer of matrixStats here. Thanks for the dicussions.

I'd like to add that, although it seems quick to add support for sparse matrices, the extra burden added can be significantly. There are so many corner cases you need to account for and test for. Having maintained 'matrixStats', I'm still not suprised if someone reports on a corner case of input data that we didn't account for. Writing good tests for numerical methods is tricky and a lot of work. Then there's the extra work coming from additional feature requests that are now possible, because of the expanded scope.

As mentioned, for matrixStats, there's SparseMatrixStats which replicates the matrixStats API for sparse matrices. I'm glad I/we didn't have to maintain that too, because of all the ins and outs that comes with that. Although I'm using sparse matrices too, I wouldn't have had the experience to cover all grounds. I would probably learn a lot along the way, but that would be even more work. We found a nice balance with matrixStats and SparseMatrixStats, where matrixStats sets the de facto API standard and SparseMatrixStats follows. This way I/we developing matrixStats can focus on the core business, but we do have to make sure we don't break anything. Sometimes the tests of SparseMatrixStats caught breaking changes that we would have missed otherwise. It's a dance of tango, and in each iteration things gets better and better, and it benefits the community (often without them noticing).

If all statistician would stop building statistical models and tools bc they are afraid of how they one day they might potentially be misused... Do you know how base R tests deal with that?

From my experience, it's okay to be "experimental" and taking risks when you're starting out, but when you get to a point when there are lots of users and packages depending on your package, you need to start thinking about the impact design or implementation bugs has. If the package provides statistical tools, it's then also likely it's used in science, and then the implications from a mistake in your code can be huge. The risk for some mistakes to go unnoticed for months and years should not be taken lightly. There are high-impact papers out there whose significant results are solely due to bugs. People use these core tools trusting them to work and, sometimes, they're used indirectly through nested package dependencies.

For example, in matrixStats there is colVars() and rowVars(). For performance reasons, a user can pass an already calculated mean estimate via argument center. This was an obvious addition back in the day. However, it turns out you can get grossly different estimates if you don't pass the sample mean estimate, but, say, the median. In turn, this difference depends on which algorithm we use to calculate the variance. After introducing internal validation of the center estimate, we found that were instances out there on CRAN and Bioconductor that used center under the wrong assumptions - done by very experienced users. Because of these type of things, we're looking into pruning and tighten up the API, to avoid loop-hole mistakes slipping through.

So, when we get the privileged to maintain heavily-used, central packages, we need to be careful. It also provides us with means to pay back to the community and the sciences, e.g. in some cases we can add internal assertions checking for user mistakes and that way early on help catch mistakes in long-running scientific projects.

My $.02 of coffee comments

Thank you for the comment, @HenrikBengtsson, I much appreciate you posting and sharing your experience here.

I am trying to leave myself out of the picture and go with what is best for matrixTests. My decision for now is to leave this issue open but set it at low priority. This way matrixTests can decide on some kind of support in the future, and meanwhile its users can find this place and read everything here themselves, or even come up with their own proposals and solutions.

My current (but subject to change) opinion:

I see one way of supporting sparse matrices from within matrixTests and that is by turning all the methods here into generics that dispatch differently in cases when the first argument is a sparse matrix object. Then, when the default dispatch calls (e.g.) matrixStats::rowVars() sparse matrix variants would call sparseMatrixStats::rowVars(), etc. This way implementing sparse matrix support should be easier in matrixTests, compared with Henrik's experience with matrixStats. Of course only because the heavy lifting is already done by matrixStats and sparseMatrixStats. matrixTests would just use already implemented methods, thus being mostly hidden from the internals of sparse matrix object intricacies.

However, I am still reluctant because:

This requires major effort - at this time that effort can be spent elsewhere.
matrixTests would gain a BioConductor dependency - we try to use as few as possible, and avoid BioConductor.
matrixTests have tests that are not suitable for sparse matrices - unclear if those should be implemented at all.

Hence this issue will remain open as a feature request but, for now, will not be prioritized.

karoliskoncevicius / matrixTests

Add support for sparse matrices #34