Open PedroMilanezAlmeida opened 1 year ago
Hello, thank you for raising the issue.
I am not very familiar with sparse matrices. Could you provide a bit more context - what kind of support is needed for them? Are the current functions not working when used on sparse matrices?
Thanks for your quick reply!
Sparse matrices are often used in R to represent large amounts of data that may contain a fair number of zeros, such as those derived from Poisson, negative binomial and related distributions.
The Matrix
package (link) allows for several types of sparse matrices, and the most common ones in my field (that is, modeling single cells in biological systems) are dgCMatrix
and lgCMatrix
.
The main advantage of using sparse matrices is the reduction in memory usage. This is a good introduction to help make oneself more familiar with the topic: https://slowkow.com/notes/sparse-matrix/
Here is some code to create sparse matrices:
library(Matrix)
(mat <- base::matrix(data = 0:9, nrow = 2))
(sparse_mat <- as(object = mat, Class = "dgCMatrix"))
(sparse_lgl_mat <- sparse_mat > 5)
Here is some code to test matrixTests
on dense and sparse matrices:
library(matrixTests)
X <- base::matrix(data = rnbinom(n = 1e4,
size = 1,
mu = 5),
nrow = 10)
hist(x = X[1,], breaks = 100)
Y <- base::matrix(data = rnbinom(n = 1e4,
size = 1,
mu = 10),
nrow = 10)
hist(x = Y[1,], breaks = 100)
matrixTests::row_wilcoxon_twosample(x = X,
y = Y)
sparse_X <- as(object = X, Class = "dgCMatrix")
sparse_Y <- as(object = Y, Class = "dgCMatrix")
matrixTests::row_wilcoxon_twosample(x = sparse_X,
y = sparse_Y)
In terms of which kind of support is needed for sparse matrices, I would refer to the great BioConductor sparseMatrixStats
package (link), which expands the also great matrixStats
package (link), that in turn expands the Matrix
package mentioned above.
EDIT: I am mentioning these packages just because they may provide useful functions for matrix operations that could be used to support sparse matrices as well.
I will tag the authors of those other two packages, but I can obviously only hope that they might one day find time in their busy schedules to comment if they find it appropriate: @HenrikBengtsson @const-ae
Thanks for an elaborate reply. Here are a few quick thoughts and questions from my side:
row_wilcox_twosample(as.matrix(sparse_X), as.matrix(sparse_Y))
, as you will already know.as.matrix()
on any input. But maybe having a set of classes that will be turned into a simple matrix inside the function is appropriate as a first step. In that case sparse matrices would be converted to simple matrices. No gain in terms of memory usage, but the functions would work.as.matrix
fills it up with zero values. But should the missing values in a sparse matrix be treated as zero or as unknown (NA)? In single cell, as far as I am aware, 0 is there because of technical reasons (insufficient sequencing depth, or failure to capture every sequence from a cell). In that case, following the Anderson-Darling test example, it would be better to remove them first.Thanks for your elaborate reply as well.
as.matrix
on large sparse matrices is slow and very memory hungry.PS: You obviously worked sparse matrix in the past. I just don't understand why you would say you "not very familiar" with them. My reply is completely out of place now and I spent more time on it than I would want to admit... PS2: The answer to my first question is also obviously a no, right? It seems obvious that you considered sparse matrices in the past and there are no plans to support them any time soon?
Hello, maintainer of matrixStats here. Thanks for the dicussions.
I'd like to add that, although it seems quick to add support for sparse matrices, the extra burden added can be significantly. There are so many corner cases you need to account for and test for. Having maintained 'matrixStats', I'm still not suprised if someone reports on a corner case of input data that we didn't account for. Writing good tests for numerical methods is tricky and a lot of work. Then there's the extra work coming from additional feature requests that are now possible, because of the expanded scope.
As mentioned, for matrixStats, there's SparseMatrixStats which replicates the matrixStats API for sparse matrices. I'm glad I/we didn't have to maintain that too, because of all the ins and outs that comes with that. Although I'm using sparse matrices too, I wouldn't have had the experience to cover all grounds. I would probably learn a lot along the way, but that would be even more work. We found a nice balance with matrixStats and SparseMatrixStats, where matrixStats sets the de facto API standard and SparseMatrixStats follows. This way I/we developing matrixStats can focus on the core business, but we do have to make sure we don't break anything. Sometimes the tests of SparseMatrixStats caught breaking changes that we would have missed otherwise. It's a dance of tango, and in each iteration things gets better and better, and it benefits the community (often without them noticing).
If all statistician would stop building statistical models and tools bc they are afraid of how they one day they might potentially be misused... Do you know how base R tests deal with that?
From my experience, it's okay to be "experimental" and taking risks when you're starting out, but when you get to a point when there are lots of users and packages depending on your package, you need to start thinking about the impact design or implementation bugs has. If the package provides statistical tools, it's then also likely it's used in science, and then the implications from a mistake in your code can be huge. The risk for some mistakes to go unnoticed for months and years should not be taken lightly. There are high-impact papers out there whose significant results are solely due to bugs. People use these core tools trusting them to work and, sometimes, they're used indirectly through nested package dependencies.
For example, in matrixStats there is colVars()
and rowVars()
. For performance reasons, a user can pass an already calculated mean estimate via argument center
. This was an obvious addition back in the day. However, it turns out you can get grossly different estimates if you don't pass the sample mean estimate, but, say, the median. In turn, this difference depends on which algorithm we use to calculate the variance. After introducing internal validation of the center
estimate, we found that were instances out there on CRAN and Bioconductor that used center
under the wrong assumptions - done by very experienced users. Because of these type of things, we're looking into pruning and tighten up the API, to avoid loop-hole mistakes slipping through.
So, when we get the privileged to maintain heavily-used, central packages, we need to be careful. It also provides us with means to pay back to the community and the sciences, e.g. in some cases we can add internal assertions checking for user mistakes and that way early on help catch mistakes in long-running scientific projects.
My $.02 of coffee comments
Thank you for the comment, @HenrikBengtsson, I much appreciate you posting and sharing your experience here.
I am trying to leave myself out of the picture and go with what is best for matrixTests. My decision for now is to leave this issue open but set it at low priority. This way matrixTests can decide on some kind of support in the future, and meanwhile its users can find this place and read everything here themselves, or even come up with their own proposals and solutions.
My current (but subject to change) opinion:
I see one way of supporting sparse matrices from within matrixTests and that is by turning all the methods here into generics that dispatch differently in cases when the first argument is a sparse matrix object. Then, when the default dispatch calls (e.g.) matrixStats::rowVars()
sparse matrix variants would call sparseMatrixStats::rowVars()
, etc. This way implementing sparse matrix support should be easier in matrixTests, compared with Henrik's experience with matrixStats. Of course only because the heavy lifting is already done by matrixStats and sparseMatrixStats. matrixTests would just use already implemented methods, thus being mostly hidden from the internals of sparse matrix object intricacies.
However, I am still reluctant because:
Hence this issue will remain open as a feature request but, for now, will not be prioritized.
Thanks for this amazing package! I have used it in so many projects already!
I just wanted to check if there are plans to support sparse matrices as well any time soon?