Open paoloinglese opened 2 years ago
Thanks a lot for this great contribution! I will need some time to ensure it doesn't have any negative side effects to (my) workflows and downstream dependencies.
I'm having a look at the errors and fixing them.
Hi,
I've made some modifications to the code.
Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0.
Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix (see sparseMatrix-functions.R
).
In this way, you can continue to keep the zeros and the NAs continue to play the role of missing values.
I've also left the support to dense matrices, so you can easily adapt the code in case you need some legacy support.
I had to write new C++ functions for the colMeans and colMedians (/src/colMediansArma.cpp
), where the zeros of the sparse matrix behave like NAs, as expected.
Please also have a look at the new tests for the sparse matrices. I wish there's no huge mistake.
I've tested the code on a Mac, I couldn't check if it runs on Win and Linux, unfortunately.
Of course, the support to the algebra of the sparseMatrixNA is missing (if you sum a "missing" value and number you get a number instead of the "missing"), so you can have some "weird" behaviours for downstream analysis. But, that would require a new class. Another solution would be dropping the difference between NAs and zeros, in that way the implementation would be straightforward.
Merging #70 (aa59b28) into master (b1ca2ff) will decrease coverage by
0.66%
. The diff coverage is84.05%
.
@@ Coverage Diff @@
## master #70 +/- ##
==========================================
- Coverage 88.44% 87.78% -0.67%
==========================================
Files 81 82 +1
Lines 1411 1514 +103
==========================================
+ Hits 1248 1329 +81
- Misses 163 185 +22
Impacted Files | Coverage Δ | |
---|---|---|
R/as.matrix-functions.R | 80.95% <63.63%> (-19.05%) |
:arrow_down: |
src/colMediansArma.cpp | 73.84% <73.84%> (ø) |
|
R/intensityMatrix-functions.R | 93.33% <75.00%> (-6.67%) |
:arrow_down: |
R/colMedians-functions.R | 100.00% <100.00%> (ø) |
|
R/filterPeaks-functions.R | 100.00% <100.00%> (ø) |
|
R/merge-functions.R | 100.00% <100.00%> (ø) |
|
R/sparseMatrix-functions.R | 100.00% <100.00%> (ø) |
|
src/init.c |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update b1ca2ff...aa59b28. Read the comment docs.
Wow, what a great PR!
Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0. Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix
Could you please explain why this is necessary? NA
means missing (which should be automatically handled by the sparse matrix, because there is no value), and zero means zero (could happen after baseline correction or transformation; should be a zero in the sparse matrix). When you replace all NA
by zero and 0
by .Machine$double.xmin
isn't the sparse matrix completely dense (and require 3x times the memory of a matrix
)?
In MSnbase
we often have problems with Rcpp
(mostly different complied versions). While it is a great package I don't want to add this dependency if it is not really necessary. So I would maybe replace the .colMedians
with apply(x, 2, median)
instead (would be slower but IMHO it isn't heavily used by the users anyway) making maintaining easier.
Wow, what a great PR!
Thanks! :)
Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0. Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix
Could you please explain why this is necessary?
NA
means missing (which should be automatically handled by the sparse matrix, because there is no value), and zero means zero (could happen after baseline correction or transformation; should be a zero in the sparse matrix). When you replace allNA
by zero and0
by.Machine$double.xmin
isn't the sparse matrix completely dense (and require 3x times the memory of amatrix
)?
In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.
Basically, the idea is to convert the zeros to .Machine$double.xmin
and treat the default values of the sparseMatrix as missing. In this way, you preserve the true zeros, and the NAs don't occupy the memory anymore. The tricky part is taking care of this when you do algebra with this matrix, because you want to get the right results (the zeros of the sparse matrix should behave like NAs). For this reason, I wrote the C++ versions colMeans and colMedians. Basically, these two give you the correct result, treating the zeros as NAs.
When you define the matrix, you have by default NAs, and you fill the elements with some values. In this case, you just need to convert the values equal to zero to .Machine$double.xmin
and fill a sparse matrix. The zeros will take the role of the old NAs.
In
MSnbase
we often have problems withRcpp
(mostly different complied versions). While it is a great package I don't want to add this dependency if it is not really necessary. So I would maybe replace the.colMedians
withapply(x, 2, median)
instead (would be slower but IMHO it isn't heavily used by the users anyway) making maintaining easier.
That's easily solved by writing the C versions. Unfortunately, apply
wouldn't work because it treats the zeros as zeros while they need to be treated as NAs.
In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.
Thanks for clarification. I wasn't aware of this detail of the Matrix
implementation. I am wondering if we need to reimplement our own SparseMatrix
or just treat NA
and 0
(which isn't present very often) as missing.
I have to think about it a while. Finally the intensityMatrix
should have a format that everybody can work with as expected (empty cells should be missing (NA
) or imputed by spectra information (in that case it would be a dense matrix anyway).
In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.
Thanks for clarification. I wasn't aware of this detail of the
Matrix
implementation. I am wondering if we need to reimplement our ownSparseMatrix
or just treatNA
and0
(which isn't present very often) as missing. I have to think about it a while. Finally theintensityMatrix
should have a format that everybody can work with as expected (empty cells should be missing (NA
) or imputed by spectra information (in that case it would be a dense matrix anyway).
Another completely different route is using something like bigmemory
.
I've made some changes to use sparse matrices from "Matrix".
I think it may be better in terms of memory usage, especially for imaging data, where data are often quite sparse.