sgibb / MALDIquant

Quantitative Analysis of Mass Spectrometry Data
https://strimmerlab.github.io/software/maldiquant/
60 stars 25 forks source link

using sparse matrix #70

Open paoloinglese opened 2 years ago

paoloinglese commented 2 years ago

I've made some changes to use sparse matrices from "Matrix".

I think it may be better in terms of memory usage, especially for imaging data, where data are often quite sparse.

sgibb commented 2 years ago

Thanks a lot for this great contribution! I will need some time to ensure it doesn't have any negative side effects to (my) workflows and downstream dependencies.

paoloinglese commented 2 years ago

I'm having a look at the errors and fixing them.

paoloinglese commented 2 years ago

Hi,

I've made some modifications to the code. Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0. Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix (see sparseMatrix-functions.R). In this way, you can continue to keep the zeros and the NAs continue to play the role of missing values.

I've also left the support to dense matrices, so you can easily adapt the code in case you need some legacy support.

I had to write new C++ functions for the colMeans and colMedians (/src/colMediansArma.cpp), where the zeros of the sparse matrix behave like NAs, as expected. Please also have a look at the new tests for the sparse matrices. I wish there's no huge mistake. I've tested the code on a Mac, I couldn't check if it runs on Win and Linux, unfortunately.

Of course, the support to the algebra of the sparseMatrixNA is missing (if you sum a "missing" value and number you get a number instead of the "missing"), so you can have some "weird" behaviours for downstream analysis. But, that would require a new class. Another solution would be dropping the difference between NAs and zeros, in that way the implementation would be straightforward.

codecov[bot] commented 2 years ago

Codecov Report

Merging #70 (aa59b28) into master (b1ca2ff) will decrease coverage by 0.66%. The diff coverage is 84.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   88.44%   87.78%   -0.67%     
==========================================
  Files          81       82       +1     
  Lines        1411     1514     +103     
==========================================
+ Hits         1248     1329      +81     
- Misses        163      185      +22     
Impacted Files Coverage Δ
R/as.matrix-functions.R 80.95% <63.63%> (-19.05%) :arrow_down:
src/colMediansArma.cpp 73.84% <73.84%> (ø)
R/intensityMatrix-functions.R 93.33% <75.00%> (-6.67%) :arrow_down:
R/colMedians-functions.R 100.00% <100.00%> (ø)
R/filterPeaks-functions.R 100.00% <100.00%> (ø)
R/merge-functions.R 100.00% <100.00%> (ø)
R/sparseMatrix-functions.R 100.00% <100.00%> (ø)
src/init.c

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update b1ca2ff...aa59b28. Read the comment docs.

sgibb commented 2 years ago

Wow, what a great PR!

Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0. Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix

Could you please explain why this is necessary? NA means missing (which should be automatically handled by the sparse matrix, because there is no value), and zero means zero (could happen after baseline correction or transformation; should be a zero in the sparse matrix). When you replace all NA by zero and 0 by .Machine$double.xmin isn't the sparse matrix completely dense (and require 3x times the memory of a matrix)?

In MSnbase we often have problems with Rcpp (mostly different complied versions). While it is a great package I don't want to add this dependency if it is not really necessary. So I would maybe replace the .colMedians with apply(x, 2, median) instead (would be slower but IMHO it isn't heavily used by the users anyway) making maintaining easier.

paoloinglese commented 2 years ago

Wow, what a great PR!

Thanks! :)

Since you make a distinction between 0 and NA, I've used a simple hack to drop the NAs and keep the 0. Basically, the 0s are transformed into .Machine$double.xmin and the NAs are set to 0 before making the sparse matrix

Could you please explain why this is necessary? NA means missing (which should be automatically handled by the sparse matrix, because there is no value), and zero means zero (could happen after baseline correction or transformation; should be a zero in the sparse matrix). When you replace all NA by zero and 0 by .Machine$double.xmin isn't the sparse matrix completely dense (and require 3x times the memory of a matrix)?

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Basically, the idea is to convert the zeros to .Machine$double.xmin and treat the default values of the sparseMatrix as missing. In this way, you preserve the true zeros, and the NAs don't occupy the memory anymore. The tricky part is taking care of this when you do algebra with this matrix, because you want to get the right results (the zeros of the sparse matrix should behave like NAs). For this reason, I wrote the C++ versions colMeans and colMedians. Basically, these two give you the correct result, treating the zeros as NAs. When you define the matrix, you have by default NAs, and you fill the elements with some values. In this case, you just need to convert the values equal to zero to .Machine$double.xmin and fill a sparse matrix. The zeros will take the role of the old NAs.

In MSnbase we often have problems with Rcpp (mostly different complied versions). While it is a great package I don't want to add this dependency if it is not really necessary. So I would maybe replace the .colMedians with apply(x, 2, median) instead (would be slower but IMHO it isn't heavily used by the users anyway) making maintaining easier.

That's easily solved by writing the C versions. Unfortunately, apply wouldn't work because it treats the zeros as zeros while they need to be treated as NAs.

sgibb commented 2 years ago

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Thanks for clarification. I wasn't aware of this detail of the Matrix implementation. I am wondering if we need to reimplement our own SparseMatrix or just treat NA and 0 (which isn't present very often) as missing. I have to think about it a while. Finally the intensityMatrix should have a format that everybody can work with as expected (empty cells should be missing (NA) or imputed by spectra information (in that case it would be a dense matrix anyway).

paoloinglese commented 2 years ago

In sparseMatrix, NAs occupy memory unfortunately. Otherwise, it would be straightforward to implement it.

Thanks for clarification. I wasn't aware of this detail of the Matrix implementation. I am wondering if we need to reimplement our own SparseMatrix or just treat NA and 0 (which isn't present very often) as missing. I have to think about it a while. Finally the intensityMatrix should have a format that everybody can work with as expected (empty cells should be missing (NA) or imputed by spectra information (in that case it would be a dense matrix anyway).

Another completely different route is using something like bigmemory.