API CHANGE: support peaksData to return data.frame or matrix

jorainer commented 1 year ago

This is related to issue #287 and was discussed here: change the MsBackend API to support peaksData to return a list of data.frame instead (or in addition) to a list of matrix. This would enable native support of peak variables (annotations) that are not numeric.

This change should not break any functionality, but we need to check also if and how strong it's impact on the performance is.

jorainer commented 1 year ago

Comparing impact of returning a data.frame instead of matrix by peaksData:

library(Spectra)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
                  package = "msdata")
be <- backendInitialize(MsBackendMzR(), fl)

mem <- backendInitialize(MsBackendMemory(), spectraData(be))
mem2 <- backendInitialize(MsBackendMemory2(), spectraData(be))

class(mem@peaksData[[1L]])
[1] "matrix" "array" 
class(mem2@peaksData[[1L]])
[1] "data.frame"

The MsBackendMemory2 is essentially identical to the MsBackendMemory only that it keeps "mz" and "intensity" values in a data.frame instead of a matrix.

Tests at the `MsBackend` level

Performance of accessing peaksData, mz and intensity.

library(testthat)
library(microbenchmark)
microbenchmark(peaksData(mem), peaksData(mem2))
Unit: microseconds
            expr   min     lq    mean median     uq    max neval cld
  peaksData(mem) 5.486 5.6200 5.81063 5.6945 5.8110 13.633   100   a
 peaksData(mem2) 5.663 5.7825 6.25971 5.8550 5.9495 42.090   100   a

No difference (not unexpectedly).

expect_equal(mz(mem), mz(mem2))
microbenchmark(mz(mem), mz(mem2), times = 13)
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval cld
  mz(mem) 36.67263 38.38423 40.14132 38.97847 41.48369 49.28339    13  a 
 mz(mem2) 46.58165 48.36403 51.12181 51.03753 52.70381 58.58926    13   b

The matrix-based backend is about 1.3 times faster. Note: access using $mz and $intensity instead of [[ would increase performance again

library(IRanges)
mymz <- function(x) {
    NumericList(lapply(x@peaksData, function(z) z$mz), compress = FALSE)
}
expect_equal(mz(mem), mymz(mem2))
microbenchmark(mz(mem), mz(mem2), mymz(mem2), times = 13)
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval cld
    mz(mem) 36.32873 36.84343 38.59994 38.31872 39.07635  43.59662    13   a
   mz(mem2) 46.81475 48.03289 65.21417 50.94912 52.18459 246.95200    13   a
 mymz(mem2) 36.11362 36.58824 38.32284 38.28542 39.72803  41.07391    13   a

Also for spectraData, the matrix-based backend is about 1.25 times faster (not shown).

Tests at the `Spectra` level

Performing some tests at the Spectra level. Here we might see the difference to subset matrix or data.frame by row.

sps <- Spectra(mem)
sps2 <- Spectra(mem2)

Accessing m/z:

expect_equal(mz(sps), mz(sps2))
microbenchmark(mz(sps), mz(sps2), times = 13)
Unit: milliseconds
     expr     min       lq     mean   median       uq      max neval cld
  mz(sps) 35.9689 36.66129 38.18032 37.12317 38.36346 42.75073    13  a 
 mz(sps2) 45.1387 46.18861 47.53825 46.72444 48.27922 52.71166    13   b

Similar to above. Next we compare performance for compareSpectra:

sps_sub <- filterMsLevel(sps)[1:20]
sps2_sub <- filterMsLevel(sps2)[1:20]

expect_equal(compareSpectra(sps_sub), compareSpectra(sps2_sub))
microbenchmark(compareSpectra(sps_sub), compareSpectra(sps2_sub), times = 13)
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max
  compareSpectra(sps_sub) 47.81649 50.13000 52.51268 51.86670 54.86539 58.57741
 compareSpectra(sps2_sub) 74.27192 75.84804 78.10415 76.80175 81.55561 84.79143
 neval cld
    13  a 
    13   b

data.frame backend is about 1.5 times slower. The reason being most likely accessing columns of the peaks matrix which could eventually be improved by directly using $mz and $intensity.

For the filtering operations the subset by row will be an issue. Here we know that matrix is faster than data.frame.

sps_f <- filterIntensity(sps, intensity = 1)
sps2_f <- filterIntensity(sps2, intensity = 1)

microbenchmark(mz(sps), mz(sps2), mz(sps_f), mz(sps2_f), times = 13)
Unit: milliseconds
       expr      min        lq      mean    median        uq       max neval
    mz(sps)  37.1891  38.26193  38.93608  38.79268  39.34904  41.02237    13
   mz(sps2)  47.2803  50.00244  52.43369  52.15033  54.37638  58.17806    13
  mz(sps_f) 194.8442 198.56253 207.51396 203.34011 207.82562 243.59039    13
 mz(sps2_f) 386.7848 392.46101 404.23477 397.99145 404.28471 456.45066    13
  cld
 a   
  b  
   c 
    d

The data.frame backend is thus 2 times slower.

jorainer commented 1 year ago

Maybe a workable solution would be to allow the backend to return a matrix if columns = c("mz", "intensity") in the peaksData call and a data.frame when additional columns are requested. Thus peaksData,MsBackend should return a list of two-dimensional data structures (e.g., matrix or data.frame) but we don't impose that it has to be always a matrix (like we do now). The functions in Spectra working with these lists of peak data will use [ operations that would be supported by both matrix and data.frame.

Thus, we would only run into the above described performance issues if the user has data with additional peak annotations and if he/she requests them from the backend by specifically using e.g. columns = c("mz", "intensity", "peak_annotation") in the peaksData call.

Any opinion @lgatto @sgibb ?

jorainer commented 1 year ago

I'm developing currently in the jomain branch. First change:

peaksData,MsBackendMemory returns a list of matrix by default (with parameter columns = c("mz", "intensity").
peaksData,MsBackendMemory returns a list of data.frames if additional/other peak variables are requested.

that should enable to start working on ensuring peaksData,Spectra works with filtering peak matrices again.

jorainer commented 1 year ago

In Spectra, the .peaksapply will be the key function. That function ensures that the lazy processing queue is applied properly and the eventually filtered/changes peak data is returned.

jorainer commented 1 year ago

The good news: this update already enables the main issue/problem discussed in the last dev call:

Create a Spectra with an additional peak annotation:

library(Spectra)

df <- data.frame(rtime = c(1.1, 1.2, 1.3, 1.4),
                 msLevel = 1L)
df$mz <- list(c(13, 14.1, 22, 23, 24, 49),
              c(45.1, 56),
              c(34.3, 134.4, 344, 443),
              c(12.1, 31))
df$intensity <- list(c(100, 300, 30, 120, 12, 34),
                     c(345, 234),
                     c(123, 124, 145, 3),
                     c(122, 421))

#' add some arbitrary information for each peak to the data.frame
df$ann <- list(c("a", NA, "b", "c", "d", NA),
               c("e", "f"),
               c("g", "h", "i", NA),
               c("j", "k"))
B <- Spectra(df, peaksVariables = c("mz", "intensity", "ann"))

peaksData(B)[[1L]]
       mz intensity
[1,] 13.0       100
[2,] 14.1       300
[3,] 22.0        30
[4,] 23.0       120
[5,] 24.0        12
[6,] 49.0        34

peaksData(B, columns = peaksVariables(B))[[1L]]
    mz intensity  ann
1 13.0       100    a
2 14.1       300 <NA>
3 22.0        30    b
4 23.0       120    c
5 24.0        12    d
6 49.0        34 <NA>

So, in the first case a matrix is returned, in the second a data.frame. This also enables (and does no longer break) filtering:

B2 <- filterMzValues(B, 23, tolerance = 1)
peaksData(B2, columns = c("mz", "intensity"))[[1L]]
     mz intensity
[1,] 22        30
[2,] 23       120
[3,] 24        12

And including additional peak variables:

peaksData(B2, columns = peaksVariables(B))[[1L]]
  mz intensity ann
3 22        30   b
4 23       120   c
5 24        12   d

jorainer commented 1 year ago

What is not yet working (needs some fixes on the Spectra level) is when individual peak variables are extracted:

B2$ann[[1L]]
[1] "a" NA  "b" "c" "d" NA

jorainer commented 1 year ago

In addition, spectraData,Spectra needs to ensure peaks variables are correctly subsetted/processed if lazy processing queue is not empty.

jorainer commented 1 year ago

spectraData now also correctly subsets the peaks data if needed:

spectraData(B2, column = "ann")
DataFrame with 4 rows and 1 column
     ann
  <list>
1  b,c,d
2       
3       
4

jorainer commented 1 year ago

And applyProcessing correctly updates the peaks data and additional peaks variables in the backend:

B3 <- applyProcessing(B2)
B3@backend@peaksDataFrame
[[1]]
  ann
3   b
4   c
5   d

[[2]]
[1] ann
<0 rows> (or 0-length row.names)

[[3]]
[1] ann
<0 rows> (or 0-length row.names)

[[4]]
[1] ann
<0 rows> (or 0-length row.names)

jorainer commented 1 year ago

We need also to ensure that replacing peaks variables works properly with a non-empty processing queue (this is in fact a bug in the current implementation).

jorainer commented 1 year ago

Replacing peaks variables works (in fact, throws an error if the processing queue is not empty). After applyProcessing peaks variables can be replaced with spectraData<- and $<-. All this was added in the recent commits (in the jomain branch).

rformassspectrometry / Spectra