A more efficient in-memory backend

jorainer commented 2 years ago

Our original MsBackendDataFrame backend is not that fast and efficient as originally planned. Using a single DataFrame to store all data was most likely not the best idea since a) subsetting of DataFrame is slow (compared to a data.frame and b) storing all data, also the peaks data within one DataFrame has an impact on any (subsetting) operation on that data.

I've now implemented a MsBackendDF with this definition:

setClass("MsBackendDF",
         contains = "MsBackend",
         slots = c(spectraData = "data.frame",
                   peaksData = "list",
                   peaksDataFrame = "list"),
         prototype = prototype(spectraData = data.frame(),
                               peaksData = list(),
                               peaksDataFrame = list(),
                               readonly = FALSE,
                               version = "0.1"))

Spectra data is thus stored within a data.frame (instead of a DataFrame) and the peaks data is stored in its more natural way, i.e. as a list of numerical matrices. Also, I've added a new slot peaksDataFrame that would allow to store any additional peaks variable in it (e.g. peak annotations etc).

I did some performance calculations on this backend (see further below) and as expected, it is faster than the original MsBackendDataFrame. What I would now suggest is to simply add this backend as an additional backend to the Spectra package (still keeping the MsBackendDataFrame because other backends extend that) and change the MsBackendDataFrame() constructor to return a MsBackendDF instead of a MsBackendDataFrame. I think that should be a minimally invasive change that would switch the default in-memory backend from MsBackendDataFrame to MsBackendDF.

Any thoughts @lgatto and @sgibb ? If you're OK with it I would then go ahead and make a PR.

jorainer commented 2 years ago

The promised performance comparisons:

library(Spectra)
library(msdata)

fl <- system.file(
    "proteomics",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz",
    package = "msdata")
sps_od <- Spectra(fl)
sps_im <- setBackend(sps_od, backend = MsBackendDataFrame())
sps_df <- setBackend(sps_od, backend = new("MsBackendDF"))

Access of a single spectra variable:

library(microbenchmark)
microbenchmark(msLevel(sps_im),
               msLevel(sps_df))
Unit: microseconds
##            expr     min       lq     mean   median       uq     max neval cld
## msLevel(sps_im) 221.529 226.9545 239.9035 231.6955 243.0315 569.207   100   b
## msLevel(sps_df)   6.925   7.6890   9.7830  10.2260  11.2270  21.178   100  a

The same using $

microbenchmark(sps_im$msLevel,
               sps_df$msLevel)

## Unit: microseconds
##           expr      min        lq      mean   median       uq      max neval
## sps_im$msLevel 1028.080 1107.3950 1302.7935 1164.467 1261.809 8200.162   100
## sps_df$msLevel   75.674   84.5035  106.4093  102.841  119.552  326.420   100
## cld
##   b
##  a

Accessing a spectra variable is thus faster with the new backend. Getting the full spectra data:

microbenchmark(spectraData(sps_im),
               spectraData(sps_df))
## Unit: microseconds
##                expr     min      lq     mean   median       uq      max neval
## spectraData(sps_im) 494.953 523.392 573.1630 553.4555 622.9645 1025.878   100
## spectraData(sps_df) 266.627 287.211 322.0745 306.0795 321.9560 1426.623   100
## cld
##   b
##  a

Again, the new implementation is faster. Adding a new spectra variable:

rn <- rnorm(length(sps_im))
microbenchmark(sps_im$new_var <- rn,
               sps_df$new_var <- rn)
## Unit: microseconds
##                 expr       min         lq       mean    median        uq
## sps_im$new_var <- rn 51840.243 52723.1685 53878.6522 53143.387 54246.101
## sps_df$new_var <- rn   308.836   386.9755   452.4632   464.211   504.756
##       max neval cld
## 61717.621   100   b
##   802.351   100  a

Also here, the new backend is faster. Getting peaks variables, first only m/z values:

microbenchmark(mz(sps_im),
               mz(sps_df))
## Unit: microseconds
##       expr       min         lq       mean     median         uq        max
## mz(sps_im)   265.178   350.0185   469.5975   519.9465   554.7085    828.235
## mz(sps_df) 82952.300 88177.9565 99941.9456 93540.1265 98724.3780 267566.206
## neval cld
##   100  a 
##   100   b

Here the MsBackendDataFrame is faster. The reason is that the MsBackendDF stores m/z and intensity values as a list of matrix, so, getting a single peak variable requires looping through all of them. Getting the full peaks data:

microbenchmark(peaksData(sps_im),
               peaksData(sps_df))
## Unit: microseconds
##              expr       min          lq       mean     median         uq
## peaksData(sps_im) 292926.28 311304.2930 373186.469 326718.765 484839.194
## peaksData(sps_df)    514.59    740.5065   1400.863   1438.617   1685.699
##        max neval cld
## 545472.232   100   b
##   9628.732   100  a

Here the MsBackendDF is faster. Note that most code we have works on the full peaks data and almost never separately on the mz or intensity values, thus, IMHO, the overall performance of Spectra will benefit from the way the MsBackendDF stores the data.

Finally, subsetting:

microbenchmark(filterMsLevel(sps_im, msLevel = 2L),
               filterMsLevel(sps_df, msLevel = 2L))
## Unit: milliseconds
##                                expr      min       lq     mean   median
## filterMsLevel(sps_im, msLevel = 2L) 2.412850 2.612675 3.061061 2.990318
## filterMsLevel(sps_df, msLevel = 2L) 1.181261 1.334627 1.588027 1.599797
##       uq      max neval cld
## 3.184945 5.750259   100   b
## 1.735574 2.898642   100  a

set.seed(123)
idx <- sample(seq_along(sps_im), 100)
microbenchmark(sps_im[idx],
               sps_df[idx])
## Unit: microseconds
##        expr     min        lq      mean   median        uq      max neval cld
## sps_im[idx] 997.839 1042.0160 1100.4201 1094.162 1124.1785 1431.940   100   b
## sps_df[idx] 155.472  175.5825  186.9224  186.371  197.5855  278.679   100  a

Thus, the MsBackendDF can also be subsetted faster than the MsBackendDataFrame.

Summary

With the exception of extracting mz or intensity values separately, the MsBackendDF outperforms the MsBackendDataFrame for all operations. I would thus suggest to use the MsBackendDF as default in-memory backend. To avoid breaking any old code/packages I would still keep the MsBackendDataFrame as a legacy backend, but would by default return a MsBackendDF with the constructor MsBackendDataFrame(). Thus, newly created in-memory backends would use the MsBackendDF while the MsBackendDataFrame is still supported (and also needs to be, since the MsBackendMzR extends it directly.

lgatto commented 2 years ago

First quick comments:

Given that the spectra data is stored in a data.frame and the peaks data in a list, why not drop the reference to dataframes/DF and call it MsBackendMemory? Wasn't that the original plan?
What if I still want the DataFrame backend (for example if I want to add a column/list of Bioconductor S4 objects)?

jorainer commented 2 years ago

Regarding MsBackendMemory: yes, that's a better name indeed.

And regarding your second comment: makes sense - then we maybe keep MsBackendDataFrame() to return a MsBackendDataFrame object and have a second MsBackendMemory() constructor for the new in-memory implementation. I could update all documentations to use primarily the new MsBackendMemory as in-memory backend to promote it. That might also be less confusing to the user and reduce the potential to break existing code/analyses or other packages.

rformassspectrometry / Spectra

A more efficient in-memory backend #246

Summary