Closed jorainer closed 2 years ago
The promised performance comparisons:
library(Spectra)
library(msdata)
fl <- system.file(
"proteomics",
"TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz",
package = "msdata")
sps_od <- Spectra(fl)
sps_im <- setBackend(sps_od, backend = MsBackendDataFrame())
sps_df <- setBackend(sps_od, backend = new("MsBackendDF"))
Access of a single spectra variable:
library(microbenchmark)
microbenchmark(msLevel(sps_im),
msLevel(sps_df))
Unit: microseconds
## expr min lq mean median uq max neval cld
## msLevel(sps_im) 221.529 226.9545 239.9035 231.6955 243.0315 569.207 100 b
## msLevel(sps_df) 6.925 7.6890 9.7830 10.2260 11.2270 21.178 100 a
The same using $
microbenchmark(sps_im$msLevel,
sps_df$msLevel)
## Unit: microseconds
## expr min lq mean median uq max neval
## sps_im$msLevel 1028.080 1107.3950 1302.7935 1164.467 1261.809 8200.162 100
## sps_df$msLevel 75.674 84.5035 106.4093 102.841 119.552 326.420 100
## cld
## b
## a
Accessing a spectra variable is thus faster with the new backend. Getting the full spectra data:
microbenchmark(spectraData(sps_im),
spectraData(sps_df))
## Unit: microseconds
## expr min lq mean median uq max neval
## spectraData(sps_im) 494.953 523.392 573.1630 553.4555 622.9645 1025.878 100
## spectraData(sps_df) 266.627 287.211 322.0745 306.0795 321.9560 1426.623 100
## cld
## b
## a
Again, the new implementation is faster. Adding a new spectra variable:
rn <- rnorm(length(sps_im))
microbenchmark(sps_im$new_var <- rn,
sps_df$new_var <- rn)
## Unit: microseconds
## expr min lq mean median uq
## sps_im$new_var <- rn 51840.243 52723.1685 53878.6522 53143.387 54246.101
## sps_df$new_var <- rn 308.836 386.9755 452.4632 464.211 504.756
## max neval cld
## 61717.621 100 b
## 802.351 100 a
Also here, the new backend is faster. Getting peaks variables, first only m/z values:
microbenchmark(mz(sps_im),
mz(sps_df))
## Unit: microseconds
## expr min lq mean median uq max
## mz(sps_im) 265.178 350.0185 469.5975 519.9465 554.7085 828.235
## mz(sps_df) 82952.300 88177.9565 99941.9456 93540.1265 98724.3780 267566.206
## neval cld
## 100 a
## 100 b
Here the MsBackendDataFrame
is faster. The reason is that the MsBackendDF
stores m/z and intensity values as a list
of matrix
, so, getting a single peak variable requires looping through all of them. Getting the full peaks data:
microbenchmark(peaksData(sps_im),
peaksData(sps_df))
## Unit: microseconds
## expr min lq mean median uq
## peaksData(sps_im) 292926.28 311304.2930 373186.469 326718.765 484839.194
## peaksData(sps_df) 514.59 740.5065 1400.863 1438.617 1685.699
## max neval cld
## 545472.232 100 b
## 9628.732 100 a
Here the MsBackendDF
is faster. Note that most code we have works on the full peaks data and almost never separately on the mz
or intensity
values, thus, IMHO, the overall performance of Spectra
will benefit from the way the MsBackendDF
stores the data.
Finally, subsetting:
microbenchmark(filterMsLevel(sps_im, msLevel = 2L),
filterMsLevel(sps_df, msLevel = 2L))
## Unit: milliseconds
## expr min lq mean median
## filterMsLevel(sps_im, msLevel = 2L) 2.412850 2.612675 3.061061 2.990318
## filterMsLevel(sps_df, msLevel = 2L) 1.181261 1.334627 1.588027 1.599797
## uq max neval cld
## 3.184945 5.750259 100 b
## 1.735574 2.898642 100 a
set.seed(123)
idx <- sample(seq_along(sps_im), 100)
microbenchmark(sps_im[idx],
sps_df[idx])
## Unit: microseconds
## expr min lq mean median uq max neval cld
## sps_im[idx] 997.839 1042.0160 1100.4201 1094.162 1124.1785 1431.940 100 b
## sps_df[idx] 155.472 175.5825 186.9224 186.371 197.5855 278.679 100 a
Thus, the MsBackendDF
can also be subsetted faster than the MsBackendDataFrame
.
With the exception of extracting mz
or intensity
values separately, the MsBackendDF
outperforms the MsBackendDataFrame
for all operations. I would thus suggest to use the MsBackendDF
as default in-memory backend. To avoid breaking any old code/packages I would still keep the MsBackendDataFrame
as a legacy backend, but would by default return a MsBackendDF
with the constructor MsBackendDataFrame()
. Thus, newly created in-memory backends would use the MsBackendDF
while the MsBackendDataFrame
is still supported (and also needs to be, since the MsBackendMzR
extends it directly.
First quick comments:
data.frame
and the peaks data in a list
, why not drop the reference to dataframes/DF and call it MsBackendMemory
? Wasn't that the original plan?DataFrame
backend (for example if I want to add a column/list of Bioconductor S4 objects)? Regarding MsBackendMemory
: yes, that's a better name indeed.
And regarding your second comment: makes sense - then we maybe keep MsBackendDataFrame()
to return a MsBackendDataFrame
object and have a second MsBackendMemory()
constructor for the new in-memory implementation. I could update all documentations to use primarily the new MsBackendMemory
as in-memory backend to promote it. That might also be less confusing to the user and reduce the potential to break existing code/analyses or other packages.
Our original
MsBackendDataFrame
backend is not that fast and efficient as originally planned. Using a singleDataFrame
to store all data was most likely not the best idea since a) subsetting ofDataFrame
is slow (compared to adata.frame
and b) storing all data, also the peaks data within oneDataFrame
has an impact on any (subsetting) operation on that data.I've now implemented a
MsBackendDF
with this definition:Spectra data is thus stored within a
data.frame
(instead of aDataFrame
) and the peaks data is stored in its more natural way, i.e. as alist
of numerical matrices. Also, I've added a new slotpeaksDataFrame
that would allow to store any additional peaks variable in it (e.g. peak annotations etc).I did some performance calculations on this backend (see further below) and as expected, it is faster than the original
MsBackendDataFrame
. What I would now suggest is to simply add this backend as an additional backend to theSpectra
package (still keeping theMsBackendDataFrame
because other backends extend that) and change theMsBackendDataFrame()
constructor to return aMsBackendDF
instead of aMsBackendDataFrame
. I think that should be a minimally invasive change that would switch the default in-memory backend fromMsBackendDataFrame
toMsBackendDF
.Any thoughts @lgatto and @sgibb ? If you're OK with it I would then go ahead and make a PR.