Open jorainer opened 1 year ago
Comparing impact of returning a data.frame
instead of matrix
by peaksData
:
library(Spectra)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
package = "msdata")
be <- backendInitialize(MsBackendMzR(), fl)
mem <- backendInitialize(MsBackendMemory(), spectraData(be))
mem2 <- backendInitialize(MsBackendMemory2(), spectraData(be))
class(mem@peaksData[[1L]])
[1] "matrix" "array"
class(mem2@peaksData[[1L]])
[1] "data.frame"
The MsBackendMemory2
is essentially identical to the MsBackendMemory
only that it keeps "mz"
and "intensity"
values in a data.frame
instead of a matrix
.
MsBackend
levelPerformance of accessing peaksData
, mz
and intensity
.
library(testthat)
library(microbenchmark)
microbenchmark(peaksData(mem), peaksData(mem2))
Unit: microseconds
expr min lq mean median uq max neval cld
peaksData(mem) 5.486 5.6200 5.81063 5.6945 5.8110 13.633 100 a
peaksData(mem2) 5.663 5.7825 6.25971 5.8550 5.9495 42.090 100 a
No difference (not unexpectedly).
expect_equal(mz(mem), mz(mem2))
microbenchmark(mz(mem), mz(mem2), times = 13)
Unit: milliseconds
expr min lq mean median uq max neval cld
mz(mem) 36.67263 38.38423 40.14132 38.97847 41.48369 49.28339 13 a
mz(mem2) 46.58165 48.36403 51.12181 51.03753 52.70381 58.58926 13 b
The matrix
-based backend is about 1.3 times faster. Note: access using $mz
and $intensity
instead of [[
would increase performance again
library(IRanges)
mymz <- function(x) {
NumericList(lapply(x@peaksData, function(z) z$mz), compress = FALSE)
}
expect_equal(mz(mem), mymz(mem2))
microbenchmark(mz(mem), mz(mem2), mymz(mem2), times = 13)
Unit: milliseconds
expr min lq mean median uq max neval cld
mz(mem) 36.32873 36.84343 38.59994 38.31872 39.07635 43.59662 13 a
mz(mem2) 46.81475 48.03289 65.21417 50.94912 52.18459 246.95200 13 a
mymz(mem2) 36.11362 36.58824 38.32284 38.28542 39.72803 41.07391 13 a
Also for spectraData
, the matrix
-based backend is about 1.25 times faster (not shown).
Spectra
levelPerforming some tests at the Spectra
level. Here we might see the difference to subset matrix
or data.frame
by row.
sps <- Spectra(mem)
sps2 <- Spectra(mem2)
Accessing m/z:
expect_equal(mz(sps), mz(sps2))
microbenchmark(mz(sps), mz(sps2), times = 13)
Unit: milliseconds
expr min lq mean median uq max neval cld
mz(sps) 35.9689 36.66129 38.18032 37.12317 38.36346 42.75073 13 a
mz(sps2) 45.1387 46.18861 47.53825 46.72444 48.27922 52.71166 13 b
Similar to above. Next we compare performance for compareSpectra
:
sps_sub <- filterMsLevel(sps)[1:20]
sps2_sub <- filterMsLevel(sps2)[1:20]
expect_equal(compareSpectra(sps_sub), compareSpectra(sps2_sub))
microbenchmark(compareSpectra(sps_sub), compareSpectra(sps2_sub), times = 13)
Unit: milliseconds
expr min lq mean median uq max
compareSpectra(sps_sub) 47.81649 50.13000 52.51268 51.86670 54.86539 58.57741
compareSpectra(sps2_sub) 74.27192 75.84804 78.10415 76.80175 81.55561 84.79143
neval cld
13 a
13 b
data.frame
backend is about 1.5 times slower. The reason being most likely accessing columns of the peaks matrix which could eventually be improved by directly using $mz
and $intensity
.
For the filtering operations the subset by row will be an issue. Here we know that matrix
is faster than data.frame
.
sps_f <- filterIntensity(sps, intensity = 1)
sps2_f <- filterIntensity(sps2, intensity = 1)
microbenchmark(mz(sps), mz(sps2), mz(sps_f), mz(sps2_f), times = 13)
Unit: milliseconds
expr min lq mean median uq max neval
mz(sps) 37.1891 38.26193 38.93608 38.79268 39.34904 41.02237 13
mz(sps2) 47.2803 50.00244 52.43369 52.15033 54.37638 58.17806 13
mz(sps_f) 194.8442 198.56253 207.51396 203.34011 207.82562 243.59039 13
mz(sps2_f) 386.7848 392.46101 404.23477 397.99145 404.28471 456.45066 13
cld
a
b
c
d
The data.frame
backend is thus 2 times slower.
Maybe a workable solution would be to allow the backend to return a matrix
if columns = c("mz", "intensity")
in the peaksData
call and a data.frame
when additional columns are requested. Thus peaksData,MsBackend
should return a list
of two-dimensional data structures (e.g., matrix
or data.frame
) but we don't impose that it has to be always a matrix
(like we do now). The functions in Spectra
working with these lists
of peak data will use [
operations that would be supported by both matrix
and data.frame
.
Thus, we would only run into the above described performance issues if the user has data with additional peak annotations and if he/she requests them from the backend by specifically using e.g. columns = c("mz", "intensity", "peak_annotation")
in the peaksData
call.
Any opinion @lgatto @sgibb ?
I'm developing currently in the jomain branch. First change:
peaksData,MsBackendMemory
returns a list
of matrix
by default (with parameter columns = c("mz", "intensity")
.peaksData,MsBackendMemory
returns a list
of data.frame
s if additional/other peak variables are requested.that should enable to start working on ensuring peaksData,Spectra
works with filtering peak matrices again.
In Spectra
, the .peaksapply
will be the key function. That function ensures that the lazy processing queue is applied properly and the eventually filtered/changes peak data is returned.
The good news: this update already enables the main issue/problem discussed in the last dev call:
Create a Spectra
with an additional peak annotation:
library(Spectra)
df <- data.frame(rtime = c(1.1, 1.2, 1.3, 1.4),
msLevel = 1L)
df$mz <- list(c(13, 14.1, 22, 23, 24, 49),
c(45.1, 56),
c(34.3, 134.4, 344, 443),
c(12.1, 31))
df$intensity <- list(c(100, 300, 30, 120, 12, 34),
c(345, 234),
c(123, 124, 145, 3),
c(122, 421))
#' add some arbitrary information for each peak to the data.frame
df$ann <- list(c("a", NA, "b", "c", "d", NA),
c("e", "f"),
c("g", "h", "i", NA),
c("j", "k"))
B <- Spectra(df, peaksVariables = c("mz", "intensity", "ann"))
peaksData(B)[[1L]]
mz intensity
[1,] 13.0 100
[2,] 14.1 300
[3,] 22.0 30
[4,] 23.0 120
[5,] 24.0 12
[6,] 49.0 34
peaksData(B, columns = peaksVariables(B))[[1L]]
mz intensity ann
1 13.0 100 a
2 14.1 300 <NA>
3 22.0 30 b
4 23.0 120 c
5 24.0 12 d
6 49.0 34 <NA>
So, in the first case a matrix
is returned, in the second a data.frame
. This also enables (and does no longer break) filtering:
B2 <- filterMzValues(B, 23, tolerance = 1)
peaksData(B2, columns = c("mz", "intensity"))[[1L]]
mz intensity
[1,] 22 30
[2,] 23 120
[3,] 24 12
And including additional peak variables:
peaksData(B2, columns = peaksVariables(B))[[1L]]
mz intensity ann
3 22 30 b
4 23 120 c
5 24 12 d
What is not yet working (needs some fixes on the Spectra
level) is when individual peak variables are extracted:
B2$ann[[1L]]
[1] "a" NA "b" "c" "d" NA
In addition, spectraData,Spectra
needs to ensure peaks variables are correctly subsetted/processed if lazy processing queue is not empty.
spectraData
now also correctly subsets the peaks data if needed:
spectraData(B2, column = "ann")
DataFrame with 4 rows and 1 column
ann
<list>
1 b,c,d
2
3
4
And applyProcessing
correctly updates the peaks data and additional peaks variables in the backend:
B3 <- applyProcessing(B2)
B3@backend@peaksDataFrame
[[1]]
ann
3 b
4 c
5 d
[[2]]
[1] ann
<0 rows> (or 0-length row.names)
[[3]]
[1] ann
<0 rows> (or 0-length row.names)
[[4]]
[1] ann
<0 rows> (or 0-length row.names)
We need also to ensure that replacing peaks variables works properly with a non-empty processing queue (this is in fact a bug in the current implementation).
Replacing peaks variables works (in fact, throws an error if the processing queue is not empty). After applyProcessing
peaks variables can be replaced with spectraData<-
and $<-
. All this was added in the recent commits (in the jomain branch).
This is related to issue #287 and was discussed here: change the
MsBackend
API to supportpeaksData
to return alist
ofdata.frame
instead (or in addition) to alist
ofmatrix
. This would enable native support of peak variables (annotations) that are notnumeric
.This change should not break any functionality, but we need to check also if and how strong it's impact on the performance is.