Closed jorainer closed 4 years ago
I've implemented a compare_spectra_chunk
function that realizes peak data in chunks. The performance comparison is shown below:
library(Spectra)
sciex_file <- normalizePath(
dir(system.file("sciex", package = "msdata"), full.names = TRUE))
sciex_mzr <- backendInitialize(MsBackendMzR(), files = sciex_file)
sps_mzr <- Spectra(sciex_mzr)
sps_mem <- setBackend(sps_mzr, MsBackendDataFrame())
library(microbenchmark)
a <- Spectra:::.compare_spectra(sps_mzr[1:4], sps_mzr)
b <- Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mzr)
all.equal(a, b)
[1] TRUE
microbenchmark(
Spectra:::.compare_spectra(sps_mzr[1:4], sps_mzr),
Spectra:::.compare_spectra(sps_mem[1:4], sps_mem),
Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mzr),
Spectra:::.compare_spectra_chunk(sps_mem[1:4], sps_mem),
Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 100),
Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 500),
Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 1000),
times = 5)
Unit: seconds
expr
Spectra:::.compare_spectra(sps_mzr[1:4], sps_mzr)
Spectra:::.compare_spectra(sps_mem[1:4], sps_mem)
Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mzr)
Spectra:::.compare_spectra_chunk(sps_mem[1:4], sps_mem)
Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 100)
Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 500)
Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 1000)
min lq mean median uq max neval cld
110.745180 112.933538 114.244213 113.079181 117.004786 117.458378 5 c
32.986638 33.340157 33.659584 33.755268 34.097984 34.117871 5 b
7.173629 7.196736 7.472038 7.565707 7.661782 7.762337 5 a
6.533229 6.686634 6.802989 6.702775 6.899326 7.192983 5 a
7.097447 7.111048 7.312058 7.346043 7.436461 7.569292 5 a
6.441530 6.728765 6.854443 6.762189 6.981145 7.358585 5 a
6.942868 7.036270 7.159263 7.073311 7.151368 7.592497 5 a
Summarizing: chunk processing increases performance considerably and avoids also loading all data into memory.
The
compareSpectra
function is not as fast as I would like to have it. Seems that the iterative calling of the.peaksapply
function to realize the peak matrix (one at a time) is the main bottleneck. It would be faster to realize all peak matrices first and then to loop over them:However, requiring all data in memory would not work for very large scale data sets. One possibility would be to process the data in chunks, i.e. define a chunk size, e.g. 1000 and then always realize this number of spectra in memory and loop over them.