rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
37 stars 25 forks source link

Improve performance of compareSpectra #116

Closed jorainer closed 4 years ago

jorainer commented 4 years ago

The compareSpectra function is not as fast as I would like to have it. Seems that the iterative calling of the .peaksapply function to realize the peak matrix (one at a time) is the main bottleneck. It would be faster to realize all peak matrices first and then to loop over them:

.compare_spectra_mem <- function(x, y, MAPFUN = joinPeaks, tolerance = 0,
                                 ppm = 20, FUN = ndotproduct, ...) {
    mat <- matrix(NA_real_, nrow = length(x), ncol = length(y),
                  dimnames = list(spectraNames(x), spectraNames(y)))
    x <- Spectra:::.peaksapply(x)
    y <- Spectra:::.peaksapply(y)
    for (i in seq_along(x)) {
        for (j in seq_along(y)) {
            peak_map <- MAPFUN(x[[i]], y[[j]],
                               tolerance = tolerance, ppm = ppm, ...)
            mat[i, j] <- FUN(peak_map[[1L]], peak_map[[2L]], ...)
            }
    }
    mat
}

library(microbenchmark)
microbenchmark(
    Spectra:::.compare_spectra(all_sps[1], all_sps[1:1000]),
    .compare_spectra_mem(all_sps[1], all_sps[1:1000]),
    times = 5)

Unit: milliseconds
                                                    expr      min        lq
 Spectra:::.compare_spectra(all_sps[1], all_sps[1:1000]) 5478.865 5814.8966
       .compare_spectra_mem(all_sps[1], all_sps[1:1000])  183.608  188.5381
      mean    median        uq       max neval cld
 5869.4490 5880.5818 6046.3839 6126.5182     5   b
  190.5342  190.2187  190.8331  199.4731     5  a 

However, requiring all data in memory would not work for very large scale data sets. One possibility would be to process the data in chunks, i.e. define a chunk size, e.g. 1000 and then always realize this number of spectra in memory and loop over them.

jorainer commented 4 years ago

I've implemented a compare_spectra_chunk function that realizes peak data in chunks. The performance comparison is shown below:

library(Spectra)
sciex_file <- normalizePath(
    dir(system.file("sciex", package = "msdata"), full.names = TRUE))
sciex_mzr <- backendInitialize(MsBackendMzR(), files = sciex_file)

sps_mzr <- Spectra(sciex_mzr)
sps_mem <- setBackend(sps_mzr, MsBackendDataFrame())

library(microbenchmark)
a <- Spectra:::.compare_spectra(sps_mzr[1:4], sps_mzr)
b <- Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mzr)
all.equal(a, b)
[1] TRUE

microbenchmark(
    Spectra:::.compare_spectra(sps_mzr[1:4], sps_mzr),
    Spectra:::.compare_spectra(sps_mem[1:4], sps_mem),
    Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mzr),
    Spectra:::.compare_spectra_chunk(sps_mem[1:4], sps_mem),
    Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 100),
    Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 500),
    Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 1000),
    times = 5)

Unit: seconds
                                                                      expr
                         Spectra:::.compare_spectra(sps_mzr[1:4], sps_mzr)
                         Spectra:::.compare_spectra(sps_mem[1:4], sps_mem)
                   Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mzr)
                   Spectra:::.compare_spectra_chunk(sps_mem[1:4], sps_mem)
  Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 100)
  Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 500)
 Spectra:::.compare_spectra_chunk(sps_mzr[1:4], sps_mem, chunkSize = 1000)
        min         lq       mean     median         uq        max neval cld
 110.745180 112.933538 114.244213 113.079181 117.004786 117.458378     5   c
  32.986638  33.340157  33.659584  33.755268  34.097984  34.117871     5  b 
   7.173629   7.196736   7.472038   7.565707   7.661782   7.762337     5 a  
   6.533229   6.686634   6.802989   6.702775   6.899326   7.192983     5 a  
   7.097447   7.111048   7.312058   7.346043   7.436461   7.569292     5 a  
   6.441530   6.728765   6.854443   6.762189   6.981145   7.358585     5 a  
   6.942868   7.036270   7.159263   7.073311   7.151368   7.592497     5 a  

Summarizing: chunk processing increases performance considerably and avoids also loading all data into memory.