Performance comparison SQLite vs MariaDB

MS data can be stored with the MsBackendSql in any SQL database system supported by R/DBI (i.e. for which a dedicated R package is available). Here I compare performance of accessing MS data stored in either a SQLite and MariaDB database. Some properties:

Both databases (SQLite and MariaDB) stored on the same hard disk/partition (internal nVME disk, thus high data I/O is expected).
LC-MS data from 8,804 samples (mzML) files stored to the databases: in total 15,151,673 spectra.
Size of the SQLite database: 825GB
Size of the MariaDB database: 836GB
MariaDB database uses the Aria storage engine.

mse_maria and mse_sqlite below are two MsExperiment objects with the MS data represented by a MsBackendOfflineSQL backend.

#' Accessing spectra variables:
microbenchmark(msLevel(spectra(mse_maria)),
               msLevel(spectra(mse_sqlite)),
               times = 7)
Unit: seconds
                          expr      min       lq     mean   median       uq      max neval cld
   msLevel(spectra(mse_maria)) 33.54925 33.76008 35.02384 34.78424 35.98299 37.34726     7  a
  msLevel(spectra(mse_sqlite)) 21.64151 21.82196 22.85622 21.97240 22.67517 27.38539     7   b

SQLite is thus about 10 seconds faster extracting MS levels for all spectra.

#' unique MS levels
microbenchmark(uniqueMsLevels(spectra(mse_maria)),
               uniqueMsLevels(spectra(mse_sqlite)),
               times = 7)
 Unit: milliseconds
                                 expr        min         lq       mean     median         uq
   uniqueMsLevels(spectra(mse_maria))   1.542222   1.637237   2.053933   1.667836   2.557647
  uniqueMsLevels(spectra(mse_sqlite)) 256.162693 256.191937 258.977613 257.817538 260.077294
         max neval cld
    2.777705     7  a
  266.324599     7   b

uniqueMsLevels uses a select distinct... call to extract unique MS levels. MariaDB is here by far faster.

#' Filtering the dataset based on retention times: filterRt 200, 300
microbenchmark(filterRt(spectra(mse_maria), rt = c(200, 300)),
               filterRt(spectra(mse_sqlite), rt = c(200, 300)),
               times = 7)
 Unit: seconds
                                             expr      min       lq     mean   median       uq
   filterRt(spectra(mse_maria), rt = c(200, 300)) 2.445303 2.470176 2.477086 2.478997 2.482781
  filterRt(spectra(mse_sqlite), rt = c(200, 300)) 2.242923 2.255827 2.556260 2.264236 2.324919
       max neval cld
  2.509388     7   a
  4.225171     7   a

About the same performance from both. filterRt uses a SQL-based filtering on the "rtime" spectra variable, i.e. performs the filtering within the database.

Next we subset the data to spectra from 10 random samples and evaluate also access to this data subset. Note that in general, for data analysis, the MS data will be processed per sample.

#' Access data from random 10 samples.
set.seed(123)
idx <- sample(seq_along(mse_maria), 10)
mse_maria_sub <- mse_maria[idx]
mse_sqlite_sub <- mse_sqlite[idx]

microbenchmark(msLevel(spectra(mse_maria_sub)),
               msLevel(spectra(mse_sqlite_sub)),
               times = 7)
 Unit: milliseconds
                              expr      min       lq     mean   median       uq      max neval cld
   msLevel(spectra(mse_maria_sub)) 45.60593 45.79210 46.42101 46.29717 46.92385 47.61207     7  a
  msLevel(spectra(mse_sqlite_sub)) 24.49343 25.12833 25.23884 25.17525 25.47306 25.80040     7   b

Again, accessing a single spectra variables is faster with SQLite.

#' Filtering by retention time in the data subset
microbenchmark(filterRt(spectra(mse_maria_sub), rt = c(200, 300)),
               filterRt(spectra(mse_sqlite_sub), rt = c(200, 300)),
               times = 7)
## Unit: milliseconds
##                                                 expr        min        lq       mean     median
##   filterRt(spectra(mse_maria_sub), rt = c(200, 300))   37.27201   38.1075   39.84782   39.81546
##  filterRt(spectra(mse_sqlite_sub), rt = c(200, 300)) 2320.30354 2324.0109 2326.98962 2325.46668
##          uq        max neval cld
##    41.74291   42.14641     7  a
##  2328.67538 2337.78457     7   b

Filtering by retention time within the data subset is much faster using the MariaDB database.

#' Accessing actual peak values: m/z values
microbenchmark(mz(spectra(mse_maria_sub)),
               mz(spectra(mse_sqlite_sub)),
               times = 7)
 Unit: seconds
                         expr      min       lq     mean   median       uq      max neval cld
   mz(spectra(mse_maria_sub)) 1.016480 1.284119 1.378519 1.500377 1.517393 1.529748     7   a
  mz(spectra(mse_sqlite_sub)) 1.037934 1.100142 1.314109 1.101201 1.108407 2.642530     7   a

Performance of accessing peaks data from the data subsets is about the same. At last we compare the performance of a frequently used task for LC-MS data analysis (with the xcms package): extracting the MS data in chromatographic representation. Below we use chromatogram to extract base peak chromatograms of the MS data per sample.

library(xcms)
register(SerialParam())
microbenchmark(chromatogram(mse_maria_sub),
               chromatogram(mse_sqlite_sub),
               times = 7)
 Unit: seconds
                          expr      min       lq     mean   median       uq      max neval cld
   chromatogram(mse_maria_sub) 2.703687 2.812147 3.058470 3.208331 3.255489 3.362001     7   a
  chromatogram(mse_sqlite_sub) 2.373239 2.477671 2.750986 2.609907 3.017549 3.283317     7   a

Performance is comparable. At last we combine that also with a filter for retention times.

microbenchmark(chromatogram(mse_maria_sub, rt = c(25, 40)),
               chromatogram(mse_sqlite_sub, rt = c(25, 40)),
               times = 7)
 Unit: milliseconds
                                           expr       min         lq       mean     median
   chromatogram(mse_maria_sub, rt = c(20, 100))  651.9479   681.5439   708.7386   725.0281
  chromatogram(mse_sqlite_sub, rt = c(20, 100)) 9999.5411 10004.3932 10056.3023 10027.4849
         uq        max neval cld
    736.276   748.5543     7  a
  10097.284 10163.7361     7   b

Here the MariaDB database clearly outperforms the SQLite database. The used SQL query combines both the primary keys of the spectra for the data subset and the retention times of these spectra.

Summary

For most operations both SQLite and MariaDB database engines are about equally performant.
For data access involving more complex queries (i.e. that combine retention time values and primary keys such as for filtering spectra within a subset of samples from the full data set) MariaDB has clear advantages while for plain access of individual spectra variables SQLite is faster.

rformassspectrometry / MsBackendSql

Performance comparison SQLite vs MariaDB #15

Summary