MS data can be stored with the MsBackendSql in any SQL database system supported by R/DBI (i.e. for which a dedicated R package is available). Here I compare performance of accessing MS data stored in either a SQLite and MariaDB database. Some properties:
Both databases (SQLite and MariaDB) stored on the same hard disk/partition (internal nVME disk, thus high data I/O is expected).
LC-MS data from 8,804 samples (mzML) files stored to the databases: in total 15,151,673 spectra.
Size of the SQLite database: 825GB
Size of the MariaDB database: 836GB
MariaDB database uses the Aria storage engine.
mse_maria and mse_sqlite below are two MsExperiment objects with the MS data represented by a MsBackendOfflineSQL backend.
#' Accessing spectra variables:
microbenchmark(msLevel(spectra(mse_maria)),
msLevel(spectra(mse_sqlite)),
times = 7)
Unit: seconds
expr min lq mean median uq max neval cld
msLevel(spectra(mse_maria)) 33.54925 33.76008 35.02384 34.78424 35.98299 37.34726 7 a
msLevel(spectra(mse_sqlite)) 21.64151 21.82196 22.85622 21.97240 22.67517 27.38539 7 b
SQLite is thus about 10 seconds faster extracting MS levels for all spectra.
#' unique MS levels
microbenchmark(uniqueMsLevels(spectra(mse_maria)),
uniqueMsLevels(spectra(mse_sqlite)),
times = 7)
Unit: milliseconds
expr min lq mean median uq
uniqueMsLevels(spectra(mse_maria)) 1.542222 1.637237 2.053933 1.667836 2.557647
uniqueMsLevels(spectra(mse_sqlite)) 256.162693 256.191937 258.977613 257.817538 260.077294
max neval cld
2.777705 7 a
266.324599 7 b
uniqueMsLevels uses a select distinct... call to extract unique MS levels. MariaDB is here by far faster.
#' Filtering the dataset based on retention times: filterRt 200, 300
microbenchmark(filterRt(spectra(mse_maria), rt = c(200, 300)),
filterRt(spectra(mse_sqlite), rt = c(200, 300)),
times = 7)
Unit: seconds
expr min lq mean median uq
filterRt(spectra(mse_maria), rt = c(200, 300)) 2.445303 2.470176 2.477086 2.478997 2.482781
filterRt(spectra(mse_sqlite), rt = c(200, 300)) 2.242923 2.255827 2.556260 2.264236 2.324919
max neval cld
2.509388 7 a
4.225171 7 a
About the same performance from both. filterRt uses a SQL-based filtering on the "rtime" spectra variable, i.e. performs the filtering within the database.
Next we subset the data to spectra from 10 random samples and evaluate also access to this data subset. Note that in general, for data analysis, the MS data will be processed per sample.
#' Access data from random 10 samples.
set.seed(123)
idx <- sample(seq_along(mse_maria), 10)
mse_maria_sub <- mse_maria[idx]
mse_sqlite_sub <- mse_sqlite[idx]
microbenchmark(msLevel(spectra(mse_maria_sub)),
msLevel(spectra(mse_sqlite_sub)),
times = 7)
Unit: milliseconds
expr min lq mean median uq max neval cld
msLevel(spectra(mse_maria_sub)) 45.60593 45.79210 46.42101 46.29717 46.92385 47.61207 7 a
msLevel(spectra(mse_sqlite_sub)) 24.49343 25.12833 25.23884 25.17525 25.47306 25.80040 7 b
Again, accessing a single spectra variables is faster with SQLite.
#' Filtering by retention time in the data subset
microbenchmark(filterRt(spectra(mse_maria_sub), rt = c(200, 300)),
filterRt(spectra(mse_sqlite_sub), rt = c(200, 300)),
times = 7)
## Unit: milliseconds
## expr min lq mean median
## filterRt(spectra(mse_maria_sub), rt = c(200, 300)) 37.27201 38.1075 39.84782 39.81546
## filterRt(spectra(mse_sqlite_sub), rt = c(200, 300)) 2320.30354 2324.0109 2326.98962 2325.46668
## uq max neval cld
## 41.74291 42.14641 7 a
## 2328.67538 2337.78457 7 b
Filtering by retention time within the data subset is much faster using the MariaDB database.
#' Accessing actual peak values: m/z values
microbenchmark(mz(spectra(mse_maria_sub)),
mz(spectra(mse_sqlite_sub)),
times = 7)
Unit: seconds
expr min lq mean median uq max neval cld
mz(spectra(mse_maria_sub)) 1.016480 1.284119 1.378519 1.500377 1.517393 1.529748 7 a
mz(spectra(mse_sqlite_sub)) 1.037934 1.100142 1.314109 1.101201 1.108407 2.642530 7 a
Performance of accessing peaks data from the data subsets is about the same. At last we compare the performance of a frequently used task for LC-MS data analysis (with the xcms package): extracting the MS data in chromatographic representation. Below we use chromatogram to extract base peak chromatograms of the MS data per sample.
library(xcms)
register(SerialParam())
microbenchmark(chromatogram(mse_maria_sub),
chromatogram(mse_sqlite_sub),
times = 7)
Unit: seconds
expr min lq mean median uq max neval cld
chromatogram(mse_maria_sub) 2.703687 2.812147 3.058470 3.208331 3.255489 3.362001 7 a
chromatogram(mse_sqlite_sub) 2.373239 2.477671 2.750986 2.609907 3.017549 3.283317 7 a
Performance is comparable. At last we combine that also with a filter for retention times.
microbenchmark(chromatogram(mse_maria_sub, rt = c(25, 40)),
chromatogram(mse_sqlite_sub, rt = c(25, 40)),
times = 7)
Unit: milliseconds
expr min lq mean median
chromatogram(mse_maria_sub, rt = c(20, 100)) 651.9479 681.5439 708.7386 725.0281
chromatogram(mse_sqlite_sub, rt = c(20, 100)) 9999.5411 10004.3932 10056.3023 10027.4849
uq max neval cld
736.276 748.5543 7 a
10097.284 10163.7361 7 b
Here the MariaDB database clearly outperforms the SQLite database. The used SQL query combines both the primary keys of the spectra for the data subset and the retention times of these spectra.
Summary
For most operations both SQLite and MariaDB database engines are about equally performant.
For data access involving more complex queries (i.e. that combine retention time values and primary keys such as for filtering spectra within a subset of samples from the full data set) MariaDB has clear advantages while for plain access of individual spectra variables SQLite is faster.
MS data can be stored with the MsBackendSql in any SQL database system supported by R/DBI (i.e. for which a dedicated R package is available). Here I compare performance of accessing MS data stored in either a SQLite and MariaDB database. Some properties:
mse_maria
andmse_sqlite
below are twoMsExperiment
objects with the MS data represented by aMsBackendOfflineSQL
backend.SQLite
is thus about 10 seconds faster extracting MS levels for all spectra.uniqueMsLevels
uses aselect distinct...
call to extract unique MS levels. MariaDB is here by far faster.About the same performance from both.
filterRt
uses a SQL-based filtering on the"rtime"
spectra variable, i.e. performs the filtering within the database.Next we subset the data to spectra from 10 random samples and evaluate also access to this data subset. Note that in general, for data analysis, the MS data will be processed per sample.
Again, accessing a single spectra variables is faster with SQLite.
Filtering by retention time within the data subset is much faster using the MariaDB database.
Performance of accessing peaks data from the data subsets is about the same. At last we compare the performance of a frequently used task for LC-MS data analysis (with the xcms package): extracting the MS data in chromatographic representation. Below we use
chromatogram
to extract base peak chromatograms of the MS data per sample.Performance is comparable. At last we combine that also with a filter for retention times.
Here the MariaDB database clearly outperforms the SQLite database. The used SQL query combines both the primary keys of the spectra for the data subset and the retention times of these spectra.
Summary