jorainer opened 2 years ago
A `MsqlBackend`-compliant database (MariaDB) was created for the full data set, which consists of 504 mzML files with in total 42,592,313 spectra (MS1 and MS2). The total number of mass peaks across all these spectra is 19,381,366,123. A second MariaDB database was created with the option `blob = TRUE` introduced in PR #4. In contrast to the former, this stores the m/z and intensity values of a spectrum as a BLOB in the database table, reducing the size of the database tables considerably.
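A minimal sketch of the idea behind `blob = TRUE` (illustration only, not the actual `MsqlBackend` implementation): the full peaks matrix of a spectrum can be serialized into a single raw vector and stored in one BLOB column, instead of one database row per mass peak.

```r
## Sketch only; the actual MsqlBackend serialization may differ.
pks <- cbind(mz = c(100.1, 200.2, 300.3),
             intensity = c(10, 250, 30))

blob <- serialize(pks, connection = NULL)  # raw vector: one BLOB per spectrum
pks2 <- unserialize(blob)                  # restore the full peaks matrix

identical(pks, pks2)
## [1] TRUE
```

One BLOB per spectrum also means the peaks of a spectrum can be fetched with a single-row lookup rather than collecting millions of per-peak rows.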
Performance is evaluated on some frequently used tasks and compared to a `Spectra` object with a `MsBackendMzR` backend. Parallel processing is disabled, and both the mzML files and the database are stored on SSDs with ~1 GB/s throughput.
```r
## con is the connection to the MariaDB database
ccle <- Spectra(con, source = MsqlBackend())
## con_blob is the connection to the MariaDB database with m/z and intensity stored as BLOB
ccle_blob <- Spectra(con_blob, source = MsqlBackend())
ccle_mzr <- Spectra(fls, source = MsBackendMzR())

## Size comparison
print(object.size(ccle), units = "MB")
## 162.5 MB
print(object.size(ccle_mzr), units = "MB")
## 13506 MB
```
Since the `MsqlBackend` keeps only the primary keys in memory, its size is much smaller than that of the `MsBackendMzR`-backed `Spectra`.
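A back-of-the-envelope illustration of why (hypothetical, not code from the package): one 4-byte integer primary key per spectrum is essentially all the backend has to hold in memory.

```r
## Hypothetical illustration: one integer key for each of the
## 42,592,313 spectra in the data set.
n <- 42592313L
keys <- rep.int(1L, n)   # stand-in for a materialized key vector
print(object.size(keys), units = "MB")
## roughly 162.5 MB - in line with the object.size() of ccle above
```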
```r
system.time(
    ccle_rt <- filterRt(ccle, c(100, 110))
)
##    user  system elapsed
##  58.140   1.556 136.230
system.time(
    ccle_blob_rt <- filterRt(ccle_blob, c(100, 110))
)
##    user  system elapsed
##  34.038   1.796 118.031
system.time(
    ccle_mzr_rt <- filterRt(ccle_mzr, c(100, 110))
)
##    user  system elapsed
##   0.674   0.180   0.855
```
Filtering the `MsBackendMzR` is way faster since the data resides in memory. Getting the peaks data for these spectra:
```r
length(ccle_mzr_rt)
## [1] 14791
system.time(
    pd <- peaksData(ccle_rt)
)
##   user  system elapsed
##  2.174   0.030  19.202
system.time(
    pd_blob <- peaksData(ccle_blob_rt)
)
##   user  system elapsed
##  0.230   0.026   1.999
system.time(
    pd_mzr <- peaksData(ccle_mzr_rt)
)
##   user  system elapsed
## 65.183   3.825  69.416
```
Extracting peaks data from an `MsqlBackend` is faster. The performance is even better if the m/z and intensity values are stored as a BLOB.
```r
pmz <- 940.1510
system.time(
    ccle_pmz <- filterPrecursorMzValues(ccle, pmz[1], tolerance = 0.01)
)
##   user  system elapsed
##  0.251   0.047  14.008
system.time(
    ccle_blob_pmz <- filterPrecursorMzValues(ccle_blob, pmz[1], tolerance = 0.01)
)
##   user  system elapsed
##  0.265   0.000  14.707
system.time(
    ccle_mzr_pmz <- filterPrecursorMzValues(ccle_mzr, pmz[1], tolerance = 0.01)
)
##   user  system elapsed
##  3.041   0.146   3.191
length(ccle_mzr_pmz)
## [1] 1693
```
Subsetting is again faster if the data is in memory, but extracting peaks data is faster with the database-backed backend.
```r
system.time(
    pd <- peaksData(ccle_pmz)
)
##   user  system elapsed
##  0.233   0.040   6.569
system.time(
    pd_blob <- peaksData(ccle_blob_pmz)
)
##   user  system elapsed
##  0.043   0.003   1.064
system.time(
    pd_mzr <- peaksData(ccle_mzr_pmz)
)
##   user  system elapsed
## 64.462   3.819  69.489
```
Note that caching, e.g., the precursor m/z values locally (in memory) speeds up the filtering operation (but will obviously increase the memory demand):
```r
ccle$precursorMz <- precursorMz(ccle)
system.time(
    ccle_pmz <- filterPrecursorMzValues(ccle, pmz, tolerance = 0.01)
)
##   user  system elapsed
##  3.164   0.107   3.274
```
The `MsqlBackend` has a very low memory footprint and would also allow remote access to full data sets stored in a single place. Performance for simple subsetting operations is slower than with the data in memory, but could be improved by partially loading the data into memory (i.e. caching). Storing the m/z and intensity values as a BLOB in the database table results in better performance when extracting peaks data from the database. Thus, this option was made the default in PR #4.
Update: by adding additional indices to the database table, the performance of the filter calls can be increased:
```r
peakRAM(
    ccle_blob_pmz <- filterPrecursorMzValues(ccle_blob, pmz[1], tolerance = 0.01)
)
##                                                             Function_Call
## 1 ccle_blob_pmz<-filterPrecursorMzValues(ccle_blob,pmz[1],tolerance=0.01)
##   Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
## 1            0.315                  0               650
```
Thus, `filterPrecursorMzValues()` for the `MsqlBackend` is even faster than for the `MsBackendMzR` (0.3 compared to 3 seconds)!
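The exact statements depend on the database schema; as a sketch, with a hypothetical spectra table `msms_spectrum` and columns `precursorMz` and `rtime` (names assumed for illustration, not taken from the package), such indices could look like:

```sql
-- Hypothetical table/column names; adapt to the actual MsqlBackend schema.
-- An index on the precursor m/z column turns the range condition of
-- filterPrecursorMzValues() into an index range scan instead of a full scan.
CREATE INDEX idx_precursor_mz ON msms_spectrum (precursorMz);

-- Analogous index for retention time based filtering (filterRt()):
CREATE INDEX idx_rtime ON msms_spectrum (rtime);
```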
@lgatto, @sgibb, just for information: it seems that storing MS experiments in SQL databases (using the `MsqlBackend`) might be a nice alternative to conventional mzML file-based storage.
These are some performance comparisons using our in-house untargeted metabolomics data set:
Tests were run on R version . CPU was a quad core 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz.
```r
> print(object.size(chris), units = "MB")
4639.7 Mb
> print(object.size(chris_db), units = "MB")
115.9 Mb
```
The `MsBackendSql` keeps only the primary keys in memory and thus has a very low memory footprint.
Subset to 1000 random spectra (using indices)
```r
> microbenchmark(
+     chris[idx],
+     chris_db[idx])
Unit: microseconds
          expr     min       lq     mean  median       uq      max neval cld
    chris[idx] 640.182 666.0300 721.2030 683.676 714.8640 3713.326   100   b
 chris_db[idx]  91.875 102.6875 115.2116 114.568 125.5975  177.565   100  a
```
Filter based on MS level
```r
> microbenchmark(
+     filterMsLevel(chris, 1L),
+     filterMsLevel(chris_db, 1L),
+     times = 7)
Unit: seconds
                        expr      min       lq      mean   median        uq       max neval cld
    filterMsLevel(chris, 1L) 2.355874 2.403097  2.550644 2.467742  2.660718  2.903265     7  a
 filterMsLevel(chris_db, 1L) 9.629843 9.693403 10.741660 9.796958  9.972104 16.433808     7   b
```
Filter by retention time
```r
> microbenchmark(
+     filterRt(chris, rt = c(200, 300)),
+     filterRt(chris_db, rt = c(200, 300)),
+     times = 7)
Unit: milliseconds
                                 expr       min        lq     mean    median       uq      max neval cld
    filterRt(chris, rt = c(200, 300))  953.2141  980.3457 1010.402  989.2629 1038.790 1092.065     7  a
 filterRt(chris_db, rt = c(200, 300)) 4749.7759 4808.0915 4829.327 4847.0458 4853.335 4885.618     7   b
```
Subsetting by index is much faster for the `MsBackendSql` because only the primary keys need to be subset. Filtering by MS level or retention time, on the other hand, is slower because an SQL query needs to be performed, while for the `MsBackendMzR` subsetting is performed on the data already in memory.
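For illustration (hypothetical table and column names; the real schema may differ), a call like `filterRt(chris_db, rt = c(200, 300))` has to be answered by a range query of roughly this shape, returning the matching primary keys:

```sql
-- Sketch of the range query behind a retention time filter; without an
-- index on rtime this requires a full table scan.
SELECT spectrum_id
  FROM msms_spectrum
 WHERE rtime >= 200 AND rtime <= 300;
```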
In the next tests we extract the MS data (`peaksData()`) from subsets of the data. The tests are performed both on random indices and on sorted random indices. For the `MsBackendMzR` we run the tests both with parallel processing (3 workers) and with serial processing.
```r
> idx <- sample(seq_along(chris), 10000)
> idx_o <- order(idx)
> microbenchmark(
+     peaksData(chris[idx]),
+     peaksData(chris[idx], BPPARAM = SerialParam()),
+     peaksData(chris_db[idx]),
+     peaksData(chris[idx_o]),
+     peaksData(chris[idx_o], BPPARAM = SerialParam()),
+     peaksData(chris_db[idx_o]),
+     times = 7)
Unit: milliseconds
                                             expr        min         lq       mean     median         uq        max neval cld
                            peaksData(chris[idx]) 42190.7919 43377.2902 44075.1709 44183.8126 44922.2583 45552.4946     7     e
   peaksData(chris[idx], BPPARAM = SerialParam()) 36361.7414 37462.8143 38165.4711 37930.3878 38955.9492 40028.6411     7    d
                         peaksData(chris_db[idx])   854.4480   872.4723   921.7522   909.4374   948.7440  1045.9471     7 a
                          peaksData(chris[idx_o])  3066.6946  3124.0278  3210.9612  3191.9740  3278.5428  3412.9186     7  b
 peaksData(chris[idx_o], BPPARAM = SerialParam())  5225.6786  5359.7452  5426.6894  5415.2606  5480.3372  5665.7216     7   c
                       peaksData(chris_db[idx_o])   704.8962   799.1920   804.3953   815.3419   823.9607   864.2233     7 a
```
Extracting peaks data from the spectra of 3 random files. The `MsBackendMzR` with parallel processing should benefit most from this.
```r
> fls <- sample(unique(chris$dataOrigin), 3)
> idx <- which(chris$dataOrigin %in% fls)
> microbenchmark(
+     peaksData(chris[idx]),
+     peaksData(chris[idx], BPPARAM = SerialParam()),
+     peaksData(chris_db[idx]),
+     times = 13
+ )
Unit: milliseconds
                                           expr       min      lq      mean   median        uq      max neval cld
                          peaksData(chris[idx]) 1999.4652 2060.613 2099.8230 2077.436 2143.4068 2227.182    13  b
 peaksData(chris[idx], BPPARAM = SerialParam()) 3204.7828 3253.108 3310.9872 3301.743 3328.6203 3443.200    13   c
                       peaksData(chris_db[idx])  514.2502  533.412  614.6944  548.369  570.3886 1403.045    13 a
```
The `MsBackendSql` outperforms the `MsBackendMzR` in extracting peaks data.

You should probably run any benchmarks a few times to account for any potential DB caching.
That's a good point - the `times` parameter in `microbenchmark()` actually defines the number of times each test is run. But maybe it would be fair to disable caching altogether for the SQL database to get fair tests.
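For MariaDB versions that still ship the query cache, one way to take it out of the equation (server-level settings; note that the OS page cache and the InnoDB buffer pool would still warm up between runs):

```sql
-- Disable the MariaDB query cache for the benchmark server ...
SET GLOBAL query_cache_type = OFF;
SET GLOBAL query_cache_size = 0;

-- ... or clear it between individual runs:
RESET QUERY CACHE;
```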
Create a backend using various test data sets.