rformassspectrometry / MsBackendSql

MsBackend adding support to store/access mass spectrometry data in large SQL databases through the Spectra R package.
https://rformassspectrometry.github.io/MsBackendSql/

Test performance on a large data set #2

Open jorainer opened 2 years ago

jorainer commented 2 years ago

Create a backend using various test data sets.

jorainer commented 2 years ago

CCLE data set

A MsqlBackend-compliant database (MariaDB) was created for the full data set, which consists of 504 mzML files with in total 42,592,313 spectra (MS1 and MS2). The total number of mass peaks across all these spectra is 19,381,366,123. A second MariaDB database was created with the option blob = TRUE introduced in PR #4. In contrast to the former, this stores the m/z and intensity values of a spectrum as a BLOB in the database table, reducing the size of the database tables considerably.
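For reference, creating such a database could look like the sketch below. The connection parameters and the `mzml_fls` vector of file names are placeholders, and `createMsBackendSqlDatabase()` follows the current MsBackendSql API (it may have had a different name at the time of this comment).

```r
## Sketch: create a MsqlBackend-compliant MariaDB database from mzML files.
## Connection parameters and mzml_fls are placeholders; blob = TRUE stores
## m/z and intensity values as BLOBs.
library(RMariaDB)
library(MsBackendSql)

con <- dbConnect(MariaDB(), user = "user", dbname = "ccle",
                 host = "localhost")
createMsBackendSqlDatabase(con, x = mzml_fls, blob = TRUE)
```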

Performance is evaluated on some frequently used tasks and compared against a Spectra object with an MsBackendMzR backend. Parallel processing is disabled and both the files and the database are stored on SSDs with ~1 GB/s throughput.

## con is the connection to the MariaDB database
ccle <- Spectra(con, source = MsqlBackend())
## con_blob is the connection to the MariaDB database with m/z and intensity stored as BLOB
ccle_blob <- Spectra(con_blob, source = MsqlBackend())
ccle_mzr <- Spectra(fls, source = MsBackendMzR())

## Size comparison
print(object.size(ccle), units = "MB")
## 162.5 MB
print(object.size(ccle_mzr), units = "MB")
## 13506 MB

Since the MsqlBackend keeps only the primary keys in memory, its size is much smaller than that of the MsBackendMzR-backed Spectra.

Subsetting by retention time range

system.time(
    ccle_rt <- filterRt(ccle, c(100, 110))
)
##   user  system elapsed
## 58.140   1.556 136.230

system.time(
    ccle_blob_rt <- filterRt(ccle_blob, c(100, 110))
)
##   user  system elapsed
## 34.038   1.796 118.031

system.time(
    ccle_mzr_rt <- filterRt(ccle_mzr, c(100, 110))
)
##  user  system elapsed
## 0.674   0.180   0.855

Filtering the MsBackendMzR is much faster since the data resides in memory. Getting the peaks data for these spectra:

length(ccle_mzr_rt)
## [1] 14791

system.time(
    pd <- peaksData(ccle_rt)
)
##  user  system elapsed
## 2.174   0.030  19.202

system.time(
    pd_blob <- peaksData(ccle_blob_rt)
)
##  user  system elapsed
## 0.230   0.026   1.999

system.time(
    pd_mzr <- peaksData(ccle_mzr_rt)
)
##   user  system elapsed
## 65.183   3.825  69.416

Extracting peaks data from an MsqlBackend is faster. The performance is even better if the m/z and intensity values are stored as a BLOB.

Get MS2 spectra with precursor m/z matching a predefined value

pmz <- 940.1510

system.time(
    ccle_pmz <- filterPrecursorMzValues(ccle, pmz[1], tolerance = 0.01)
)
##  user  system elapsed
## 0.251   0.047  14.008

system.time(
    ccle_blob_pmz <- filterPrecursorMzValues(ccle_blob, pmz[1], tolerance = 0.01)
)
##  user  system elapsed
## 0.265   0.000  14.707

system.time(
    ccle_mzr_pmz <- filterPrecursorMzValues(ccle_mzr, pmz[1], tolerance = 0.01)
)
##  user  system elapsed
## 3.041   0.146   3.191

length(ccle_mzr_pmz)
## [1] 1693

Subsetting is again faster if the data is in memory, but extracting peaks data is faster with the database-backed backend.

system.time(
    pd <- peaksData(ccle_pmz)
)
##  user  system elapsed
## 0.233   0.040   6.569

system.time(
    pd_blob <- peaksData(ccle_blob_pmz)
)
##  user  system elapsed
## 0.043   0.003   1.064

system.time(
    pd_mzr <- peaksData(ccle_mzr_pmz)
)
##   user  system elapsed
## 64.462   3.819  69.489

Note that caching e.g. the precursor m/z values locally (in memory) speeds up the filtering operation (but obviously increases the memory demand):

ccle$precursorMz <- precursorMz(ccle)
system.time(
    ccle_pmz <- filterPrecursorMzValues(ccle, pmz, tolerance = 0.01)
)
##  user  system elapsed
## 3.164   0.107   3.274

Conclusion

The MsqlBackend has a very low memory footprint and would also allow remote access to full data sets stored in a single place. Performance for simple subsetting operations is slower than if the data were in memory, but could be improved by partially loading (i.e. caching) data in memory. Storing m/z and intensity values as a BLOB in the database table results in better performance when extracting peaks data from the database. Thus, this option was made the default in PR #4.

jorainer commented 2 years ago

Update: adding additional indices to the database table increases the performance of the filter calls:
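
Such an index can be added with a plain SQL statement via DBI; the table and column names below are assumptions (not taken from this thread) and should be checked against the actual database schema.

```r
## Sketch: add an index on the precursor m/z column to speed up
## filterPrecursorMzValues(). Table/column names are assumptions; check the
## actual schema, e.g. with dbListFields(con_blob, "msms_spectrum").
library(DBI)
dbExecute(con_blob,
          "CREATE INDEX precursor_mz_idx ON msms_spectrum (precursorMz)")
```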

peakRAM(
    ccle_blob_pmz <- filterPrecursorMzValues(ccle_blob, pmz[1], tolerance = 0.01)
)
##                                                             Function_Call
## 1 ccle_blob_pmz<-filterPrecursorMzValues(ccle_blob,pmz[1],tolerance=0.01)
##   Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
## 1            0.315                  0               650

Thus, the filterPrecursorMzValues for MsqlBackend is even faster than for the MsBackendMzR (0.3 compared to 3 seconds)!

jorainer commented 2 years ago

@lgatto , @sgibb, just for information: it seems that storing MS experiments in SQL databases (using the MsqlBackend) might be a nice alternative to conventional mzML file-based storage.

jorainer commented 1 year ago

HILIC Data Set

These are some performance comparisons using our in-house untargeted metabolomics data set:

Tests were run on R version . The CPU was a quad-core 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz.

Compare size

> print(object.size(chris), units = "MB")
4639.7 Mb
> print(object.size(chris_db), units = "MB")
115.9 Mb 

The MsBackendSql keeps only the primary keys in memory and has thus a very low memory footprint.

Subset and filtering

Subset to 1000 random spectra (using indices)

> microbenchmark(
+ chris[idx],
+ chris_db[idx])
Unit: microseconds
          expr     min       lq     mean  median       uq      max neval cld
    chris[idx] 640.182 666.0300 721.2030 683.676 714.8640 3713.326   100   b
 chris_db[idx]  91.875 102.6875 115.2116 114.568 125.5975  177.565   100  a 

Filter based on MS level

> microbenchmark(
+ filterMsLevel(chris, 1L),
+ filterMsLevel(chris_db, 1L),
+ times = 7)
Unit: seconds
                        expr      min       lq      mean   median       uq
    filterMsLevel(chris, 1L) 2.355874 2.403097  2.550644 2.467742 2.660718
 filterMsLevel(chris_db, 1L) 9.629843 9.693403 10.741660 9.796958 9.972104
       max neval cld
  2.903265     7  a 
 16.433808     7   b

Filter by retention time

> microbenchmark(
+ filterRt(chris, rt = c(200, 300)),
+ filterRt(chris_db, rt = c(200, 300)),
+ times = 7)
Unit: milliseconds
                                 expr       min        lq     mean    median
    filterRt(chris, rt = c(200, 300))  953.2141  980.3457 1010.402  989.2629
 filterRt(chris_db, rt = c(200, 300)) 4749.7759 4808.0915 4829.327 4847.0458
       uq      max neval cld
 1038.790 1092.065     7  a 
 4853.335 4885.618     7   b

Subsetting by index is much faster for the MsBackendSql because only the primary keys need to be subset. Filtering by MS level or retention time, on the other hand, is slower because an SQL query needs to be performed, while for the MsBackendMzR subsetting is performed on data already in memory.
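To illustrate why filtering involves a round-trip to the server: a retention time filter on the database backend roughly corresponds to a query like the one below. The table and column names are assumptions, not taken from the package.

```r
## Sketch: roughly what a retention time filter translates to on the
## database side - select the primary keys of matching spectra.
## Table/column names are assumptions.
library(DBI)
ids <- dbGetQuery(
    con,
    "SELECT spectrum_id_ FROM msms_spectrum
     WHERE rtime >= 200 AND rtime <= 300")
```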

Extracting peaks data

In the next tests we extract the MS data (peaksData) from subsets of the data. The tests are performed both on random indices and on sorted random indices. For the MsBackendMzR we run the tests with both parallel (3 workers) and serial processing.

> idx <- sample(seq_along(chris), 10000)
> idx_o <- order(idx)
> microbenchmark(
+ peaksData(chris[idx]),
+ peaksData(chris[idx], BPPARAM = SerialParam()),
+ peaksData(chris_db[idx]),
+ peaksData(chris[idx_o]),
+ peaksData(chris[idx_o], BPPARAM = SerialParam()),
+ peaksData(chris_db[idx_o]),
+ times = 7)
Unit: milliseconds
                                             expr        min         lq
                            peaksData(chris[idx]) 42190.7919 43377.2902
   peaksData(chris[idx], BPPARAM = SerialParam()) 36361.7414 37462.8143
                         peaksData(chris_db[idx])   854.4480   872.4723
                          peaksData(chris[idx_o])  3066.6946  3124.0278
 peaksData(chris[idx_o], BPPARAM = SerialParam())  5225.6786  5359.7452
                       peaksData(chris_db[idx_o])   704.8962   799.1920
       mean     median         uq        max neval   cld
 44075.1709 44183.8126 44922.2583 45552.4946     7     e
 38165.4711 37930.3878 38955.9492 40028.6411     7    d
   921.7522   909.4374   948.7440  1045.9471     7 a
  3210.9612  3191.9740  3278.5428  3412.9186     7  b
  5426.6894  5415.2606  5480.3372  5665.7216     7   c
   804.3953   815.3419   823.9607   864.2233     7 a

Extracting peaks data from the spectra of 3 random files. The MsBackendMzR with parallel processing should benefit most from this setting.

> fls <- sample(unique(chris$dataOrigin), 3)
> idx <- which(chris$dataOrigin %in% fls)
> microbenchmark(
+ peaksData(chris[idx]),
+ peaksData(chris[idx], BPPARAM = SerialParam()),
+ peaksData(chris_db[idx]),
+ times = 13
+ )
Unit: milliseconds
                                           expr       min       lq      mean
                          peaksData(chris[idx]) 1999.4652 2060.613 2099.8230
 peaksData(chris[idx], BPPARAM = SerialParam()) 3204.7828 3253.108 3310.9872
                       peaksData(chris_db[idx])  514.2502  533.412  614.6944
   median        uq      max neval cld
 2077.436 2143.4068 2227.182    13  b 
 3301.743 3328.6203 3443.200    13   c
  548.369  570.3886 1403.045    13 a  

MikeKutz commented 1 year ago

You should probably run any benchmarks a few times to account for any potential DB caching.

jorainer commented 1 year ago

that's a good point - the times parameter in microbenchmark actually defines the number of times each test is run. But maybe it would be fair to disable caching alltogether for the SQL database to do fair tests.