rformassspectrometry / MsBackendTimsTof

Spectra backend supporting TimsTOF data files using the opentimsr package.
https://rformassspectrometry.github.io/MsBackendTimsTof/

Evaluate performance of serial vs parallel processing #16

Closed jorainer closed 2 years ago

jorainer commented 2 years ago

Check if parallel processing is possible with opentimsr/MsBackendTimsTof and compare performance against serial processing.

jorainer commented 2 years ago

opentimsr has its own parallel processing setup (opentims_set_threads) which clashes with BiocParallel-based parallel processing. So, we either perform the processing in parallel by file with BiocParallel (and need to disable opentimsr parallel processing) or we perform it in serial with opentimsr parallel processing enabled.
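The two options could be sketched roughly as follows. This is only an illustration under assumptions: `read_one_file()` and `set_tims_threads()` are hypothetical helpers that are not part of either package, the file names are placeholders, and `opentims_set_threads()` is only called if opentimsr happens to be installed.

```r
library(BiocParallel)

## Hypothetical stand-in for the per-file import. The real code would read
## the peak data of one TimsTOF .d folder through opentimsr.
read_one_file <- function(f) {
    nchar(f)
}

## Hypothetical helper: set the number of opentimsr threads, but only if
## the opentimsr package is available.
set_tims_threads <- function(n) {
    if (requireNamespace("opentimsr", quietly = TRUE))
        opentimsr::opentims_set_threads(n)
    invisible(n)
}

fls <- c("a.d", "bb.d")  # placeholder file names, not real data

## Option 1: parallel by file with BiocParallel, opentimsr single-threaded.
set_tims_threads(1)
res_by_file <- bplapply(fls, read_one_file, BPPARAM = MulticoreParam(2))

## Option 2: files processed in serial, opentimsr threading enabled.
set_tims_threads(2)
res_serial <- bplapply(fls, read_one_file, BPPARAM = SerialParam())
```

Both calls return one list element per file. Only the scheduling differs, so the results should be identical.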

jorainer commented 2 years ago

It seems that parallel processing provides a small benefit when large data files are processed:

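library(Spectra)
library(MsBackendTimsTof)
library(opentimsr)
library(BiocParallel)
library(peakRAM)

## Two TimsTOF data folders used for the comparison
fls <- c("TimsTOF/Methanolpos-1-TIMS_108_1_2007.d",
         "TimsTOF/SRM1950_20min_88_01_6950.d")
be <- backendInitialize(MsBackendTimsTof(), fls)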

peakRAM(
{
    opentims_set_threads(1)
    MsBackendTimsTof:::.get_tims_columns(be, columns = c("mz", "intensity"))
},
{
    opentims_set_threads(2)
    MsBackendTimsTof:::.get_tims_columns(be, columns = c("mz", "intensity"))
},
{
    opentims_set_threads(1)
    MsBackendTimsTof:::.get_tims_columns_p(be, columns = c("mz", "intensity"), BPPARAM = SerialParam())
},
{
    opentims_set_threads(2)
    MsBackendTimsTof:::.get_tims_columns_p(be, columns = c("mz", "intensity"), BPPARAM = SerialParam())
},
MsBackendTimsTof:::.get_tims_columns_p(be, columns = c("mz", "intensity"), BPPARAM = MulticoreParam(2))
)

                                                                                                          Function_Call
1                         {opentims_set_threads(1)MsBackendTimsTof:::.get_tims_columns(be,columns=c("mz","intensity"))}
2                         {opentims_set_threads(2)MsBackendTimsTof:::.get_tims_columns(be,columns=c("mz","intensity"))}
3 {opentims_set_threads(1)MsBackendTimsTof:::.get_tims_columns_p(be,columns=c("mz","intensity"),BPPARAM=SerialParam())}
4 {opentims_set_threads(2)MsBackendTimsTof:::.get_tims_columns_p(be,columns=c("mz","intensity"),BPPARAM=SerialParam())}
5                      MsBackendTimsTof:::.get_tims_columns_p(be,columns=c("mz","intensity"),BPPARAM=MulticoreParam(2))
  Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1          108.674             1662.1            4507.6
2          105.141             1628.1            5559.2
3          109.299             1628.2            5457.3
4          106.240             1628.1            5457.2
5           80.762             1628.1            3306.4

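So, the last call is the only one that uses parallel processing on a per-file basis. This per-file parallel processing has clear advantages over the built-in parallel processing of opentimsr: it is faster (about 81 s compared to 105-109 s) and has the lowest peak memory usage (about 3306 MiB). Enabling the opentimsr parallel processing, in contrast, yields only very little benefit (about 3 s with 2 threads instead of 1).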
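To put the timings in relation to each other, the values from the table above can be compared against the first call. This is a small sketch that only re-uses the reported numbers. The variable names are chosen here for illustration.

```r
## Elapsed time (s) and peak RAM (MiB) copied from the peakRAM output above
elapsed  <- c(108.674, 105.141, 109.299, 106.240, 80.762)
peak_ram <- c(4507.6, 5559.2, 5457.3, 5457.2, 3306.4)

## Speed-up and peak memory of each call relative to call 1
## (serial processing, 1 opentimsr thread)
speedup   <- elapsed[1] / elapsed
rel_peak  <- peak_ram / peak_ram[1]

data.frame(call = 1:5,
           speedup = round(speedup, 2),
           rel_peak_ram = round(rel_peak, 2))
```

A speed-up above 1 means the call was faster than call 1. A relative peak RAM below 1 means it needed less memory at its peak.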

I will now fix some internal things and replace the for-loop-based processing with the parallel version.

jorainer commented 2 years ago

Just realized that parallel processing in the backend makes no sense - parallel processing is taken care of by the Spectra object. Thus, the only thing we need is to disable opentimsr parallel processing to not interfere with BiocParallel.