rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/

"chunkyfied" processing #249

Closed jorainer closed 1 year ago

jorainer commented 1 year ago

For large data sets it becomes inefficient (or even impossible) to run any function that processes the peaksData, because the full peak data is loaded into memory. One example is the lengths function, which simply reports the number of peaks per spectrum. Internally (like many others) it uses .peaksapply to ensure that any processing steps are applied to the peaks data. That function supports parallel processing, which reduces the memory demand (only the peak data of the currently processed files is loaded), but some backends (such as MsBackendSql) don't support parallel processing, and hence the full data would be loaded and processed at once. We should evaluate whether chunk-wise processing of the data would be possible instead of, or in addition to, parallel processing.

jorainer commented 1 year ago

A cheap immediate solution would be to use MulticoreParam(1L) and to provide, with parameter f, a factor that splits the Spectra (or rather the MsBackend) into chunks. With this, the peaksData of only one chunk is loaded at a time.
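For illustration, such a splitting factor could be built as below (a sketch with example numbers; `n` and `chunkSize` are made up here, and the exact function accepting `f` depends on the Spectra API):

```r
## Sketch: build a grouping factor assigning each of n spectra to a chunk
## of at most chunkSize elements. Passed as `f` together with
## MulticoreParam(1L), only one chunk's peak data would be loaded at a time.
n <- 500000L          # number of spectra (example value)
chunkSize <- 5000L
f <- as.factor(ceiling(seq_len(n) / chunkSize))
nlevels(f)            # number of chunks: 100
```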

Still, it would be better to have a dedicated function (or parameter) to define the chunk-wise processing (possibly with additional parallel processing within each chunk?).

jorainer commented 1 year ago

Added a dedicated chunkapply function that splits an arbitrary input object x into chunks of size chunkSize and applies a function FUN to each chunk. This function is also called by spectrapply,Spectra if parameter chunkSize is provided.
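The idea can be sketched in a few lines of base R (a simplified illustration only; the actual chunkapply in Spectra is more general, e.g. with respect to additional arguments and result combination):

```r
## Simplified sketch of chunk-wise processing (illustration, not the
## real implementation): split the index space into chunks and apply
## FUN to one chunk at a time.
chunkapply <- function(x, FUN, ..., chunkSize = 10000L) {
    idx <- seq_along(x)
    ## grouping of indices into chunks of at most chunkSize elements
    chunks <- split(idx, ceiling(idx / chunkSize))
    ## process one chunk at a time; only that chunk's data is needed
    ## in memory at any point
    res <- lapply(chunks, function(i) FUN(x[i], ...))
    unlist(res, use.names = FALSE)
}
```

With a Spectra object, the subsetting `x[i]` would load only the peak data of that chunk, so peak memory usage stays bounded by the chunk size.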

Evaluation of the function was performed on a data set with 500,000 spectra and using either a MsBackendSql or MsBackendMzR backend.

Spectra with a MsBackendSql backend:

> peakRAM(
+ res <- lengths(sps)
+ )
      Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 res<-lengths(sps)         2139.993                9.7           24099.9
> peakRAM(
+ res <- chunkapply(sps, lengths, chunkSize = 5000)
+ )
                                Function_Call Elapsed_Time_sec
1 res<-chunkapply(sps,lengths,chunkSize=5000)           47.549
  Total_RAM_Used_MiB Peak_RAM_Used_MiB
1                  0            3018.2

Memory usage is thus much lower with chunkapply (a peak of ~3 GiB instead of ~24 GiB), and for this backend the run time also drops from ~2140 to ~48 seconds. The same test using the same data set but with a MsBackendMzR backend:

> peakRAM(
+ res <- lengths(sps)
+ )
      Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 res<-lengths(sps)          252.199                3.3           23963.7
> peakRAM(
+ res <- chunkapply(sps, lengths, chunkSize = 5000)
+ )
                                Function_Call Elapsed_Time_sec
1 res<-chunkapply(sps,lengths,chunkSize=5000)          254.881
  Total_RAM_Used_MiB Peak_RAM_Used_MiB
1                  0              5309