Closed: jorainer closed this issue 1 year ago
A cheap immediate solution would be to use `MulticoreParam(1L)` and to provide, with parameter `f`, a factor that splits the `Spectra` (or rather the `MsBackend`) into chunks. With this, only the `peaksData` of one chunk is loaded at a time. Still, it would be better to have a dedicated function (or parameter) that defines the chunk-wise processing (eventually with additional parallel processing within each chunk?).
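The workaround above can be sketched as follows; this assumes the `f` (splitting factor) and `BPPARAM` parameters of `spectrapply()`, and the exact usage may differ:

```r
## Sketch of the suggested workaround: split the Spectra into chunks via
## parameter `f` and process them serially with MulticoreParam(1L), so
## only the peaks data of one chunk is in memory at a time.
library(Spectra)
library(BiocParallel)

f <- as.factor(ceiling(seq_along(sps) / 5000))   # chunks of 5000 spectra
res <- spectrapply(sps, lengths, f = f, BPPARAM = MulticoreParam(1L))
```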
Added a dedicated `chunkapply` function that splits an arbitrary input object `x` into chunks of size `chunkSize` and applies a function `FUN` to each chunk. This function is also called by `spectrapply,Spectra` if parameter `chunkSize` is provided.
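A minimal, self-contained sketch of such a chunk-wise apply (hypothetical; the actual `chunkapply` implementation in Spectra may differ in interface and in how results are combined):

```r
## Hypothetical sketch of chunk-wise processing of an arbitrary object x.
chunkapply_sketch <- function(x, FUN, ..., chunkSize = 10000L) {
    chunk <- ceiling(seq_along(x) / chunkSize)       # chunk id per element
    res <- lapply(split(seq_along(x), chunk),
                  function(i) FUN(x[i], ...))        # one chunk at a time
    unlist(res, use.names = FALSE)                   # combine chunk results
}

## Example: number of peaks per spectrum, loading one chunk at a time
## res <- chunkapply_sketch(sps, lengths, chunkSize = 5000)
```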
Evaluation of the function was performed on a data set with 500,000 spectra, using either a `MsBackendSql` or a `MsBackendMzR` backend.

`Spectra` with a `MsBackendSql` backend:
```
> peakRAM(
+     res <- lengths(sps)
+ )
      Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 res<-lengths(sps)         2139.993                9.7           24099.9
```
```
> peakRAM(
+     res <- chunkapply(sps, lengths, chunkSize = 5000)
+ )
                                Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 res<-chunkapply(sps,lengths,chunkSize=5000)           47.549                  0            3018.2
```
Memory usage is thus much lower with `chunkapply`. The same test, on the same data set, but with a `MsBackendMzR` backend:
```
> peakRAM(
+     res <- lengths(sps)
+ )
      Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 res<-lengths(sps)          252.199                3.3           23963.7
```
```
> peakRAM(
+     res <- chunkapply(sps, lengths, chunkSize = 5000)
+ )
                                Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 res<-chunkapply(sps,lengths,chunkSize=5000)          254.881                  0              5309
```
For large data sets it becomes inefficient (or even impossible) to run any function that processes the `peaksData`, because the full peak data is loaded into memory. One example is the `lengths` function, which simply reports the number of peaks per spectrum. Internally it (like many others) uses `.peaksapply` to ensure that any processing steps are applied to the peaks data. `.peaksapply` supports parallel processing, which reduces the memory demand (only the peaks data of the currently processed files is loaded), but some backends (such as `MsBackendSql`) don't support parallel processing, and hence the full data is loaded and processed at once. We should evaluate whether chunk-wise processing of the data would be possible instead of, or in addition to, the current approach.