A possibility would be to add a specific parameter for this to any of the user-facing functions: `mz`, `intensity`, `peaksData`, `bin`, `ionCount`, `lengths` etc.
A parameter `chunkSize = integer()` could be used. If `length(chunkSize)` is non-zero, a factor could be created and passed to parameter `f` of the internal `.peaksapply()`, hence forcing split (and eventually parallel) processing. With the default `chunkSize = integer()`, per-file parallel processing would be performed for `Spectra` objects with different levels for `dataStorage` (as is currently the default).
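As a rough illustration (a hypothetical helper, not actual Spectra code), such a factor could be derived from the chunk size along these lines:

```r
## Hypothetical helper: derive a split factor from a chunk size. An
## empty chunkSize yields an empty factor, i.e. the current default
## splitting by dataStorage would be kept.
.chunk_factor <- function(n, chunkSize = integer()) {
    if (length(chunkSize))
        as.factor(ceiling(seq_len(n) / chunkSize))
    else factor()
}
.chunk_factor(10, chunkSize = 4)
## [1] 1 1 1 1 2 2 2 2 3 3
## Levels: 1 2 3
```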
An example for the high memory requirement: determine the number of peaks per spectrum from a `Spectra` using a `MsBackendSql` backend, limiting to 500,000 spectra to avoid running out of memory:
```r
> peakRAM(tmp <- lengths(sps[1:500000]))
               Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 tmp<-lengths(sps[1:5e+05])           63.094                6.4           32650.3
```
Thus, completing this simple operation requires about 32 GB of memory, because the full peaks data is first loaded into memory before the number of peaks is determined.
After some tests and evaluating the code: passing parameters to individual functions might not be the ideal solution (also because some functions, such as `lengths`, don't allow that).
An alternative would be to have a global setting `processingChunkSize` for a `Spectra` that allows the user to set a chunk size for any peaks data access function:
```r
processingChunkSize(sps) <- 10000
```
Any subsequent operation that calculates something on the peaks data will then load and process the data in chunks of 10,000 spectra. We just have to ensure that functions like `combineSpectra`, for which the user defines a factor to split and process the data, ignore this global setting.
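Conceptually, the chunk-wise evaluation would work along the following lines (a simplified sketch, not the actual internal implementation):

```r
## Simplified sketch of chunk-wise evaluation: the spectra indices are
## split into chunks of the configured size and each chunk is loaded
## and processed separately, keeping only the small per-spectrum result.
idx <- seq_along(sps)
chunks <- split(idx, ceiling(idx / 10000))
npeaks <- unlist(lapply(chunks, function(i) lengths(sps[i])),
                 use.names = FALSE)
```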
For some functions the chunk-wise processing needs to be disabled:

- `combineSpectra`: the `Spectra` needs to be split by the factor passed by the user, because these sets of spectra need to be combined. Thus, any eventual chunk size should be ignored. Note that the function has a parameter `p` that allows defining how to split the data for parallel processing (and hence also processes the data in chunks); see the illustrative call after this list.
- `compareSpectra`: uses a different (internal) chunk processing. We need to compare all spectra against all, so separate processing as with the proposed chunk-wise processing would (likely) not work.

This is currently being implemented in the processingChunkSize branch.
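For illustration only (the exact defaults may differ), a `combineSpectra` call in which user-defined factors drive the splitting:

```r
## Illustrative call: f defines which sets of spectra get combined
## (here all spectra from the same original file), p how the data is
## split for parallel processing; any global processingChunkSize
## should be ignored for this operation.
res <- combineSpectra(sps, f = sps$dataOrigin, p = sps$dataStorage)
```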
It seems impossible to check/benchmark the memory usage properly from within R (at least I did not find any option). So, I'll perform some more tests locally and check if/when I run out of memory and how to prevent that.
Here we evaluate how the chunk-wise processing works on a large data set (300 samples, 516,299 spectra). The memory estimates might not be accurate, but we nevertheless report run time and (approximate/estimated) memory usage measured with `peakRAM`. The code was run (within a Docker container) on a computer with 64 GB of main memory (plus an additional 64 GB of swap). As a backend the `MsBackendOfflineSql` was used, with the MS data stored in a SQLite database (file).
Performance is tested on two `Spectra` objects, `a` and `b`, the former using a `MsBackendOfflineSql` without chunk-wise processing, the latter a `MsBackendOfflineSql` with `processingChunkSize` set to 100,000:
```r
> length(a)
[1] 516299
> b <- a
> processingChunkSize(b) <- 100000
```
Next we filter the data, removing peaks with an intensity lower than 300. This operation is just added to the processing queue.
```r
> library(peakRAM)
> peakRAM(
+     a <- filterIntensity(a, intensity = 300),
+     b <- filterIntensity(b, intensity = 300)
+ )
                        Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-filterIntensity(a,intensity=300)            0.127                  0               0.1
2 b<-filterIntensity(b,intensity=300)            0.062                  0               0.1
```
Next we `bin` the data. This retrieves the range of the m/z values from the full data set and then adds the binning function to the processing queue.
```r
> peakRAM(
+     a <- bin(a, binSize = 0.1),
+     b <- bin(b, binSize = 0.2)
+ )
          Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-bin(a,binSize=0.1)          507.889                4.1           58706.8
2 b<-bin(b,binSize=0.2)          301.128                0.8           35783.2
```
Here, the `Spectra` with chunk-wise processing could perform the step faster and (if we can indeed trust the memory estimates) with lower memory usage.
At last we calculate the total ion count (TIC) for each spectrum. This call requires loading the peaks data and executing the processing queue.
```r
> peakRAM(
+     ionCount(a),
+     ionCount(b)
+ )
  Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1   ionCount(a)          552.087                3.8           55898.9
2   ionCount(b)          366.852                3.9           26741.5
```
Chunk-wise processing thus reduced both run time and memory demand. As a side effect, chunk-wise processing also enables parallel processing, since the individual chunks can be processed in parallel.
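For example (a sketch only; Spectra uses BiocParallel for its parallel processing), registering a parallel backend would let the chunks of `b` be processed on multiple workers:

```r
## Sketch: register a parallel BiocParallel backend; with
## processingChunkSize set on b, the chunks can then be
## processed in parallel by the registered workers.
library(BiocParallel)
register(MulticoreParam(workers = 2))
tic <- ionCount(b)
```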
This is now implemented and activated by default. Closing issue.
By using a `MsBackendSql` for a very large MS experiment we ran into memory issues (see issue #303) when processing the data. All operations working on peaks data use the internal `.peaksapply()` function that by default splits the processing based on `dataStorage(x)` and performs the operation in parallel (even if processing might only mean loading the data from the raw data files). For in-memory backends and e.g. the `MsBackendSql`, `dataStorage` will only have a single value, thus no splitting/parallel processing is performed.
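A simplified sketch of that default: the split factor is just the `dataStorage` of the spectra, so file-based backends such as `MsBackendMzR` get one chunk per raw file while SQL-based backends end up with a single chunk:

```r
## Simplified sketch of the default splitting in .peaksapply(): one
## chunk per dataStorage level. For MsBackendMzR this means one chunk
## per raw data file; for MsBackendSql there is a single level, hence
## a single (large) chunk and no parallel processing.
f <- factor(dataStorage(sps))
nlevels(f)
```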
**Advantages of splitting and parallel processing**

These advantages matter for very large data sets. The downside of splitting and parallel processing, however, is the overhead of splitting and combining the data. Thus, in many cases (e.g. also for in-memory backends) it might actually be better not to do it.
Maybe it would make sense to add a parameter, or let the user choose, to enable parallel/split processing even if backends other than `MsBackendMzR` are used.