rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
34 stars 24 forks source link

Memory issues and (potential) chunk-wise processing of peaks data #304

Closed jorainer closed 6 months ago

jorainer commented 7 months ago

By using a MsBackendSql for a very large MS experiment we ran into memory issues (see issue #303 ) processing the data. All operations working on peaks data use the internal .peaksapply function that by default splits the processing based on dataStorage(x) and performs the operation in parallel (even if processing might only mean loading the data from the raw data files). For in memory backends and the e.g. MsBackendSql, dataStorage will only have a single value, thus no splitting/parallel processing is performed.

Advantages of splitting and parallel processing

These advantages are good for very large data sets. The downside of splitting and parallel processing is however also the overhead of splitting and combining the data. Thus, in many cases (e.g. also for in-memory backends) it might actually be good to not do it.

Maybe it would make sense to add a parameter or let the user choose to enable parallel/split processing, even if backends other than MsBackendMzR are used.

jorainer commented 7 months ago

A possibility would be to add a specific parameter for this to any of the user-facing functions: mz, intensity, peaksData, bin, ionCount, lengths etc.

A parameter chunkSize = integer() could be used. If length(chunkSize), a factor could be created to be passed to parameter f of the internal .peaksapply, hence forcing to perform splitted (eventually parallel) processing. With the default chunkSize = integer() per-file parallel processing would be performed for Spectra with different levels for dataStorage (as it is currently the default).

jorainer commented 7 months ago

An example for the high memory requirement: determine the number of peaks per spectrum from a Spectra using a MsBackendSql backend - limiting to 500,000 spectra to avoid running out of memory:

> peakRAM(tmp <- lengths(sps[1:500000]))
               Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 tmp<-lengths(sps[1:5e+05])           63.094                6.4           32650.3

Thus, to complete this simple operation 32GB of memory are required, because the full peaks data is first loaded into memory before the number of peaks is determined.

jorainer commented 7 months ago

After some tests and evaluating the code: passing parameters to individual functions might not be the ideal solution (also because some functions don't allow that, such as lengths).

An alternative would be to have a global setting processingChunkSize for a Spectra that allows the user to set a chunk size for any peaks data access function:

processingChunkSize(sps) <- 10000

any subsequent operation that calculates something on the peaks data will then load and process the data in chunks of 10000. We just have to ensure that functions like combineSpectra for which the user defines a factor to split and process the data ignore this global setting.

jorainer commented 7 months ago

For some functions the chunk-wise processing needs to be disabled:

jorainer commented 7 months ago

This is currently being implemented in the processingChunkSize branch.

jorainer commented 7 months ago

Seems impossible to check/benchmark the memory usage properly from within R (at least I did not find any option). So, I'll perform some more tests locally and check if/when I'm running out of memory and how to prevent that.

jorainer commented 7 months ago

Evaluation and performance comparison of chunk-wise processing

Here we evaluate how the chunk-wise processing works on a large data set (300 samples, 516,299 spectra). Memory estimation might not be correct, still we report here run time and (approximated/estimated) memory usage using peakRAM. The code was run (within a docker container) on a computer with 64GB of main memory (additional 64GB of swap memory). As a backend the MsBackendOfflineSql was used and the MS data was stored in a SQLite database (file).

Performance on two Spectra objects is tested: a and b, the former using a MsBackendOfflineSql without chunk-wise processing, the latter an MsBackendOfflineSql with processingChunkSize set to 100,000:

> length(a)
[1] 516299
> b <- a
> processingChunkSize(b) <- 100000

Next we filter the data removing peaks with an intensity lower 300 - this operation is just added to the processing queue.

> library(peakRAM)
> peakRAM(
+ a <- filterIntensity(a, intensity = 300),
+ b <- filterIntensity(b, intensity = 300)
+ )
                        Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-filterIntensity(a,intensity=300)            0.127                  0               0.1
2 b<-filterIntensity(b,intensity=300)            0.062                  0               0.1

Next we bin the data. This retrieves the range of the m/z values from the full data set and then adds the binning function to the processing queue.

> peakRAM(
+ a <- bin(a, binSize = 0.1),
+ b <- bin(b, binSize = 0.2)
+ )
          Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-bin(a,binSize=0.1)          507.889                4.1           58706.8
2 b<-bin(b,binSize=0.2)          301.128                0.8           35783.2

Here, the Spectra with the chunk-wise processing could perform the step faster and (if we can indeed trust the memory estimates) with a lower memory usage.

At last we calculate the tic for each spectrum. This call will require loading the data and executing the processing queue.

> peakRAM(
+ ionCount(a),
+ ionCount(b)
+ )
   Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 ionCount(a)          552.087                3.8           55898.9
2 ionCount(b)          366.852                3.9           26741.5

Chunk-wise processing thus reduced both run-time and memory demand. As a side effect, chunk-wise processing would enable also parallel processing, as a set of chunks can then be processed in parallel.

jorainer commented 6 months ago

This is now implemented and activated by default. Closing issue.