jorainer closed this pull request 7 months ago
@philouail, I've fixed some more things; can you please give it another careful look? Any questions, concerns, comments or change requests are highly welcome!
I've now also added a vignette describing the parallel processing settings. Please have a look at that too.
Thanks for the reviews @andreavicini and @philouail! I will merge now after having another look myself.
@jorainer sorry, I didn't review the code, but a small suggestion anyway: adding a new slot to the `Spectra` class would break backward compatibility. So I would suggest incrementing the "version" of the class and bumping the minor number of the `Version` field in the DESCRIPTION file.
Hi Sebastian @sgibb, thanks for the suggestion. I'll increment the class version. I ensured backward compatibility through the accessor function, which checks whether the object has the slot and, if not, returns `Inf` (the default for the slot value). Also, `Spectra` methods will (automatically) call `updateObject` if required. So, backward compatibility should be guaranteed.
I would maybe not bump the minor version of the package, to not interfere with the Bioconductor versioning?
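For illustration, such a backward-compatible accessor could look like the following. This is a hypothetical sketch, not the actual Spectra code; it only shows the pattern of falling back to the default when an object was serialized with an older class definition that lacks the slot:

```r
## Sketch of a backward-compatible accessor: objects created before the
## slot was introduced do not have it, so return the default (Inf) instead
## of erroring.
processingChunkSize <- function(object) {
    if (methods::.hasSlot(object, "processingChunkSize"))
        object@processingChunkSize
    else Inf
}
```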
This PR fixes issue #304. In brief: it adds the possibility for the user to define and enable chunk-wise processing of a `Spectra`. This affects all functions working on peaks data (e.g. `lengths`, `mz`, `peaksData`) and ensures that even large-scale data can be handled, reducing out-of-memory errors.

What this PR adds:

- `Spectra` gains a new slot `@processingChunkSize`.
- `processingChunkSize` and `processingChunkSize<-` get or set the size of the chunks for chunk-wise processing. The default is `Inf`, hence no chunk-wise processing is performed (important e.g. for small data sets or in-memory backends).
- `backendParallelFactor,MsBackend` method: this allows backends to suggest a preferred splitting of the data into chunks. The default is to return `factor()` (i.e. no preferred splitting); `MsBackendMzR` on the other hand returns a `factor` based on the `"dataStorage"` spectra variable (hence suggests splitting by original data file).
- The `peaksapply` function uses either the chunks defined through `processingChunkSize` for chunk-wise processing or, if that is not set, the suggested splitting from the backend (through `backendParallelFactor`).
- A `Spectra` will be split using the `processingChunkFactor` function, which returns a `factor` representing the chunks (defined through `processingChunkSize`), or, if that is not set, the suggested splitting (through `backendParallelFactor`), or `factor()`, in which case no chunk-wise processing is performed.

This processing is used by all `Spectra` methods accessing (or processing) peaks data. To avoid performance loss for small data sets or in-memory backends it is not performed by default. If enabled by the user, it allows processing even of large data. I think this is a very important improvement, allowing the analysis of large (on-disk) data, for which we ran into unexpected issues (see #304).
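To make the intended usage concrete, here is a minimal sketch. The function calls are those described above; the file names and the chunk size of 1000 are made up for illustration:

```r
library(Spectra)

## Hypothetical input files; any mzML files would do.
fls <- c("file1.mzML", "file2.mzML")
sps <- Spectra(fls, source = MsBackendMzR())

## The default is Inf: no chunk-wise processing is performed.
processingChunkSize(sps)

## Enable chunk-wise processing in chunks of 1000 spectra; peaks data
## accessors such as mz() or peaksData() will then load and process the
## data chunk by chunk, keeping memory use low.
processingChunkSize(sps) <- 1000

## The factor defining the chunks (or the backend's suggested splitting,
## e.g. by data file for MsBackendMzR, if no chunk size is set).
processingChunkFactor(sps)
```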
Happy to discuss @sgibb @lgatto @philouail.