jorainer closed this pull request 7 months ago
@philouail, I've fixed some more things; can you please give it another careful look? Any questions, concerns, comments or change requests are highly welcome!
I've now also added a vignette describing the parallel processing settings. Please have a look at that too.
Thanks for the reviews @andreavicini and @philouail! I will merge now after having another look myself.
@jorainer sorry, I didn't review the code, but a small suggestion anyway: adding a new slot to the `Spectra` class would break backward compatibility. So I would suggest incrementing the "version" of the class and bumping the minor number of the `Version` field in the DESCRIPTION file.
Hi Sebastian @sgibb, thanks for the suggestion. I'll increment the class version. I ensured backward compatibility through the accessor function, which checks whether the object has the slot and, if not, returns `Inf` (the default for the slot value). Also, `Spectra` methods will (automatically) call `updateObject` if required. So, backward compatibility should be guaranteed.
I would maybe not bump the minor version of the package, to not interfere with the Bioconductor versioning?
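For illustration, such a backward-compatible accessor could look like the following. This is a hypothetical sketch, not the actual Spectra code; it only shows the pattern of falling back to the default when an object was serialized with an older class definition that lacks the slot:

```r
## Sketch of a backward-compatible accessor: objects created before the
## slot was introduced do not have it, so return the default (Inf) instead
## of erroring.
processingChunkSize <- function(object) {
    if (methods::.hasSlot(object, "processingChunkSize"))
        object@processingChunkSize
    else Inf
}
```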
This PR fixes issue #304. In brief: it adds the possibility for the user to define and enable chunk-wise processing of a `Spectra`. This affects all functions working on peaks data (e.g. `lengths`, `mz`, `peaksData`) and ensures that even large-scale data can be handled, reducing out-of-memory errors.

What this PR adds:

- `Spectra` gains a new slot `@processingChunkSize`.
- `processingChunkSize` and `processingChunkSize<-` get or set the size of the chunks for chunk-wise processing. The default is `Inf`, hence no chunk-wise processing is performed (important e.g. for small data sets or in-memory backends).
- `backendParallelFactor,MsBackend` method: this allows backends to suggest a preferred splitting of the data into chunks. The default is to return `factor()` (i.e. no preferred splitting); `MsBackendMzR` on the other hand returns a `factor` based on the `"dataStorage"` spectra variable (hence suggests splitting by original data file).
- The `peaksapply` function uses either the chunks defined through `processingChunkSize` for chunk-wise processing or, if that is not set, the suggested splitting from the backend (through `backendParallelFactor`).
- A `Spectra` will be split using the `processingChunkFactor` function, which returns a `factor` representing the chunks (defined through `processingChunkSize`), or, if that is not set, the suggested splitting (through `backendParallelFactor`), or `factor()`, in which case no chunk-wise processing is performed.

This processing is used by all `Spectra` methods accessing (or processing) peaks data. To avoid performance loss for small data sets or in-memory backends it is not performed by default. If enabled by the user, it allows processing even of large data. I think this is a very important improvement, allowing the analysis of large (on-disk) data, for which we ran into unexpected issues (see #304).
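To make the intended usage concrete, here is a minimal sketch. The function calls are those described above; the file names and the chunk size of 1000 are made up for illustration:

```r
library(Spectra)

## Hypothetical input files; any mzML files would do.
fls <- c("file1.mzML", "file2.mzML")
sps <- Spectra(fls, source = MsBackendMzR())

## The default is Inf: no chunk-wise processing is performed.
processingChunkSize(sps)

## Enable chunk-wise processing in chunks of 1000 spectra; peaks data
## accessors such as mz() or peaksData() will then load and process the
## data chunk by chunk, keeping memory use low.
processingChunkSize(sps) <- 1000

## The factor defining the chunks (or the backend's suggested splitting,
## e.g. by data file for MsBackendMzR, if no chunk size is set).
processingChunkFactor(sps)
```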
Happy to discuss @sgibb @lgatto @philouail.