rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/

MsBackendHdf5Peaks cannot load large mzML files simultaneously #137

Open plantton opened 3 years ago

plantton commented 3 years ago

I'm currently using the MsBackendHdf5Peaks backend to load several mzML files simultaneously:

library(Spectra)
fls <- Sys.glob("./*.mzML")  # expand the wildcard into the actual file paths
sps <- Spectra(fls, source = MsBackendMzR(), backend = MsBackendHdf5Peaks(), hdf5path = getwd())

The mzML files average 600 to 700 MB each. The loading code above freezes my desktop (16 GB RAM) after several minutes. One possible workaround is to set BPPARAM = SerialParam() in the call, but that defeats the purpose of having a parallel processing parameter.

The maximum number of mzML files that can be read in parallel on my desktop is 5.

jorainer commented 3 years ago

Thanks for reporting. Technically, the Spectra call above first creates a Spectra with an MsBackendMzR backend and then calls setBackend on that Spectra to change to an MsBackendHdf5Peaks backend. I would thus perform each of these steps separately to see where the error comes from.

Could you please call sps <- Spectra(fls, backend = MsBackendMzR()) on its own to see whether the error already occurs there?
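
As a rough sketch, the two steps could be run separately like this. The helper name load_in_two_steps is hypothetical (it is not part of Spectra), and the code assumes that setBackend passes hdf5path on to MsBackendHdf5Peaks in the same way as the combined call above:

```r
# Hypothetical helper (not part of Spectra): runs the two steps separately
# so that it becomes clear which one exhausts the memory.
load_in_two_steps <- function(fls, hdf5path = getwd()) {
    stopifnot(length(fls) > 0, all(file.exists(fls)))
    # Step 1: create the Spectra with the MsBackendMzR backend only.
    sps <- Spectra::Spectra(fls, backend = Spectra::MsBackendMzR())
    message("MsBackendMzR object size: ",
            format(object.size(sps), units = "MB"))
    # Step 2: change the backend; the peak data is written to HDF5 files.
    # Assumption: hdf5path is passed through to MsBackendHdf5Peaks.
    Spectra::setBackend(sps, Spectra::MsBackendHdf5Peaks(),
                        hdf5path = hdf5path)
}

# Usage: sps <- load_in_two_steps(Sys.glob("./*.mzML"))
```

If the first step already fails, the problem lies in reading the mzML files and not in the HDF5 conversion.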

The BPPARAM parameter defines the parallel processing setup for this call. Sometimes it makes more sense to disable parallel processing, because it always needs more memory than serial processing (each process needs its own memory, and the partial results have to be merged into one final result object). Note that I was able to read 6 files in parallel on my 8-core computer, and also had no problem creating a Spectra with an MsBackendMzR backend containing 30,160,006 spectra from 17,562 mzML files (the size of the final object in memory was 6 GB).
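
A possible middle ground between full parallel processing and SerialParam() is to cap the number of workers. The following is only a sketch: the helper name capped_bpparam is hypothetical, and the worker count of 2 is an arbitrary value that would need to be tuned to the available memory.

```r
# Hypothetical helper: build a BPPARAM with a capped number of workers.
# A value of 1 (or less) falls back to serial processing.
capped_bpparam <- function(workers = 2L) {
    if (workers <= 1L)
        return(BiocParallel::SerialParam())
    # SnowParam is used here because it works on all platforms.
    BiocParallel::SnowParam(workers = workers)
}

# Usage (assuming fls holds the mzML file paths):
# sps <- Spectra::Spectra(fls, backend = Spectra::MsBackendMzR(),
#                         BPPARAM = capped_bpparam(2L))
```

Fewer workers means fewer files held in memory at the same time, at the cost of a longer import.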

Could you then also please provide the output of your sessionInfo?

lgatto commented 3 years ago

@plantton - please follow up on this issue or close it.