lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

Error with parallelization on R 4.1 #564

Closed agnesblch closed 2 years ago

agnesblch commented 2 years ago

Last year I developed an R tool for in-silico prediction of metabolites and their detection in mzML files using some MSnbase functions. In this project I need to read an mzML file and search for all putative predicted metabolites. To improve the calculation speed I use CPU parallelisation. However, since the release of R 4.1, I get the error "cannot open the connection" when running my tool. It works fine if I assign only 1 core to the cluster.

Here is an example using R 4.1.2 and MSnbase 2.15.7: [image]

Using R 4.0.5 and MSnbase 2.15.7 works: [image]

Thanks in advance for your help

lgatto commented 2 years ago

I don't think this has anything to do with MSnbase per se - we haven't made any changes pertaining to the parallel processing, nor to the chromatogram() function (@jorainer - can you confirm, in case I'm missing something). I suspect that the error stems from changes somewhere else.

What happens when you type traceback() after the error?

jorainer commented 2 years ago

Are you running this on Windows? One possibility could be that Windows starts a new R process for each parallel task, and that these processes are independent from the "original" R process.

Also note that you might run into nested parallel processing here: one level in foreach, and then chromatogram will also use parallel processing on a per-file basis. That could turn out to be problematic - maybe not in this example because you're loading just one file. But you could specifically turn off parallel processing in chromatogram with chromatogram(..., BPPARAM = SerialParam()).
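A minimal sketch of what forcing serial processing inside `chromatogram` could look like. The helper name `extract_xic`, the file name `sample.mzML`, and the m/z range are hypothetical; only `readMSData`, `chromatogram`, and `SerialParam` come from MSnbase and BiocParallel.

```r
# Hypothetical helper: extract one ion chromatogram without nested parallelism.
# `raw` is assumed to be an OnDiskMSnExp, `mz_range` a numeric(2) (lower, upper).
extract_xic <- function(raw, mz_range) {
  MSnbase::chromatogram(
    raw,
    mz = mz_range,
    BPPARAM = BiocParallel::SerialParam()  # force serial processing in this call
  )
}

mzml_file <- "sample.mzML"  # hypothetical path, replace with a real file

if (file.exists(mzml_file)) {
  raw <- MSnbase::readMSData(mzml_file, mode = "onDisk")
  xic <- extract_xic(raw, c(300.10, 300.12))  # made-up m/z range
}
```

A helper like this could be called from the body of a foreach loop, so that the only level of parallelism is the one set up by the outer cluster.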

Also, I don't think your approach is very efficient: you are parallelizing over m/z values and calling chromatogram in each task to load the data. So you will have parallel processes that read the same file at the same time. I don't know how the operating system handles this, but I guess one process will have to wait until the other has finished reading...

As an alternative, you could define a matrix with columns "mzmin" and "mzmax". You can then pass this matrix with parameter mz to the chromatogram function, which will then a) read the data only once and b) extract the data for each row in mz and return it as a row in the result object. Parallelizing is not always the fastest option, since there will always be some additional overhead (splitting the data, sending the data to each process, collecting the results, merging the results) compared to serial processing. Note also that additionally filtering on retention time would speed up everything, as only the spectra within the retention time ranges would need to be read from the raw data files (data import is the slowest operation).
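The matrix-based approach could be sketched as follows. The helper `mz_windows`, the ppm tolerance, the target m/z values, and the file name are all assumptions made for illustration; they are not taken from the tool discussed here.

```r
# Hypothetical helper: build an m/z window matrix from target m/z values
# and a ppm tolerance (both are assumptions for this sketch).
mz_windows <- function(mz, ppm = 10) {
  delta <- mz * ppm / 1e6
  cbind(mzmin = mz - delta, mzmax = mz + delta)
}

targets <- c(180.0634, 255.2319, 301.1410)  # made-up m/z values
mzr <- mz_windows(targets, ppm = 10)

mzml_file <- "sample.mzML"  # hypothetical path, replace with a real file

if (file.exists(mzml_file)) {
  raw <- MSnbase::readMSData(mzml_file, mode = "onDisk")
  # A single call covers all windows; the result has one row per row of `mzr`.
  chrs <- MSnbase::chromatogram(raw, mz = mzr)
}
```

If a retention time window is known, it could be passed through the `rt` argument of `chromatogram` in the same call to reduce the number of spectra that have to be read.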

agnesblch commented 2 years ago

Thank you jorainer, I didn't notice that parallelisation is enabled by default in chromatogram. Setting BPPARAM = SerialParam() solves the error in R 4.1 on Windows and my tool works perfectly.

Regarding the retention time filtering: since the putative metabolites are predicted based only on the parent drug's molecular formula, I need the full chromatogram for each predicted m/z. I then extract the mass spectrum at each local maximum found in the chromatogram.
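As an illustration only, local maxima in a chromatogram's intensity vector could be located with a few lines of base R. The helper `local_maxima` and the intensity values are hypothetical, and this is not necessarily how the tool itself does it; treating missing intensities as zero is also an assumption.

```r
# Hypothetical helper: indices of strict local maxima in an intensity vector.
local_maxima <- function(intensity) {
  intensity[is.na(intensity)] <- 0  # assumption: missing intensities count as zero
  # A local maximum is where the slope sign changes from rising (+1) to falling (-1).
  which(diff(sign(diff(intensity))) == -2) + 1
}

int <- c(0, 5, 20, 8, NA, 3, 12, 30, 11, 0)  # made-up intensities
apex <- local_maxima(int)
print(apex)  # 3 8
```

For an MSnbase Chromatogram object, the intensity vector would come from `intensity()` and the matching retention times from `rtime()`. Flat-topped peaks are not detected by this simple test, so real data may need smoothing or an intensity threshold first.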

jorainer commented 2 years ago

Thanks for the update @agnesblch. I assume we can then close the issue? Feel free to re-open if needed.