sneumann / mzR

This is the git repository matching the Bioconductor package mzR: parser for netCDF, mzXML, mzData and mzML files (mass spectrometry data)
40 stars 26 forks source link

Unable to read chromatograms in parallel #228

Closed shubham1637 closed 3 years ago

shubham1637 commented 3 years ago

Hi, I am trying to fetch chromatograms in parallel with BiocParallel, however, I find an error most of the time. In the original analysis, I need to fetch almost 60k chromatograms from 16 runs. I am presenting a scaled-down version that only need two runs.

The chromatogram files are available at the Peptide Atlas.

mkdir Temp && cd Temp
wget ftp://PASS01508:KN6663gq@ftp.peptideatlas.org/results/mzml/hroest_K120809_Strep0%PlasmaBiolRepl1_R04_SW.chrom.mzML
wget ftp://PASS01508:KN6663gq@ftp.peptideatlas.org/results/mzml/hroest_K120809_Strep10%PlasmaBiolRepl2_R04_SW.chrom.mzML

If I use only one core, I see no error, but, when I use more than one core it throws an error that varies every time I run it. Also, if repeaT = 100 sometimes it works or sometimes it fails with a similar error.

setwd("Temp")
filenames <- list.files()
mzPntrs <- lapply(filenames, mzR::openMSfile, backend = "pwiz")
runs <- c("run0", "run1")
names(mzPntrs) <- runs
prec2Chrom <- list("run0" = list(c(138367, 139120, 139381, 139771, 139967, 140503), c(128977, 129271, 129484, 129919, 120448, 140926)),
"run1" = list(c(151348, 152101, 152362, 152752, 152948, 153484), c(71033, 71327, 71540, 71976, 72507, 72985)))
refs <- c("run1", "run0")

BiocParallel::register(BiocParallel::MulticoreParam(workers = 1, log = TRUE, threshold = "TRACE"))
repeaT <- 10000
k <- BiocParallel::bplapply(1:repeaT, function(j) {
    i <- (j %% 2) +1
    ref <- refs[i]
    chromIndices <- prec2Chrom[[ref]][i]
    a <- lapply(chromIndices, function(i1) mzR::chromatograms(mzPntrs[[ref]], i1+j))
    exps <- setdiff(runs, ref)
    for(ep in exps) {
        chromIndices <- prec2Chrom[[ep]][i]
        b <- lapply(chromIndices, function(i1) mzR::chromatograms(mzPntrs[[ep]], i1+j))
    }
    2*i
})

BiocParallel::register(BiocParallel::MulticoreParam(workers = 6, log = TRUE, threshold = "TRACE"))
repeaT <- 10000
k <- BiocParallel::bplapply(1:repeaT, function(j) {
    i <- (j %% 2) +1
    ref <- refs[i]
    chromIndices <- prec2Chrom[[ref]][i]
    a <- lapply(chromIndices, function(i1) mzR::chromatograms(mzPntrs[[ref]], i1+j))
    exps <- setdiff(runs, ref)
    for(ep in exps) {
        chromIndices <- prec2Chrom[[ep]][i]
        b <- lapply(chromIndices, function(i1) mzR::chromatograms(mzPntrs[[ep]], i1+j))
    }
    2*i
})

Errors are like these:

stderr and stdout:
ERROR [2020-08-17 20:25:20] [SAXParser::ParserWrangler::elementEnd()] Illegal end tag "binaryDataArray" at offset 133057.
stderr and stdout:
ERROR [2020-08-17 20:24:18] [SAXParser::ParserWrangler::elementEnd()] Illegal end tag "offset" at offset 11141133.
stderr and stdout:
ERROR [2020-08-17 20:17:51] [SAXParser::ParserWrangler::elementEnd()] Illegal end tag "index" at offset 9359909.
stop worker failed:
  attempt to select less than one element in OneIndex 

R version info

version  R version 4.0.2 (2020-06-22)
mzR            2.22.0  2020-04-27 [1] Bioconductor  
BiocParallel   1.22.0  2020-04-27 [1] Bioconductor

I am not sure, why do I see these errors. Think these errors are ProteoWizard related, but not sure about the root cause of it.

lgatto commented 3 years ago

I am not sure where the errors comes from, and whether it is really an mzR error. The code is too complex for me to dig in at the moment.

On the other hand, you might want to look into MSnbase that efficiently handles many files at once and can extract chromatograms.

shubham1637 commented 3 years ago

MSnbase creates chromatograms from a spectrum mzML file instead of reading them from a chromatogram mzML file. mzR::chromatograms() reads chromatogram based on their native ID. I don't think MSnbase can be used to read the chromatogram mzML files.

I have simplified the code

filenames <- list.files()
mzPntrs <- lapply(filenames, mzR::openMSfile, backend = "pwiz")
runs <- c("run0", "run1")
names(mzPntrs) <- runs
# INDICES OF CHROMATOGRAMS TO BE READ
prec2Chrom <- list("run0" = list(c(138367, 139120, 139381, 139771, 139967, 140503), c(128977, 129271, 129484, 129919, 120448, 140926)),
"run1" = list(c(151348, 152101, 152362, 152752, 152948, 153484), c(71033, 71327, 71540, 71976, 72507, 72985)))
refs <- c("run1", "run0")

# worker = 1 : no error, worker > 1 : throws errors 
BiocParallel::register(BiocParallel::MulticoreParam(workers = 6, log = TRUE, threshold = "TRACE", progressbar = T))
repeaT <- 10000
k <- BiocParallel::bplapply(1:repeaT, function(j) {
    i <- (j %% 2) +1
    ref <- refs[i]
    chromIndices <- prec2Chrom[[ref]][[i]]
    a <- mzR::chromatograms(mzPntrs[[ref]], chromIndices+j) # j is added for reading different chromatograms each time it runs
    ep <- setdiff(runs, ref)
    chromIndices <- prec2Chrom[[ep]][[i]]
    b <- mzR::chromatograms(mzPntrs[[ep]], chromIndices+j) # j is added for reading different chromatograms each time it runs
    invisible(NULL)
})
jorainer commented 3 years ago

Note that the MSnbase::readSRMData does in fact read only chromatogram data from mzML file(s).

In general, I would make sure that you are not reading in parallel from the same file. If you have to use parallel processing I would always split the workload by file. Multiple processes reading at the same time from the same file will be slow. Also, I am not sure how and if it works to use the same filehandle in different parallel processes. Maybe that could be a reason for these strange errors you see.

To tackle this problem I would 1) use lapply instead of bplapply, 2) write the code such that it reads chromatogram(s) from a single file at a time and 3), if all works, try the same with bplapply.

shubham1637 commented 3 years ago

I tried MSnbase::readSRMData, it reads all the chromatograms instead of reading a specific one as done by mzR::chromatogram. Thanks for the suggestions.

shubham1637 commented 3 years ago

The error seems to be coming from here

https://github.com/sneumann/mzR/blob/a6e86e062b4d1923032e122e36e855591913ab5c/src/pwiz/utility/minimxml/SAXParser.cpp#L229

Which has the following responsibilities: https://github.com/sneumann/mzR/blob/a6e86e062b4d1923032e122e36e855591913ab5c/src/pwiz/utility/minimxml/SAXParser.cpp#L171-L174

Do you think, some of them could be violated by multiple threads reading the same file?

jorainer commented 3 years ago

Do you think, some of them could be violated by multiple threads reading the same file?

That's what I think is the main problem. Multiple threads using the same file handle to read from the same file.