Closed shubham1637 closed 3 years ago
I am not sure where the errors comes from, and whether it is really an mzR
error. The code is too complex for me to dig in at the moment.
On the other hand, you might want to look into MSnbase
that efficiently handles many files at once and can extract chromatograms.
MSnbase creates chromatograms from a spectrum mzML file instead of reading them from a chromatogram mzML file.
mzR::chromatograms()
reads chromatogram based on their native ID
. I don't think MSnbase can be used to read the chromatogram mzML files.
I have simplified the code
filenames <- list.files()
mzPntrs <- lapply(filenames, mzR::openMSfile, backend = "pwiz")
runs <- c("run0", "run1")
names(mzPntrs) <- runs
# INDICES OF CHROMATOGRAMS TO BE READ
prec2Chrom <- list("run0" = list(c(138367, 139120, 139381, 139771, 139967, 140503), c(128977, 129271, 129484, 129919, 120448, 140926)),
"run1" = list(c(151348, 152101, 152362, 152752, 152948, 153484), c(71033, 71327, 71540, 71976, 72507, 72985)))
refs <- c("run1", "run0")
# worker = 1 : no error, worker > 1 : throws errors
BiocParallel::register(BiocParallel::MulticoreParam(workers = 6, log = TRUE, threshold = "TRACE", progressbar = T))
repeaT <- 10000
k <- BiocParallel::bplapply(1:repeaT, function(j) {
i <- (j %% 2) +1
ref <- refs[i]
chromIndices <- prec2Chrom[[ref]][[i]]
a <- mzR::chromatograms(mzPntrs[[ref]], chromIndices+j) # j is added for reading different chromatograms each time it runs
ep <- setdiff(runs, ref)
chromIndices <- prec2Chrom[[ep]][[i]]
b <- mzR::chromatograms(mzPntrs[[ep]], chromIndices+j) # j is added for reading different chromatograms each time it runs
invisible(NULL)
})
Note that the MSnbase::readSRMData
does in fact read only chromatogram data from mzML file(s).
In general, I would make sure that you are not reading in parallel from the same file. If you have to use parallel processing I would always split the workload by file. Multiple processes reading at the same time from the same file will be slow. Also, I am not sure how and if it works to use the same filehandle in different parallel processes. Maybe that could be a reason for these strange errors you see.
To tackle this problem I would 1) use lapply
instead of bplapply
, 2) write the code such that it reads chromatogram(s) from a single file at a time and 3), if all works, try the same with bplapply
.
I tried MSnbase::readSRMData
, it reads all the chromatograms instead of reading a specific one as done by mzR::chromatogram
.
Thanks for the suggestions.
The error seems to be coming from here
Which has the following responsibilities: https://github.com/sneumann/mzR/blob/a6e86e062b4d1923032e122e36e855591913ab5c/src/pwiz/utility/minimxml/SAXParser.cpp#L171-L174
Do you think, some of them could be violated by multiple threads reading the same file?
Do you think, some of them could be violated by multiple threads reading the same file?
That's what I think is the main problem. Multiple threads using the same file handle to read from the same file.
Hi, I am trying to fetch chromatograms in parallel with BiocParallel, however, I find an error most of the time. In the original analysis, I need to fetch almost 60k chromatograms from 16 runs. I am presenting a scaled-down version that only need two runs.
The chromatogram files are available at the Peptide Atlas.
If I use only one core, I see no error, but, when I use more than one core it throws an error that varies every time I run it. Also, if
repeaT = 100
sometimes it works or sometimes it fails with a similar error.Errors are like these:
R version info
I am not sure, why do I see these errors. Think these errors are ProteoWizard related, but not sure about the root cause of it.