Open jorainer opened 4 years ago
Yes, I thing NA
would be better in this case.
Seems this is defined in RAMPAdapter.cpp within proteowizard - so changing this to NA
might involve a little more coding that I expected. If we agree on its usability I could have a try...
I would advise against modifying third-part/pwiz code, but would argue for a conversion on our side (either R or C level).
Sure - I won't touch the pwiz code - but I would implement the respecive code to read out spectra header in our cpp code and avoid using the pwiz/RAMPadapter. In the meantime we are anyway reading more header parameters from the mzML files than the RAMPadapter provides.
Sure. And it might not a bad thing considering deprecating RAMPadapter anyway on our end.
I've implemented a header
method that does not required the pwiz/RAMPAdapter.cpp. Below are the results from a comparison of the new header2
and old header
methods (if successfull I will replace the original header
with the new one.
library(mzR)
library(testthat)
library(microbenchmark)
## mzML file with only MS1 spectra
fl <- system.file("sciex/20171016_POOL_POS_1_105-134.mzML", package = "msdata")
msd <- openMSfile(fl)
hdr <- header(msd)
hdr2 <- mzR:::header2(msd)
## Identify header columns with 0 and don't check them - they will be NA in the new code
exclude <- which(vapply(hdr, function(z) all(z == 0), logical(1)))
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] TRUE
microbenchmark(header(msd), mzR:::header2(msd), times = 100)
Unit: milliseconds
expr min lq mean median uq max
header(msd) 115.9937 122.82290 126.78744 127.27785 129.81995 140.5113
mzR:::header2(msd) 28.4591 29.53095 30.30068 29.99705 30.23355 37.8110
neval cld
100 b
100 a
## spectra subset
hdr <- header(msd, c(13, 17, 45))
hdr2 <- mzR:::header2(msd, c(13, 17, 45))
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] TRUE
microbenchmark(header(msd, c(13, 17, 45)), mzR:::header2(msd, c(13, 17, 45)))
Unit: milliseconds
expr min lq mean median uq
header(msd, c(13, 17, 45)) 16.9977 17.97255 18.782468 18.60515 19.25325
mzR:::header2(msd, c(13, 17, 45)) 1.7479 1.86010 1.941684 1.90020 1.96585
max neval cld
21.6552 100 b
4.8813 100 a
The new function returns the same result (except it reports NA
instead of 0
) and is considerably faster for the file above. I'm next going to compare that for some more mzML files (with MS2 spectra) and also for mzXML files.
Next test: mzML file with MS2 spectra:
library(mzR)
library(testthat)
library(microbenchmark)
fl <- system.file("proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz", package = "msdata")
msd <- openMSfile(fl)
hdr <- header(msd)
hdr2 <- mzR:::header2(msd)
exclude <- which(vapply(hdr, function(z) all(z == 0), logical(1)))
colnames(hdr)[exclude]
[1] "ionisationEnergy" "mergedScan"
[3] "mergedResultScanNum" "mergedResultStartScanNum"
[5] "mergedResultEndScanNum"
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “collisionEnergy”: 'is.NA' value mismatch: 58 in current 0 in target"
[2] "Component “precursorScanNum”: 'is.NA' value mismatch: 58 in current 0 in target"
[3] "Component “precursorMZ”: 'is.NA' value mismatch: 58 in current 0 in target"
[4] "Component “precursorCharge”: 'is.NA' value mismatch: 58 in current 0 in target"
[5] "Component “precursorIntensity”: 'is.NA' value mismatch: 58 in current 0 in target"
## There are differences in these columns: the new function returns NA for the precursor
## spectrum, the old function 0. So, after replacing 0s with NA
hdr[hdr == 0] <- NA
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “centroided”: 'is.NA' value mismatch: 0 in current 58 in target"
## We accept this - the hdr == 0 replaced also FALSE with NA, so that's the
## difference for column centroided
close(msd)
## Test with a second file:
fl <- system.file("TripleTOF-SWATH/PestMix1_SWATH.mzML", package = "msdata")
msd <- openMSfile(fl)
hdr <- header(msd)
hdr2 <- mzR:::header2(msd)
exclude <- which(vapply(hdr, function(z) all(z == 0), logical(1)))
colnames(hdr)[exclude]
[1] "collisionEnergy" "ionisationEnergy"
[3] "lowMZ" "highMZ"
[5] "precursorScanNum" "precursorCharge"
[7] "precursorIntensity" "mergedScan"
[9] "mergedResultScanNum" "mergedResultStartScanNum"
[11] "mergedResultEndScanNum" "injectionTime"
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “precursorMZ”: 'is.NA' value mismatch: 999 in current 0 in target"
hdr$precursorMZ[hdr$precursorMZ == 0] <- NA
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] TRUE
## spectra subset
hdr <- header(msd, c(13, 17, 45))
hdr2 <- mzR:::header2(msd, c(13, 17, 45))
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “precursorMZ”: 'is.NA' value mismatch: 1 in current 0 in target"
hdr$precursorMZ[hdr$precursorMZ == 0] <- NA
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] TRUE
close(msd)
So, spectrum information returned by both functions are equivalent, with the only difference being that with the new code, missing information is encoded with NA
and not longer with 0
.
At last: an mzXML file:
library(mzR)
library(testthat)
library(microbenchmark)
fl <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", package = "msdata")
msd <- openMSfile(fl)
hdr <- header(msd)
hdr2 <- mzR:::header2(msd)
exclude <- which(vapply(hdr, function(z) all(z == 0), logical(1)))
colnames(hdr)[exclude]
[1] "ionisationEnergy" "mergedScan"
[3] "mergedResultScanNum" "mergedResultStartScanNum"
[5] "mergedResultEndScanNum" "injectionTime"
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “collisionEnergy”: 'is.NA' value mismatch: 5 in current 0 in target"
[2] "Component “precursorScanNum”: 'is.NA' value mismatch: 5 in current 0 in target"
[3] "Component “precursorMZ”: 'is.NA' value mismatch: 5 in current 0 in target"
[4] "Component “precursorCharge”: 'is.NA' value mismatch: 5 in current 0 in target"
[5] "Component “precursorIntensity”: 'is.NA' value mismatch: 5 in current 0 in target"
## Difference is again due to the different encoding of missing values:
hdr$precursorCharge
[1] 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0
[39] 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
hdr2$precursorCharge
[1] NA 1 1 0 0 0 0 0 0 0 0 NA 1 1 0 0 0 0 0 0 0 0 NA 1 1
[26] 0 0 0 0 0 0 0 0 NA 1 1 0 0 0 0 0 0 0 0 NA 1 1 0 0 0
[51] 0 0 0 0 0
The result returned by the two functions is equivalent, with the difference of the different encoding of missing values. Reporting 0
for missing information can even be problematic as shown on the last example above in which the result from header
does not allow to distinguish between a charge of 0
and the missing precursor charge of a MS1 spectrum.
This look good to me. Before pushing, it would be good to test/check all dependencies (in particular MSnbase
and xcms
) - I wouldn't be too surprised if some tests started to fail.
Sure. I will build it locally on my computer and then check MSnbase
and xcms
.
Packages to check:
MSnbase
: https://github.com/lgatto/MSnbase/commit/b8551c97fe20115f3349b97b4276be9d951bcdb0xcms
: https://github.com/sneumann/xcms/commit/3063615b83022b52d9584abed89dcc7995c2c055msPurity
adductomicsR
CluMSID
MSnID
RMassBank
SIMAT
topdownr
yamss
PR #207 would provide the new header
function (and a new peaks
implementation too).
I was wondering - as of now
header
is returning0
for spectrum variables that are not defined (such asprecursorMZ = 0
for MS1 spectra). Wouldn't it more intuitive to reportNA
in such cases?Any comments @sneumann @lgatto ?