sneumann / mzR

This is the git repository matching the Bioconductor package mzR: parser for netCDF, mzXML, mzData and mzML files (mass spectrometry data)
40 stars 26 forks source link

header to report NA instead of 0? #205

Open jorainer opened 4 years ago

jorainer commented 4 years ago

I was wondering - as of now header is returning 0 for spectrum variables that are not defined (such as precursorMZ = 0 for MS1 spectra). Wouldn't it more intuitive to report NA in such cases?

Any comments @sneumann @lgatto ?

lgatto commented 4 years ago

Yes, I thing NA would be better in this case.

jorainer commented 4 years ago

Seems this is defined in RAMPAdapter.cpp within proteowizard - so changing this to NA might involve a little more coding that I expected. If we agree on its usability I could have a try...

lgatto commented 4 years ago

I would advise against modifying third-part/pwiz code, but would argue for a conversion on our side (either R or C level).

jorainer commented 4 years ago

Sure - I won't touch the pwiz code - but I would implement the respecive code to read out spectra header in our cpp code and avoid using the pwiz/RAMPadapter. In the meantime we are anyway reading more header parameters from the mzML files than the RAMPadapter provides.

lgatto commented 4 years ago

Sure. And it might not a bad thing considering deprecating RAMPadapter anyway on our end.

jorainer commented 4 years ago

I've implemented a header method that does not required the pwiz/RAMPAdapter.cpp. Below are the results from a comparison of the new header2 and old header methods (if successfull I will replace the original header with the new one.

library(mzR)
library(testthat)
library(microbenchmark)
## mzML file with only MS1 spectra
fl <- system.file("sciex/20171016_POOL_POS_1_105-134.mzML", package = "msdata")
msd <- openMSfile(fl)

hdr <- header(msd)
hdr2 <- mzR:::header2(msd)

## Identify header columns with 0 and don't check them - they will be NA in the new code
exclude <- which(vapply(hdr, function(z) all(z == 0), logical(1)))

all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] TRUE

microbenchmark(header(msd), mzR:::header2(msd), times = 100)
Unit: milliseconds
               expr      min        lq      mean    median        uq      max
        header(msd) 115.9937 122.82290 126.78744 127.27785 129.81995 140.5113
 mzR:::header2(msd)  28.4591  29.53095  30.30068  29.99705  30.23355  37.8110
 neval cld
   100   b
   100  a 

## spectra subset
hdr <- header(msd, c(13, 17, 45))
hdr2 <- mzR:::header2(msd, c(13, 17, 45))
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] TRUE

microbenchmark(header(msd, c(13, 17, 45)), mzR:::header2(msd, c(13, 17, 45)))
Unit: milliseconds
                              expr     min       lq      mean   median       uq
        header(msd, c(13, 17, 45)) 16.9977 17.97255 18.782468 18.60515 19.25325
 mzR:::header2(msd, c(13, 17, 45))  1.7479  1.86010  1.941684  1.90020  1.96585
     max neval cld
 21.6552   100   b
  4.8813   100  a 

The new function returns the same result (except it reports NA instead of 0) and is considerably faster for the file above. I'm next going to compare that for some more mzML files (with MS2 spectra) and also for mzXML files.

jorainer commented 4 years ago

Next test: mzML file with MS2 spectra:

library(mzR)
library(testthat)
library(microbenchmark)
fl <- system.file("proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz", package = "msdata")
msd <- openMSfile(fl)

hdr <- header(msd)
hdr2 <- mzR:::header2(msd)

exclude <- which(vapply(hdr, function(z) all(z == 0), logical(1)))
colnames(hdr)[exclude]
[1] "ionisationEnergy"         "mergedScan"              
[3] "mergedResultScanNum"      "mergedResultStartScanNum"
[5] "mergedResultEndScanNum"  

all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “collisionEnergy”: 'is.NA' value mismatch: 58 in current 0 in target"   
[2] "Component “precursorScanNum”: 'is.NA' value mismatch: 58 in current 0 in target"  
[3] "Component “precursorMZ”: 'is.NA' value mismatch: 58 in current 0 in target"       
[4] "Component “precursorCharge”: 'is.NA' value mismatch: 58 in current 0 in target"   
[5] "Component “precursorIntensity”: 'is.NA' value mismatch: 58 in current 0 in target"

## There are differences in these columns: the new function returns NA for the precursor
## spectrum, the old function 0. So, after replacing 0s with NA
hdr[hdr == 0] <- NA
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “centroided”: 'is.NA' value mismatch: 0 in current 58 in target"
## We accept this - the hdr == 0 replaced also FALSE with NA, so that's the 
## difference for column centroided
close(msd)

## Test with a second file:
fl <- system.file("TripleTOF-SWATH/PestMix1_SWATH.mzML", package = "msdata")
msd <- openMSfile(fl)

hdr <- header(msd)
hdr2 <- mzR:::header2(msd)

exclude <- which(vapply(hdr, function(z) all(z == 0), logical(1)))
colnames(hdr)[exclude]
 [1] "collisionEnergy"          "ionisationEnergy"        
 [3] "lowMZ"                    "highMZ"                  
 [5] "precursorScanNum"         "precursorCharge"         
 [7] "precursorIntensity"       "mergedScan"              
 [9] "mergedResultScanNum"      "mergedResultStartScanNum"
[11] "mergedResultEndScanNum"   "injectionTime"           

all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “precursorMZ”: 'is.NA' value mismatch: 999 in current 0 in target"

hdr$precursorMZ[hdr$precursorMZ == 0] <- NA
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] TRUE

## spectra subset
hdr <- header(msd, c(13, 17, 45))
hdr2 <- mzR:::header2(msd, c(13, 17, 45))
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “precursorMZ”: 'is.NA' value mismatch: 1 in current 0 in target"

hdr$precursorMZ[hdr$precursorMZ == 0] <- NA
all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] TRUE

close(msd)

So, spectrum information returned by both functions are equivalent, with the only difference being that with the new code, missing information is encoded with NA and not longer with 0.

jorainer commented 4 years ago

At last: an mzXML file:

library(mzR)
library(testthat)
library(microbenchmark)

fl <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", package = "msdata")
msd <- openMSfile(fl)

hdr <- header(msd)
hdr2 <- mzR:::header2(msd)

exclude <- which(vapply(hdr, function(z) all(z == 0), logical(1)))
colnames(hdr)[exclude]
[1] "ionisationEnergy"         "mergedScan"              
[3] "mergedResultScanNum"      "mergedResultStartScanNum"
[5] "mergedResultEndScanNum"   "injectionTime"           

all.equal(hdr[, -exclude], hdr2[, -exclude])
[1] "Component “collisionEnergy”: 'is.NA' value mismatch: 5 in current 0 in target"   
[2] "Component “precursorScanNum”: 'is.NA' value mismatch: 5 in current 0 in target"  
[3] "Component “precursorMZ”: 'is.NA' value mismatch: 5 in current 0 in target"       
[4] "Component “precursorCharge”: 'is.NA' value mismatch: 5 in current 0 in target"   
[5] "Component “precursorIntensity”: 'is.NA' value mismatch: 5 in current 0 in target"

## Difference is again due to the different encoding of missing values:
hdr$precursorCharge
 [1] 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0
[39] 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0

hdr2$precursorCharge
 [1] NA  1  1  0  0  0  0  0  0  0  0 NA  1  1  0  0  0  0  0  0  0  0 NA  1  1
[26]  0  0  0  0  0  0  0  0 NA  1  1  0  0  0  0  0  0  0  0 NA  1  1  0  0  0
[51]  0  0  0  0  0

The result returned by the two functions is equivalent, with the difference of the different encoding of missing values. Reporting 0 for missing information can even be problematic as shown on the last example above in which the result from header does not allow to distinguish between a charge of 0 and the missing precursor charge of a MS1 spectrum.

lgatto commented 4 years ago

This look good to me. Before pushing, it would be good to test/check all dependencies (in particular MSnbase and xcms) - I wouldn't be too surprised if some tests started to fail.

jorainer commented 4 years ago

Sure. I will build it locally on my computer and then check MSnbase and xcms.

jorainer commented 4 years ago

Packages to check:

jorainer commented 4 years ago

PR #207 would provide the new header function (and a new peaks implementation too).