sneumann / mzR

This is the git repository matching the Bioconductor package mzR: parser for netCDF, mzXML, mzData and mzML files (mass spectrometry data)

Add acquisitionNum column to psms output #17

Closed sgibb closed 7 years ago

sgibb commented 9 years ago

Hello,

as investigated in https://github.com/lgatto/msidmatching/issues/2#issuecomment-38047074, the acquisitionNum/scan number reported by mzR for mzML/mzXML files contains the same numbers as the spectrumID column in the psms output for mzIdentML files.

Would it be possible to add an acquisitionNum column to the psms output to allow easier matching of identification and quantitation data, similar to this approach in mzID: https://github.com/thomasp85/mzID/pull/18/files? We use this column in MSnbase for the matching.

A possible solution (I am sure there is an Rcpp way as well):

## mzR/R/methods-mzRident.R
setMethod("psms",
          signature = c("mzRident"),
          function(object) {
            psms <- object@backend$getPsmInfo()
            ## Take the trailing integer of the spectrumID (e.g. "... scan=11871")
            psms$acquisitionNum <- as.numeric(sub("^.*=([[:digit:]]+)$", "\\1", psms$spectrumID))
            return(psms)
          })
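
To make the behaviour of that regular expression concrete, here is a small sketch that applies it to a few spectrumID strings. The Thermo-style ID is the one quoted in this thread; the other two are made-up examples, so treat them as assumptions about what such files contain.

```r
## Same regular expression as in the snippet above, applied to example IDs.
ids <- c("controllerType=0 controllerNumber=1 scan=11871",  # Thermo style
         "function=1 process=0 scan=42",                    # Waters style (made up)
         "sample=1 period=1 cycle=2 experiment=3")          # WIFF style (made up)

## The pattern keeps whatever integer follows the last "=" in the string.
acq <- as.numeric(sub("^.*=([[:digit:]]+)$", "\\1", ids))
acq
## 11871 42 3
```

The last value shows the limitation: for an ID that does not end in scan=..., the pattern still returns a number (here the experiment field), and nothing flags that it is not a scan number.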

Maybe it would be cleaner to look for these lines in the mzIdentML and parse them directly:

<cvParam accession="MS:1001115" cvRef="PSI-MS" value="9230" name="scan number(s)"/>
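
A rough base-R sketch of looking for these lines directly might be the following. The helper name scanNumbersFromMzid is hypothetical, and it uses regular expressions instead of a real XML parser purely to stay self-contained. It returns the values as character strings, on the assumption that a "scan number(s)" value may hold more than one number. It also does not link the values back to rows of the psms output.

```r
## Hypothetical helper: collect the value attribute of every
## "scan number(s)" cvParam (accession MS:1001115) in an mzIdentML text.
## Regex-based sketch only; a robust version would use an XML parser.
scanNumbersFromMzid <- function(lines) {
  txt <- paste(lines, collapse = "\n")
  ## All <cvParam ...> elements, in document order.
  cv <- regmatches(txt, gregexpr("<cvParam[^>]*>", txt))[[1]]
  ## Keep only the scan number(s) term.
  cv <- cv[grepl("accession=\"MS:1001115\"", cv, fixed = TRUE)]
  ## Extract the content of the value attribute.
  sub("^.*[[:space:]]value=\"([^\"]*)\".*$", "\\1", cv)
}

example <- c(
  "<cvParam accession=\"MS:1001115\" cvRef=\"PSI-MS\" value=\"9230\" name=\"scan number(s)\"/>",
  "<cvParam accession=\"MS:0000000\" cvRef=\"PSI-MS\" value=\"1\" name=\"unrelated term\"/>"
)
scanNumbersFromMzid(example)
## "9230"
```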

Best wishes,

Sebastian

adder commented 9 years ago

Just wanted to add that I also stumbled upon this inconvenience. I had to rely on the assumption that the spectrumID reflects the order of the spectra in the mzML file. I'm not sure if this is always true.

lgatto commented 9 years ago

I think it would be good to add an acquisition number to the psms output; I have also been confused in the past. To avoid any ambiguity, retrieving the information at the C++ level and adding a column to the data.frame seems the best option. @thirdwing, could you have a look at some point?

lgatto commented 9 years ago

There is no acquisition number in an mzIdentML file; see the mzIdentML 1.1 specification, section 5.1.3 on page. The suggested regular expression will work in most cases: Thermo, Waters, Bruker BAF.

@thirdwing: looking at the getPsmInfo method and the following mzid chunk

<SpectrumIdentificationResult spectraData_ref="SID_1" spectrumID="controllerType=0 controllerNumber=1 scan=11871" id="SIR_11871">
   <SpectrumIdentificationItem passThreshold="true" rank="1" peptide_ref="Pep2" calculatedMassToCharge="876.8853759765625" experimentalMassToCharge="877.1377563476562" chargeState="4" id="SII_11871_1">
   ...
   <cvParam accession="MS:1001115" cvRef="PSI-MS" value="11871" name="scan number(s)"/>
</SpectrumIdentificationResult>

would

spectrumIdResult[i]->cvParams[0].value

give me the value of the MS:1001115 cvParam, that I could then push to a new

std::vector<int> acquisitionNumber;

thomasp85 commented 9 years ago

We had the same discussion when adding the column in mzID. Neither mzML nor mzIdentML has an acquisitionNum; it is entirely mzR-based and derived from the spectrumID string. Usually it is scan=XXX, but not for all instruments. Agilent comes to mind.

See @sgibb's PR 18 in mzID (https://github.com/thomasp85/mzID/pull/18) for his solution.

On 04 Feb 2015, at 23:03, Laurent Gatto notifications@github.com wrote:

There is no acquisition number in an mzIdentML file. Looking in the mzIdentML 1.1 specification http://www.psidev.info/sites/default/files/mzIdentML1.1.0.pdf, section 5.1.3 on page. The regular expression suggested will work in most cases: Thermo, Waters, Bruker BAF.

@thirdwing https://github.com/thirdwing: looking at the getPsmInfo https://github.com/sneumann/mzR/blob/master/src/RcppIdent.cpp#L116 method and the following mzid chunk

...

would

spectrumIdResult[i]->cvParams[0].value give me the value of the MS:1001115 cvParam, that I could then push to a new

std::vector<int> acquisitionNumber;

lgatto commented 9 years ago

I have pushed @sgibb's suggestion above, as it is the same fix as in mzID. There will be cases where this fails (consistently).

I still think extracting the scan number(s) would be a good thing. I will try to have a go at it.

thomasp85 commented 9 years ago

As far as I recall, @sgibb employed the same conversion as is done when reading mzML files. So yes, it can fail, but only in cases where the raw data fails too. What would be nice to define in the mzR context is how the acquisitionNum should be generated for each subterm of MS:1000767 (native spectrum identifier). In theory this could be a list of term-id/regex pairs that could be used by both mzR and mzID to ensure compatibility.

The current list of subterms is:

[Term]
id: MS:1000768
name: Thermo nativeID format
def: "Native format defined by controllerType=xsd:nonNegativeInteger controllerNumber=xsd:positiveInteger scan=xsd:positiveInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000769
name: Waters nativeID format
def: "Native format defined by function=xsd:positiveInteger process=xsd:nonNegativeInteger scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000770
name: WIFF nativeID format
def: "Native format defined by sample=xsd:nonNegativeInteger period=xsd:nonNegativeInteger cycle=xsd:nonNegativeInteger experiment=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000771
name: Bruker/Agilent YEP nativeID format
def: "Native format defined by scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000772
name: Bruker BAF nativeID format
def: "Native format defined by scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000773
name: Bruker FID nativeID format
def: "Native format defined by file=xsd:IDREF." [PSI:MS]
comment: The nativeID must be the same as the source file ID.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000774
name: multiple peak list nativeID format
def: "Native format defined by index=xsd:nonNegativeInteger." [PSI:MS]
comment: Used for conversion of peak list files with multiple spectra, i.e. MGF, PKL, merged DTA files. Index is the spectrum number in the file, starting from 0.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000775
name: single peak list nativeID format
def: "Native format defined by file=xsd:IDREF." [PSI:MS]
comment: The nativeID must be the same as the source file ID. Used for conversion of peak list files with one spectrum per file, typically folder of PKL or DTAs, each sourceFileRef is different.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000776
name: scan number only nativeID format
def: "Native format defined by scan=xsd:nonNegativeInteger." [PSI:MS]
comment: Used for conversion from mzXML, or DTA folder where native scan numbers can be derived.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000777
name: spectrum identifier nativeID format
def: "Native format defined by spectrum=xsd:nonNegativeInteger." [PSI:MS]
comment: Used for conversion from mzData. The spectrum id attribute is referenced.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000823
name: Bruker U2 nativeID format
def: "Native format defined by declaration=xsd:nonNegativeInteger collection=xsd:nonNegativeInteger scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000824
name: no nativeID format
def: "No nativeID format indicates that the file tagged with this term does not contain spectra that can have a nativeID format." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000929
name: Shimadzu Biotech nativeID format
def: "Native format defined by source=xsd:string start=xsd:nonNegativeInteger end=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1001480
name: AB SCIEX TOF/TOF nativeID format
def: "Native format defined by jobRun=xsd:nonNegativeInteger spotLabel=xsd:string spectrum=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1001508
name: Agilent MassHunter nativeID format
def: "Native format defined by scanId=xsd:nonNegativeInteger." [PSI:PI]
is_a: MS:1000767 ! native spectrum identifier format

In general it should be easy to have a function that takes a vector of IDs as well as a term ID and returns a vector of acquisitionNums. If you agree this should be done, I can put some time into it.
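
As a minimal sketch of what such a function could look like: the names nativeIdKeys and nativeIdToAcquisitionNum are hypothetical, and so is the choice of which key=value field counts as the acquisition number for each format, since that mapping is exactly what would still need to be agreed on. Formats with no single obvious integer field (WIFF, Bruker FID, Shimadzu, and so on) are left out and yield NA.

```r
## Hypothetical lookup: PSI-MS nativeID format term -> name of the field
## whose integer value is used as the acquisition number (an assumption).
nativeIdKeys <- c(
  "MS:1000768" = "scan",      # Thermo
  "MS:1000769" = "scan",      # Waters (scan alone may repeat across functions)
  "MS:1000771" = "scan",      # Bruker/Agilent YEP
  "MS:1000772" = "scan",      # Bruker BAF
  "MS:1000774" = "index",     # multiple peak list
  "MS:1000776" = "scan",      # scan number only
  "MS:1000777" = "spectrum",  # spectrum identifier
  "MS:1000823" = "scan",      # Bruker U2
  "MS:1001508" = "scanId"     # Agilent MassHunter
)

## ids:  character vector of nativeID / spectrumID strings
## term: a single nativeID format accession, e.g. "MS:1000768"
nativeIdToAcquisitionNum <- function(ids, term) {
  key <- unname(nativeIdKeys[term])
  ## Unknown or unsupported format: no acquisition number can be derived.
  if (is.na(key)) {
    return(rep(NA_integer_, length(ids)))
  }
  ## Match "key=<digits>" as a whole space-delimited token.
  pattern <- paste0("(^| )", key, "=([0-9]+)( |$)")
  m <- regmatches(ids, regexec(pattern, ids))
  ## Each match holds the full match plus three groups; group 2 is the number.
  vapply(m, function(x) if (length(x) == 4L) as.integer(x[3L]) else NA_integer_,
         integer(1))
}

nativeIdToAcquisitionNum("controllerType=0 controllerNumber=1 scan=11871",
                         "MS:1000768")
## 11871
```

IDs that do not follow the declared format give NA instead of an unrelated number, which makes failures visible.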

/Thomas

On 05 Feb 2015, at 10:35, Laurent Gatto notifications@github.com wrote:

I have pushed @sgibb's suggestion above, as it is the same fix as in mzID. There will be cases where this fails (consistently).

I still think extracting the scan number(s) would be a good thing. I will try to have a go at it.


lgatto commented 9 years ago

In general it should be easy to have a function that takes a vector of IDs as well as a term ID and returns a vector of acquisitionNums. If you agree this should be done, I can put some time into it.

Sure, that would be great.