Closed sgibb closed 7 years ago
Just wanted to add that I also stumbled upon this inconvenience.
I had to rely on the assumption that the spectrumID reflects the order of the spectra in the mzML file. I'm not sure if this is always true.
I think it would be good to add an acquisition number to the psms output; I have also been confused by this in the past. To avoid any ambiguity, retrieving the information at the C++ level and adding a column to the data.frame seems the best option. @thirdwing, could you have a look at some point?
There is no acquisition number in an mzIdentML file. Looking in the mzIdentML 1.1 specification, section 5.1.3 on page. The regular expression suggested will work in most cases: Thermo, Waters, Bruker BAF.
@thirdwing: looking at the getPsmInfo method and the following mzid chunk
<SpectrumIdentificationResult spectraData_ref="SID_1" spectrumID="controllerType=0 controllerNumber=1 scan=11871" id="SIR_11871">
<SpectrumIdentificationItem passThreshold="true" rank="1" peptide_ref="Pep2" calculatedMassToCharge="876.8853759765625" experimentalMassToCharge="877.1377563476562" chargeState="4" id="SII_11871_1">
...
<cvParam accession="MS:1001115" cvRef="PSI-MS" value="11871" name="scan number(s)"/>
</SpectrumIdentificationResult>
would spectrumIdResult[i]->cvParams[0].value give me the value of the MS:1001115 cvParam, that I could then push to a new std::vector<int> acquisitionNumber;
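Indexing with cvParams[0] returns whichever cvParam happens to be listed first, so it only yields the scan number when MS:1001115 is the first entry. A minimal sketch of looking the term up by accession instead is shown here. The `CvParam` struct and both function names are hypothetical stand-ins for illustration, not the ProteoWizard or mzR types, and -1 is an assumed marker for "no scan number found".

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for a parsed <cvParam> element (hypothetical, not the pwiz class).
struct CvParam {
    std::string accession;
    std::string name;
    std::string value;
};

// Copy the value of the first cvParam with the given accession into `out`.
// Returns false when no such cvParam is present.
bool findCvValue(const std::vector<CvParam>& params,
                 const std::string& accession,
                 std::string& out) {
    for (const CvParam& p : params) {
        if (p.accession == accession) {
            out = p.value;
            return true;
        }
    }
    return false;
}

// One entry per SpectrumIdentificationResult; -1 marks results that have no
// MS:1001115 ("scan number(s)") cvParam or whose value is not an integer.
std::vector<int> collectScanNumbers(
        const std::vector<std::vector<CvParam> >& results) {
    std::vector<int> acquisitionNumber;
    acquisitionNumber.reserve(results.size());
    for (const std::vector<CvParam>& params : results) {
        int scan = -1;
        std::string value;
        if (findCvValue(params, "MS:1001115", value)) {
            try {
                scan = std::stoi(value);  // parses the leading integer only
            } catch (const std::exception&) {
                scan = -1;
            }
        }
        acquisitionNumber.push_back(scan);
    }
    return acquisitionNumber;
}
```

Searching by accession keeps the result independent of the order in which a search engine writes its cvParams.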
We had the same discussion when adding the column in mzID. Neither mzML nor mzIdentML has an acquisitionNum; it is entirely mzR-based and derived from the spectrumID string. Usually it is scan=XXX, but not for all instruments; Agilent comes to mind.
See @sgibb's PR 18 in mzID (https://github.com/thomasp85/mzID/pull/18) for his solution…
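As an illustration of deriving the number from the spectrumID string, the following sketch pulls the integer after "scan=" with a regular expression. It is not the code from the PR; the function name and the -1 return value for IDs without a "scan=" field are assumptions made for this example.

```cpp
#include <regex>
#include <string>

// Extract the integer that follows "scan=" in a spectrumID string.
// Returns -1 when the ID has no "scan=<digits>" field.
int scanFromSpectrumId(const std::string& spectrumId) {
    // Require start-of-string or whitespace before "scan=" so that other
    // keys ending in "scan" are not picked up by accident.
    static const std::regex scanField("(?:^|\\s)scan=([0-9]+)");
    std::smatch match;
    if (std::regex_search(spectrumId, match, scanField)) {
        return std::stoi(match[1].str());
    }
    return -1;
}
```

With this pattern the Thermo-style ID "controllerType=0 controllerNumber=1 scan=11871" gives 11871, while an ID of the form "scanId=..." does not match and gives -1, which is the kind of failure mentioned for Agilent.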
On 04 Feb 2015, at 23:03, Laurent Gatto notifications@github.com wrote:
There is no acquisition number in an mzIdentML file. Looking in the mzIdentML 1.1 specification http://www.psidev.info/sites/default/files/mzIdentML1.1.0.pdf, section 5.1.3 on page. The regular expression suggested will work in most cases: Thermo, Waters, Bruker BAF.
@thirdwing https://github.com/thirdwing: looking at the getPsmInfo https://github.com/sneumann/mzR/blob/master/src/RcppIdent.cpp#L116 method and the following mzid chunk
... would spectrumIdResult[i]->cvParams[0].value give me the value of the MS:1001115 cvParam, that I could then push to a new std::vector<int> acquisitionNumber;
I have pushed @sgibb's suggestion above, as it is the same fix as in mzID. There will be cases where this fails (consistently).
I still think extracting the scan number(s) would be a good thing. I will try to have a go at it.
As far as I recall, @sgibb employed the same conversion as is done when reading mzML files. So yes, it can fail, but only in cases where the raw data fails too. What would be nice to define in the mzR context is how the acquisitionNum should be generated for each subterm of MS:1000767 (native spectrum identifier). In theory this could be a list of term-id/regex pairs that could be used by both mzR and mzID to ensure compatibility.
The current list of subterms is:
[Term]
id: MS:1000768
name: Thermo nativeID format
def: "Native format defined by controllerType=xsd:nonNegativeInteger controllerNumber=xsd:positiveInteger scan=xsd:positiveInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000769
name: Waters nativeID format
def: "Native format defined by function=xsd:positiveInteger process=xsd:nonNegativeInteger scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000770
name: WIFF nativeID format
def: "Native format defined by sample=xsd:nonNegativeInteger period=xsd:nonNegativeInteger cycle=xsd:nonNegativeInteger experiment=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000771
name: Bruker/Agilent YEP nativeID format
def: "Native format defined by scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000772
name: Bruker BAF nativeID format
def: "Native format defined by scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000773
name: Bruker FID nativeID format
def: "Native format defined by file=xsd:IDREF." [PSI:MS]
comment: The nativeID must be the same as the source file ID.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000774
name: multiple peak list nativeID format
def: "Native format defined by index=xsd:nonNegativeInteger." [PSI:MS]
comment: Used for conversion of peak list files with multiple spectra, i.e. MGF, PKL, merged DTA files. Index is the spectrum number in the file, starting from 0.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000775
name: single peak list nativeID format
def: "Native format defined by file=xsd:IDREF." [PSI:MS]
comment: The nativeID must be the same as the source file ID. Used for conversion of peak list files with one spectrum per file, typically folder of PKL or DTAs, each sourceFileRef is different.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000776
name: scan number only nativeID format
def: "Native format defined by scan=xsd:nonNegativeInteger." [PSI:MS]
comment: Used for conversion from mzXML, or DTA folder where native scan numbers can be derived.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000777
name: spectrum identifier nativeID format
def: "Native format defined by spectrum=xsd:nonNegativeInteger." [PSI:MS]
comment: Used for conversion from mzData. The spectrum id attribute is referenced.
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000823
name: Bruker U2 nativeID format
def: "Native format defined by declaration=xsd:nonNegativeInteger collection=xsd:nonNegativeInteger scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000824
name: no nativeID format
def: "No nativeID format indicates that the file tagged with this term does not contain spectra that can have a nativeID format." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1000929
name: Shimadzu Biotech nativeID format
def: "Native format defined by source=xsd:string start=xsd:nonNegativeInteger end=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1001480
name: AB SCIEX TOF/TOF nativeID format
def: "Native format defined by jobRun=xsd:nonNegativeInteger spotLabel=xsd:string spectrum=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1001508
name: Agilent MassHunter nativeID format
def: "Native format defined by scanId=xsd:nonNegativeInteger." [PSI:PI]
is_a: MS:1000767 ! native spectrum identifier format
In general it should be easy to have a function that takes a vector of IDs as well as a term ID and returns a vector of acquisitionNums. If you agree this should be done, I can put some time into it…
/Thomas
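The function described above could look roughly like the following sketch: a table from nativeID format accession to the key whose integer value is taken as acquisitionNum, plus a function that applies it to a vector of IDs. The table is a small illustrative subset, and which key to use for each format is an assumption here, since that mapping is exactly what would need to be agreed on. The names `acquisitionKeys` and `acquisitionNums` and the -1 marker are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Assumed mapping: nativeID format accession -> key holding the number.
// Illustrative subset only; formats such as WIFF have no single such key.
const std::map<std::string, std::string>& acquisitionKeys() {
    static const std::map<std::string, std::string> keys = {
        {"MS:1000768", "scan"},      // Thermo nativeID format
        {"MS:1000772", "scan"},      // Bruker BAF nativeID format
        {"MS:1000776", "scan"},      // scan number only nativeID format
        {"MS:1000777", "spectrum"},  // spectrum identifier nativeID format
        {"MS:1001508", "scanId"},    // Agilent MassHunter nativeID format
    };
    return keys;
}

// Map each ID to an acquisition number for the given format accession.
// Entries stay at -1 when the format is not in the table or the key is
// missing from the ID.
std::vector<int> acquisitionNums(const std::vector<std::string>& ids,
                                 const std::string& formatAccession) {
    std::vector<int> out(ids.size(), -1);
    const std::map<std::string, std::string>& keys = acquisitionKeys();
    std::map<std::string, std::string>::const_iterator it =
        keys.find(formatAccession);
    if (it == keys.end()) {
        return out;  // unknown format: nothing can be derived
    }
    const std::regex field("(?:^|\\s)" + it->second + "=([0-9]+)");
    std::smatch match;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (std::regex_search(ids[i], match, field)) {
            out[i] = std::stoi(match[1].str());
        }
    }
    return out;
}
```

Keeping the mapping as data rather than code is what would let mzR and mzID share one list, as suggested above.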
In general it should be easy to have a function that takes a vector of IDs as well as a term ID and returns a vector of acquisitionNums. If you agree this should be done, I can put some time into it…
Sure, that would be great.
Hello,
as investigated in https://github.com/lgatto/msidmatching/issues/2#issuecomment-38047074, the acquisitionNum/scan number reported by mzR for mzML/mzXML files contains the same numbers as the spectrumID column in the psms output for mzIdentML files.
Would it be possible to add an acquisitionnum column to the psms output to allow easier matching of identification and quantitation data, similar to this approach in mzID: https://github.com/thomasp85/mzID/pull/18/files? We use this column in MSnbase for the matching.
A possible solution (I am sure there is an Rcpp way as well):
Maybe it would be cleaner to look for these lines in the mzIdentML and parse them directly:
Best wishes,
Sebastian