support reporting of fragment ions

GoogleCodeExporter commented 9 years ago

We currently have no support for reporting fragment ions. Phil to post the
issue and a possible solution.

Original issue reported on code.google.com by eisena...@googlemail.com on 12 Jun 2008 at 9:02

GoogleCodeExporter commented 9 years ago

I had a chat with Simon H about this:

From Simon:

easy - don't explicitly model this at all - it's far too controversial, will
depend on search tool, and will make a right old mess of the xml. If you were
going to do it, the lines below would constitute a decent start but would
require a monumental effort from all concerned to consider all the extra
wrinkles (charge state, mods, losses, refs back to the mzML, etc). We'd never
get agreement !

Don't go there !

-Simon-

Jones, Andy wrote:
Hi all,

Last week on the analysisXML call we briefly discussed the requirement 
for representing matches to fragment ions (y, b, neutral losses etc) 
in the format but we haven’t made any progress yet. If you have any 
thoughts on whether this is required, and how we might best represent 
it, let me know.

I was thinking something along the lines of:

 <Fragment expectedMass="762.21" obsMass="761.1" obsAbundance="24">
    <Type><cvParam name="y-ion"/></Type>
    <subsequence>APGC</subsequence>
 </Fragment>

Original comment by andrewro...@googlemail.com on 12 Jun 2008 at 2:46


GoogleCodeExporter commented 9 years ago

TeleCon June 12th: For poor spectra with lots of noise, all of the fragments
often get identified, but these identifications are misleading.

Should search engines write out an mzML file with the processed spectra that
they have analysed? Or should engines report which peaks they have matched?
This is redundant information that can be rebuilt, and the size of the file may
be unmanageable.

Original comment by eisena...@googlemail.com on 17 Jun 2008 at 12:17


GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

TeleCon 10th July: Concerns were expressed that specifying fragment ions is
potentially very verbose and will bloat the XML files.  Clearly this needs to
be an optional element, if done at all.  Users are currently divided over the
need for this feature.  Phil to post simple solution currently being
implemented in PRIDE.

Original comment by eisena...@googlemail.com on 15 Jul 2008 at 3:16


GoogleCodeExporter commented 9 years ago

The latest development version of the PRIDE database includes a very simple
mechanism for recording fragment ion information, illustrated below.  (Please
note - made up data.)

In this example, CV terms are used to define the type of ion and related
information / annotation.  Note that this is even simpler than the suggestion
made by Andy above - no attempt is made here to indicate which residue has been
called for each fragment ion - it is just listing the ions.

Also note that while the PeptideItem references the mass spectrum (which is
reported in detail in the associated mzData file), the individual fragment ions
just report the m/z value and do not attempt to make any kind of hard reference
to the spectrum.

As you can see, this has been developed in collaboration with Waters, with
output from the ProteinLynx Global Server. (Actual values / sequence have been
changed.)

best regards,

Phil.

<PeptideItem>
  <Sequence>LFQQSQWTREVFSNSCK</Sequence>
  <Start>435</Start>
  <End>460</End>
  <SpectrumReference>123</SpectrumReference>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00032" name="b ion" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="379.2215"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="1382.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error" value="-7.1543"/>
    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error" value="0.0207"/>
  </FragmentIon>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00032" name="b ion" value="4"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="534.2811"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="1242.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error" value="-8.2315"/>
    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error" value="0.0029"/>
  </FragmentIon>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00031" name="y ion" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="394.1813"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="1917.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error" value="-14.7098"/>
    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error" value="-0.0013"/>
  </FragmentIon>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00035" name="y ion -H2O" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="367.1669"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="345.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error" value="-18.767"/>
    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error" value="0.0025"/>
  </FragmentIon>
  <additional>
    <cvParam cvLabel="Waters" accession="PLGS:00014" name="precursor mass" value="1971.9194"/>
    <cvParam cvLabel="Waters" accession="PLGS:00015" name="precursor intensity" value="181349.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00016" name="precursor error in ppm" value="0.8043"/>
    <cvParam cvLabel="Waters" accession="PLGS:00017" name="precursor retention time in minutes" value="57.3537"/>
    <cvParam cvLabel="Waters" accession="PLGS:00019" name="product ion mass RMS error" value="14.5969"/>
    <cvParam cvLabel="Waters" accession="PLGS:00020" name="product ion retention time RMS error" value="0.0093"/>
    <cvParam cvLabel="Waters" accession="PLGS:00021" name="weighted average charge state" value="2.2"/>
    <cvParam cvLabel="Waters" accession="PLGS:00039" name="pass one match" value=""/>
  </additional>
</PeptideItem>
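
For readers who want to see what consuming this flat layout might involve, here is a minimal Python sketch. It assumes the element and attribute names shown in the made-up example (FragmentIon, cvParam, name, value); the helper name `read_fragment_ions` and the trimmed sample are hypothetical and not part of PRIDE.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the made-up PeptideItem (two ions, two cvParams each).
SAMPLE = """
<PeptideItem>
  <Sequence>LFQQSQWTREVFSNSCK</Sequence>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00032" name="b ion" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="379.2215"/>
  </FragmentIon>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00031" name="y ion" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="394.1813"/>
  </FragmentIon>
</PeptideItem>
"""


def read_fragment_ions(xml_text):
    """Return one {cvParam name: value} dict per FragmentIon element.

    Hypothetical helper: values are kept as strings because the layout
    carries no type information beyond the CV term itself.
    """
    root = ET.fromstring(xml_text)
    ions = []
    for frag in root.iter("FragmentIon"):
        ions.append({p.get("name"): p.get("value") for p in frag.iter("cvParam")})
    return ions


ions = read_fragment_ions(SAMPLE)
for ion in ions:
    print(ion)
```

Note that the reader has to know which CV term carries the ion type (here "b ion" or "y ion") and which carry measurements; nothing in the structure distinguishes them.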

Original comment by philip.j...@gmail.com on 17 Jul 2008 at 12:27


GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

An attempt to summarize the discussion on the mailing list:

- clarify the intention: only for annotation in viewers, or for more?
- controversy: whether the info can be reconstructed or not (perhaps with a
  vendor script)
- different possibilities:
  --with masses: verbose (PRIDE) => 10 GB files; arrays (phenyx);
  --without masses: small syntax (Matt)
- separate file (later, part of v2)

Original comment by eisena...@googlemail.com on 30 Jul 2008 at 3:20


GoogleCodeExporter commented 9 years ago

Original comment by eisena...@googlemail.com on 31 Jul 2008 at 3:01

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

From sequest OUT file: 
 SrcDR_test_dave/SrcDR_test_dave.13537.13537.2.out
 TurboSEQUEST - PVM Slave v.27 (rev. 12), (c) 1998-2005
 Molecular Biotechnology, Univ. of Washington, J.Eng/S.Morgan/J.Yates
 Licensed to Thermo Electron Corp.
 07/15/2008, 07:12 PM, 4.8 sec. on jsblade4-g
 (M+H)+ mass = 2190.11272 ~ 1.0000 (+2), fragment tol = 1.0000 , MONO/MONO
 total inten = 9497.8, lowest Sp = 144.0, # matched peptides = 63253
 # amino acids = 15106174, # proteins = 29282, /usr/database/refseq_human.20050818.fasta
 ion series nABY ABCDVWXYZ: 0 1 1 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
 display top 10/5, ion % = 0.0, CODE = 101020
 (M* +15.9949) (C# +57.0520) (K@ +7.9322) (L^ +7.0276)  Enzyme:Trypsin(KR) (2)

  #   Rank/Sp      Id#     (M+H)+    deltCn   XCorr    Sp    Ions   Reference                           Peptide
 ---  --------  --------  --------   ------  ------   -----  -----  ---------                           -------
  1.   1 / 11      28550 2190.32661  0.0000  1.1511   207.6  13/54  gi|7706497|ref|NP_057392.1|         R.CRSGL^L^HVLGLSFL^LQTRR.P
  2.   2 /143      28550 2190.32661  0.0312  1.1152   145.9  11/54  gi|7706497|ref|NP_057392.1|         R.CRSGLL^HVL^GLSFL^LQTRR.P
  3.   3 /  2      28550 2190.32661  0.0341  1.1118   250.7  14/54  gi|7706497|ref|NP_057392.1|         R.CRSGL^L^HVLGL^SFLLQTRR.P
  4.   4 /  1      20341 2191.02776  0.0342  1.1118   293.6  14/51  gi|48717241|ref|NP_001001661.1|     K.LL^C#FDDEGTPRTKEEDCR.L
  5.   4 /  1      20341 2191.02776  0.0342  1.1118   293.6  14/51  gi|48717241|ref|NP_001001661.1|     K.L^LC#FDDEGTPRTKEEDCR.L
  6.   5 / 29      28550 2190.32661  0.0715  1.0688   180.9  12/54  gi|7706497|ref|NP_057392.1|         R.CRSGLL^HVL^GL^SFLLQTRR.P
  7.   6 / 40      28550 2190.32661  0.0715  1.0688   174.6  12/54  gi|7706497|ref|NP_057392.1|         R.CRSGL^LHVL^GL^SFLLQTRR.P
  8.   7 /105      11631 2190.10497  0.0963  1.0402   151.9  12/54  gi|71773329|ref|NP_001146.2|    +1  R.L^ILGLMMPPAHYDAK@QLK@K.A
                   11742  gi|71773415|ref|NP_004024.2| annex
  9.   7 /105      11631 2190.10497  0.0963  1.0402   151.9  12/54  gi|71773329|ref|NP_001146.2|    +1  R.LIL^GLMMPPAHYDAK@QLK@K.A
                   11742  gi|71773415|ref|NP_004024.2| annex
 10.   8 /105      11631 2190.10497  0.0963  1.0402   151.9  12/54  gi|71773329|ref|NP_001146.2|    +1  R.L^ILGLMMPPAHYDAK@QLKK@.A
                   11742  gi|71773415|ref|NP_004024.2| annex

  1.      28550  gi|7706497|ref|NP_057392.1| cytidylate kinase [Homo sapiens
  2.      28550  gi|7706497|ref|NP_057392.1| cytidylate kinase [Homo sapiens
  3.      28550  gi|7706497|ref|NP_057392.1| cytidylate kinase [Homo sapiens
  4.      20341  gi|48717241|ref|NP_001001661.1| zinc finger protein 425 [Homo sapiens]
  5.      20341  gi|48717241|ref|NP_001001661.1| zinc finger protein 425 [Homo sapiens

 Seq  #      b          c          y      (+1)
 --- --  ---------  ---------  ---------  --
  C   1   104.01646   121.04301      -      19
  R   2   260.11757   277.14412  2087.31743  18
  S   3   347.14960   364.17615  1931.21632  17
  G   4   404.17106   421.19761  1844.18429  16
  L   5   524.28276+  541.30931  1787.16282+ 15
  L   6   644.39445   661.42100  1667.05113  14
  H   7   781.45336+  798.47991  1546.93944  13
  V   8   880.52178+  897.54833  1409.88052  12
  L   9   993.60584  1010.63239+ 1310.81211+ 11
  G  10  1050.62731+ 1067.65385+ 1197.72805+ 10
  L  11  1163.71137  1180.73792  1140.70658+  9
  S  12  1250.74340+ 1267.76995  1027.62252   8
  F  13  1397.81181  1414.83836   940.59049+  7
  L  14  1517.92351+ 1534.95005   793.52208   6
  L  15  1631.00757  1648.03412   673.41038   5
  Q  16  1759.06615  1776.09270   560.32632   4
  T  17  1860.11383  1877.14037   432.26774   3
  R  18  2016.21494  2033.24149   331.22006   2
  R  19      -          -       175.11895   1
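
As an illustration of how the matched fragments could be recovered from a table like this, here is a minimal Python sketch. It assumes that a trailing "+" on a mass marks a matched ion, that the second column is the b/c series number, and that the last column is the y series number; the function name `matched_ions` is hypothetical. Only three rows of the table are embedded to keep the example short.

```python
# Three rows copied from the fragment table; "-" means no ion for that cell.
TABLE = """\
  C   1   104.01646   121.04301      -      19
  L   5   524.28276+  541.30931  1787.16282+ 15
  L   9   993.60584  1010.63239+ 1310.81211+ 11
"""


def matched_ions(table_text):
    """Return (series, number, m/z) for every cell flagged with a trailing '+'."""
    matches = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip headers, rulers and blank lines
        _residue, n_index, b, c, y, c_index = fields
        for series, cell, number in (("b", b, n_index), ("c", c, n_index), ("y", y, c_index)):
            if cell.endswith("+"):
                matches.append((series, int(number), float(cell[:-1])))
    return matches


for ion in matched_ions(TABLE):
    print(ion)
```

Counting the "+" marks by hand in the full table gives 13, which agrees with the 13/54 in the Ions column for the top hit, so the "+" reading looks plausible.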

Original comment by delag...@gmail.com on 31 Jul 2008 at 3:43


GoogleCodeExporter commented 9 years ago

Just for reference, here's the proposal made by Matt Chambers:

For basic annotation, all I think is needed is the fragment type, series number,
charge state, and possibly any modification like a neutral loss or radical. The
array can be an attribute or text node. We can use a grammar for each term,
where each term represents an ion and terms are space delimited. The grammar
might look like:
<a|b|c|x|y|z><# between 1 and peptide_length>[<+|-><formula>][,<+|-><charge>]
We could make the charge part mandatory or, if it was optional, assume a
+1 charge (or possibly allow the charge to be based on the polarity of
the source scan?). I assume there is a standard chemical formula format that
can be represented compactly in ASCII text, but I don't know it.

An example to show how compact it could be:
fragmentIons="b3 y7,+2 b4 y5 y4 b7-H2O y3 y2 b7-H2O,+2 y3 y2"
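
To make the grammar concrete, here is a minimal Python sketch of a parser for it. It is only one reading of the proposal: the formula is assumed to be a plain alphanumeric string such as H2O, a missing charge defaults to +1, and the series number is not checked against the peptide length. The names `TERM` and `parse_fragment_ions` are hypothetical.

```python
import re

# <series><number>[<+|-><formula>][,<+|-><charge>]
TERM = re.compile(
    r"^(?P<series>[abcxyz])"
    r"(?P<number>[1-9][0-9]*)"
    r"(?:(?P<sign>[+-])(?P<formula>[A-Za-z][A-Za-z0-9]*))?"
    r"(?:,(?P<charge>[+-][0-9]+))?$"
)


def parse_fragment_ions(text, default_charge=1):
    """Parse a space-delimited ion list into (series, number, loss, charge) tuples."""
    ions = []
    for term in text.split():
        m = TERM.match(term)
        if m is None:
            raise ValueError(f"unrecognised fragment ion term: {term!r}")
        # Keep the sign with the formula so a gain and a loss stay distinguishable.
        loss = m.group("sign") + m.group("formula") if m.group("formula") else None
        charge = int(m.group("charge")) if m.group("charge") else default_charge
        ions.append((m.group("series"), int(m.group("number")), loss, charge))
    return ions


print(parse_fragment_ions("b3 y7,+2 b7-H2O,+2"))
# [('b', 3, None, 1), ('y', 7, None, 2), ('b', 7, '-H2O', 2)]
```

A comma before the charge is what keeps "b7-H2O,+2" unambiguous; without it a parser could not tell a charge from the start of a formula.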

Original comment by andrewro...@googlemail.com on 1 Aug 2008 at 9:21


GoogleCodeExporter commented 9 years ago

Here's the proposal discussed on the list by myself and Matt Chambers on 1st
August:

First up, set up a FragmentationTable for the entire list of spectra, which
says the kinds of measures you're going to report lower down:

<SpectrumIdentificationList id="MASCOT_results">
        <FragmentationTable>
            <Measures>
                <Measure id="mz">
                    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z"/>
                </Measure>
                <Measure id="intens">
                    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity"/>
                </Measure>
                <Measure id="mz_error">
                    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error"/>
                </Measure>
                <Measure id="retent">
                    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error"/>
                </Measure>
            </Measures>
        </FragmentationTable>

Then, for each SpectrumIdentificationItem, you reference back to these Measures:

<SpectrumIdentificationItem id="SEQ_spec1_pep1" Peptide_ref="prot1_pep1" chargeState="1">
        <PeptideEvidence id="PE1_SEQ_spec1_pep1" start="67" pre="-" end="79" isDecoy="false"/>

    ...

        <Fragmentation>
            <IonType cvLabel="Waters" accession="PLGS:00035" name="y ion -H2O" index="3 8 10">
                <FragArray Measure_ref="mz" values="379.2215 457.1234 540.234"/>
                <FragArray Measure_ref="intens" values="1382.0 2055.5 340.0"/>
                <!-- and so on for other measures as defined in the FragmentationTable -->
            </IonType>
            <IonType cvLabel="Waters" accession="PLGS:00032" name="b ion" index="2 12 14">
                <FragArray Measure_ref="mz" values="560.153 859.111 945.653"/>
                <FragArray Measure_ref="intens" values="502.0 330.5 559.5"/>
                <!-- and so on for other measures as defined in the FragmentationTable -->
            </IonType>
        </Fragmentation>

The IonType element extends cvParam with an extra index attribute of type
xsd:list. Alternatively, the indices could go in the value field of cvParam
(with no XSD data type checking); I don't have much preference either way.
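
A minimal Python sketch of how a reader might unpack this array layout is given below. It assumes that IonType wraps its FragArray children and that every values list is parallel to the index list, which the proposal implies but does not state; `read_fragmentation` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# One IonType taken from the proposal, with IonType enclosing its arrays.
SAMPLE = """
<Fragmentation>
  <IonType cvLabel="Waters" accession="PLGS:00032" name="b ion" index="2 12 14">
    <FragArray Measure_ref="mz" values="560.153 859.111 945.653"/>
    <FragArray Measure_ref="intens" values="502.0 330.5 559.5"/>
  </IonType>
</Fragmentation>
"""


def read_fragmentation(xml_text):
    """Return one dict per matched ion with its name, index and each measure."""
    rows = []
    for ion_type in ET.fromstring(xml_text).iter("IonType"):
        indices = [int(i) for i in ion_type.get("index").split()]
        arrays = {
            arr.get("Measure_ref"): [float(v) for v in arr.get("values").split()]
            for arr in ion_type.iter("FragArray")
        }
        # Assumption: each array must line up one-to-one with the index list.
        for name, values in arrays.items():
            if len(values) != len(indices):
                raise ValueError(f"{name}: expected {len(indices)} values, got {len(values)}")
        for pos, index in enumerate(indices):
            row = {"ion": ion_type.get("name"), "index": index}
            row.update({name: values[pos] for name, values in arrays.items()})
            rows.append(row)
    return rows


rows = read_fragmentation(SAMPLE)
for row in rows:
    print(row)
```

The length check is the part a schema cannot enforce: xsd:list validates the item type of each list, not that two separate attributes have the same number of items.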

Original comment by andrewro...@googlemail.com on 6 Aug 2008 at 9:20


GoogleCodeExporter commented 9 years ago

Looks 'good' to me. (Although I still claim it's all misleading, unnecessary
etc. etc.)

We should add "product ion m/z", "product ion intensity", "product ion m/z
error" and the most common ion series to the PSI CV. The list at the bottom of:

http://www.matrixscience.com/help/fragmentation_help.html

may be helpful. 
index is presumably 1-based and, for y type ions, starts at the C terminus?
And I'd slightly prefer to use index as an xsd:list rather than the cv value.

Immonium ions would work OK with this format, although the m/z values won't be
an ascending list, but the values listed in the page above. You'll be able to
tell which immonium ion it is by using the index to look back into the peptide
sequence, which is probably OK even if not totally intuitive. The alternative is
a different cv value for each immonium ion?
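
The indexing convention being asked about can be sketched in a few lines of Python. This assumes the index is 1-based, counts from the N terminus for a/b/c ions and from the C terminus for x/y/z ions, and names a single residue position for immonium ions; none of this had been agreed at this point in the thread, and `fragment_subsequence` is a hypothetical name.

```python
def fragment_subsequence(peptide, series, index):
    """Return the residues an ion index refers to, under the assumed convention."""
    if not 1 <= index <= len(peptide):
        raise ValueError(f"index {index} outside 1..{len(peptide)}")
    if series in ("a", "b", "c"):
        return peptide[:index]       # first `index` residues from the N terminus
    if series in ("x", "y", "z"):
        return peptide[-index:]      # last `index` residues from the C terminus
    if series == "immonium":
        return peptide[index - 1]    # the single residue at that position
    raise ValueError(f"unknown series: {series!r}")


peptide = "LFQQSQWTREVFSNSCK"  # made-up sequence from the PRIDE example
print(fragment_subsequence(peptide, "b", 3))         # LFQ
print(fragment_subsequence(peptide, "y", 3))         # SCK
print(fragment_subsequence(peptide, "immonium", 7))  # W
```

The immonium case shows why the lookup is not very intuitive: the index says where the residue sits, while the reader mostly cares which residue it is.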

Presumably, we'll just say that internals aren't supported? Scroll part way
down in:
http://www.matrixscience.com/cgi/peptide_view.pl?file=../data/FoGArrS.dat&query=1&hit=1&index=gi%7c229340&px=1&section=5&ave_thresh=40
to see a table of ya and yb internals...

Original comment by dcre...@gmail.com on 6 Aug 2008 at 10:21


GoogleCodeExporter commented 9 years ago

Looks good to me, although no experience with it. 

We think that the documentation of the (search engine's original) fragment ion
finding can be useful.

Is the saving of space by compressing the arrays significant (base64, ...)?
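
One way to get a feel for the answer is to measure it. The Python sketch below encodes the three made-up m/z values from the proposal as space-separated text, as base64 of little-endian 64-bit doubles, and as base64 of the zlib-compressed doubles. The choice of 64-bit doubles and zlib is an assumption borrowed from common mass spectrometry formats, not something decided in this thread.

```python
import base64
import struct
import zlib

values = [379.2215, 457.1234, 540.234]  # made-up m/z values from the proposal

# Plain text, as in the values="..." attribute.
as_text = " ".join(repr(v) for v in values)

# Little-endian 64-bit doubles, then base64 (assumed binary layout).
packed = struct.pack(f"<{len(values)}d", *values)
as_base64 = base64.b64encode(packed).decode("ascii")

# Same bytes, zlib-compressed before base64.
as_zlib_base64 = base64.b64encode(zlib.compress(packed)).decode("ascii")

print(len(as_text), len(as_base64), len(as_zlib_base64))

# Round trip to confirm the binary form is lossless.
decoded = list(struct.unpack(f"<{len(values)}d", base64.b64decode(as_base64)))
print(decoded == values)
```

For these three values the text form is 25 characters and the uncompressed base64 form is 32, so base64 alone does not save space for short lists of four-decimal numbers. Whether compression pays off on realistic array lengths would have to be measured on real data.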

I added a "fragmentation information" term to the obo with the terms mentioned 
above
and some ion types (just to have it in the obo; more documentation and 
discussion
needed).

Original comment by eisena...@googlemail.com on 7 Aug 2008 at 1:09


GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

As Martin describes above in comment 13, he has already added fragmentation ion
information to the OBO file.  These existing additions are reproduced here
(just the terms, showing the is-a hierarchy):

protein informatics cv
|_
    search result details
    |_
        peptide result information
        |_
            fragmentation information
            |_  
                frag: y ion
                frag: b ion
                frag: b ion - H2O
                frag: y ion - H2O
                product ion m/z             ?? Presumably this is the observed, rather than the calculated m/z ??
                product ion intensity
                product ion m/z error
                frag: x ion
                frag: a ion
                frag: z ion
                frag: c ion

Additional potential terms may include the following.  Note that some of these
could be used in different contexts in the XML (i.e. annotations of the 
entire peptide identification and not just a single fragment):

Please note that these are taken from the data published in PRIDE, originating
from Waters and are part of their controlled vocabulary for annotating
peptide identifications and fragment ions.

Some represent derived values that may not appear in analysisXML and so
can be disregarded.

[From Waters OBO file]:
b ion -NH3

y ion -NH3

number of product ions

average precursor RMS mass error

average product ion RMS retention time error

average product ions RMS mass error

precursor mass

precursor intensity

precursor error in ppm

precursor retention time in minutes

product ion mass RMS error

product ion retention time RMS error

weighted average charge state

product ion property

product ion retention time error

protein validation score

product ion type

in source ion

match with neutral loss

match with variable modification

match with missed cleavage

match with in source fragment

non-identified ion

co-eluting ion

[Other potential terms]:
Assuming "product ion m/z" is the observed m/z, a term for 'calculated m/z'?
ion charge

Original comment by philip.j...@gmail.com on 18 Sep 2008 at 2:40


GoogleCodeExporter commented 9 years ago

The other ion types worth adding are:
a ion -NH3
a ion -H2O
d ion
v ion
w ion
immonium ion

Original comment by dcre...@gmail.com on 25 Sep 2008 at 2:14


GoogleCodeExporter commented 9 years ago

>Immonium ions would work OK with this format, although the m/z values won't be
>an ascending list, but the values listed in the page above. You'll be able to
>tell which immonium ion it is by using the index to look back into the peptide
>sequence which is probably OK even if not totally intuitive. The alternative
>is a different cv value for each immonium ion?

Looking back in the peptide sequence is somewhat counter-intuitive, unlike the
straightforward ion series. Perhaps the index list should not really be an
xsd:list, but a context dependent string format. For ion series, the index list
makes perfect sense. For immonium ions, isn't a format like "H F W" more
intuitive?

>Presumably, we'll just say that internals aren't supported? Scroll part way
>down in:
>http://www.matrixscience.com/cgi/peptide_view.pl?file=../data/FoGArrS.dat&query=1&hit=1&index=gi%7c229340&px=1&section=5&ave_thresh=40
>to see a table of ya and yb internals...

It scares me to settle on a format which definitively lacks the capability to
represent a generic concept like internal sequences. I think with the context
dependent index list we could represent these, perhaps as a list of pairs where
each pair contains the N and C terminus offsets marking the begin and end of an
internal subsequence, e.g. "3,5 3,6 4,9". The IonType could determine which
series is being used (i.e. how to calculate the mass for a subsequence).
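
A context dependent index list along these lines could be read as in the Python sketch below. The three kinds ("series", "immonium", "internal") and both function names are hypothetical, and the internal pairs are assumed to be 1-based, inclusive and counted from the N terminus, which is only one possible reading of "N and C terminus offsets".

```python
def parse_index_list(kind, text):
    """Interpret a space-delimited index string according to the ion kind."""
    tokens = text.split()
    if kind == "series":        # e.g. "2 12 14" for b or y ions
        return [int(t) for t in tokens]
    if kind == "immonium":      # e.g. "H F W": residue letters, not positions
        return tokens
    if kind == "internal":      # e.g. "3,5 3,6 4,9": begin,end pairs
        pairs = []
        for token in tokens:
            begin, end = token.split(",")
            pairs.append((int(begin), int(end)))
        return pairs
    raise ValueError(f"unknown ion kind: {kind!r}")


def internal_subsequence(peptide, begin, end):
    """Residues begin..end, assuming 1-based inclusive offsets from the N terminus."""
    if not 1 <= begin <= end <= len(peptide):
        raise ValueError(f"bad internal range {begin},{end}")
    return peptide[begin - 1:end]


peptide = "LFQQSQWTREVFSNSCK"  # made-up sequence from the PRIDE example
for begin, end in parse_index_list("internal", "3,5 3,6 4,9"):
    print(begin, end, internal_subsequence(peptide, begin, end))
```

The cost of this flexibility is the one already noted for the cvParam value field: the attribute becomes a plain string, so a schema can no longer type-check it.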

Original comment by matthew....@vanderbilt.edu on 2 Oct 2008 at 2:28


GoogleCodeExporter commented 9 years ago

The following terms have been added to the OBO file:

b ion -NH3
y ion -NH3
a ion -NH3
a ion -H2O
d ion
v ion
w ion
immonium ion
non-identified ion
co-eluting ion

Any comments welcome.

Original comment by philip.j...@gmail.com on 9 Oct 2008 at 2:13

Changed state: Started

GoogleCodeExporter commented 9 years ago

Original comment by eisena...@googlemail.com on 5 Nov 2008 at 4:18

Changed state: Fixed

mwalzer / psi-pi

support reporting of fragment ions #28