Closed GoogleCodeExporter closed 9 years ago
I had a chat with Simon H about this:
From Simon:
easy - don't explicitly model this at all - it's far too controversial, will depend on search tool, and will make a right old mess of the xml. If you were going to do it, the lines below would constitute a decent start but would require a monumental effort from all concerned to consider all the extra wrinkles (charge state, mods, losses, refs back to the mzML, etc). We'd never get agreement !
Don't go there !
-Simon-
Jones, Andy wrote:
Hi all,
Last week on the analysisXML call we briefly discussed the requirement
for representing matches to fragment ions (y, b, neutral losses etc)
in the format but we haven’t made any progress yet. If you have any
thoughts on whether this is required, and how we might best represent
it, let me know.
I was thinking something along the lines of:
<Fragment expectedMass="762.21" obsMass="761.1" obsAbundance="24">
<Type><cvParam name="y-ion"/></Type>
<subsequence>APGC</subsequence>
</Fragment>
Original comment by andrewro...@googlemail.com
on 12 Jun 2008 at 2:46
TeleCon June 12th: For poor spectra with lots of noise, you often get all of the fragments identified, but these are misleading identifications.
Should search engines write out an mzML file with the processed spectra that they have analysed? Or should engines report which peaks they have matched?
Redundant information - can be rebuilt. Size of file may be unmanageable.
Original comment by eisena...@googlemail.com
on 17 Jun 2008 at 12:17
TeleCon 10th July: Concerns were expressed that specifying fragment ions is potentially very verbose and will bloat the XML files. Clearly this needs to be an optional element, if done at all. Users are currently divided over the need for this feature. Phil to post simple solution currently being implemented in PRIDE.
Original comment by eisena...@googlemail.com
on 15 Jul 2008 at 3:16
The latest development version of the PRIDE database includes a very simple mechanism for recording fragment ion information, illustrated below. (Please note - made up data.)
In this example, CV terms are used to define the type of ion and related information / annotation. Note that this is even simpler than the suggestion made by Andy above - no attempt is made here to indicate which residue has been called for each fragment ion - it is just listing the ions.
Also note that while the PeptideItem is referencing the mass spectrum (which is reported in detail in the associated mzData file), the individual fragment ions are just reporting the m/z value and not attempting to make any kind of hard reference to the spectrum.
As you can see, this has been developed in collaboration with Waters, with output from the ProteinLynx Global Server. (Actual values / sequence have been changed).
best regards,
Phil.
<PeptideItem>
  <Sequence>LFQQSQWTREVFSNSCK</Sequence>
  <Start>435</Start>
  <End>460</End>
  <SpectrumReference>123</SpectrumReference>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00032" name="b ion" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="379.2215"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="1382.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error" value="-7.1543"/>
    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error" value="0.0207"/>
  </FragmentIon>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00032" name="b ion" value="4"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="534.2811"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="1242.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error" value="-8.2315"/>
    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error" value="0.0029"/>
  </FragmentIon>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00031" name="y ion" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="394.1813"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="1917.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error" value="-14.7098"/>
    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error" value="-0.0013"/>
  </FragmentIon>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00035" name="y ion -H2O" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="367.1669"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="345.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error" value="-18.767"/>
    <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error" value="0.0025"/>
  </FragmentIon>
  <additional>
    <cvParam cvLabel="Waters" accession="PLGS:00014" name="precursor mass" value="1971.9194"/>
    <cvParam cvLabel="Waters" accession="PLGS:00015" name="precursor intensity" value="181349.0"/>
    <cvParam cvLabel="Waters" accession="PLGS:00016" name="precursor error in ppm" value="0.8043"/>
    <cvParam cvLabel="Waters" accession="PLGS:00017" name="precursor retention time in minutes" value="57.3537"/>
    <cvParam cvLabel="Waters" accession="PLGS:00019" name="product ion mass RMS error" value="14.5969"/>
    <cvParam cvLabel="Waters" accession="PLGS:00020" name="product ion retention time RMS error" value="0.0093"/>
    <cvParam cvLabel="Waters" accession="PLGS:00021" name="weighted average charge state" value="2.2"/>
    <cvParam cvLabel="Waters" accession="PLGS:00039" name="pass one match" value=""/>
  </additional>
</PeptideItem>
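A minimal sketch of how a consumer might read FragmentIon blocks like the ones above, using Python's standard xml.etree.ElementTree. The function name read_fragment_ions is hypothetical, and the rule used to tell the ion type apart from its measures (any term whose name starts with "product ion " is a measure, the remaining term is the ion type with the series number in value) is an assumption for illustration, not something the PRIDE schema states.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the made-up example data above.
SAMPLE = """
<PeptideItem>
  <Sequence>LFQQSQWTREVFSNSCK</Sequence>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00032" name="b ion" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="379.2215"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="1382.0"/>
  </FragmentIon>
  <FragmentIon>
    <cvParam cvLabel="Waters" accession="PLGS:00035" name="y ion -H2O" value="3"/>
    <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z" value="367.1669"/>
    <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity" value="345.0"/>
  </FragmentIon>
</PeptideItem>
"""

MEASURE_PREFIX = "product ion "


def read_fragment_ions(xml_text):
    """Return one dict per FragmentIon: ion type, series number, measures."""
    root = ET.fromstring(xml_text)
    ions = []
    for frag in root.iter("FragmentIon"):
        ion = {"measures": {}}
        for param in frag.findall("cvParam"):
            name = param.get("name")
            value = param.get("value")
            if name.startswith(MEASURE_PREFIX):
                # Assumption: "product ion ..." terms carry numeric measures.
                ion["measures"][name[len(MEASURE_PREFIX):]] = float(value)
            else:
                # Assumption: the remaining term names the ion type and its
                # value is the series number.
                ion["type"] = name
                ion["number"] = int(value)
        ions.append(ion)
    return ions


if __name__ == "__main__":
    for ion in read_fragment_ions(SAMPLE):
        print(ion["type"], ion["number"], ion["measures"])
```

Note that nothing here links a fragment back to a peak in the spectrum; as described above, only the m/z value is reported.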
Original comment by philip.j...@gmail.com
on 17 Jul 2008 at 12:27
An attempt to summarize the discussion on the mailing list:
- clarify intention: only for annotation in viewers, or for more
- controversy: whether the info can be reconstructed or not (perhaps with a script from the vendor)
- different possibilities:
--with masses: verbose (PRIDE) => 10 GB files; arrays (phenyx);
--without masses: small syntax (Matt)
- separate file (later part of v2)
Original comment by eisena...@googlemail.com
on 30 Jul 2008 at 3:20
Original comment by eisena...@googlemail.com
on 31 Jul 2008 at 3:01
From sequest OUT file:
SrcDR_test_dave/SrcDR_test_dave.13537.13537.2.out
TurboSEQUEST - PVM Slave v.27 (rev. 12), (c) 1998-2005
Molecular Biotechnology, Univ. of Washington, J.Eng/S.Morgan/J.Yates
Licensed to Thermo Electron Corp.
07/15/2008, 07:12 PM, 4.8 sec. on jsblade4-g
(M+H)+ mass = 2190.11272 ~ 1.0000 (+2), fragment tol = 1.0000 , MONO/MONO
total inten = 9497.8, lowest Sp = 144.0, # matched peptides = 63253
# amino acids = 15106174, # proteins = 29282, /usr/database/refseq_human.20050818.fasta
ion series nABY ABCDVWXYZ: 0 1 1 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
display top 10/5, ion % = 0.0, CODE = 101020
(M* +15.9949) (C# +57.0520) (K@ +7.9322) (L^ +7.0276) Enzyme:Trypsin(KR) (2)
# Rank/Sp Id# (M+H)+ deltCn XCorr Sp Ions Reference Peptide
--- -------- -------- -------- ------ ------ ----- ----- --------- -------
1. 1 / 11 28550 2190.32661 0.0000 1.1511 207.6 13/54 gi|7706497|ref|NP_057392.1| R.CRSGL^L^HVLGLSFL^LQTRR.P
2. 2 /143 28550 2190.32661 0.0312 1.1152 145.9 11/54 gi|7706497|ref|NP_057392.1| R.CRSGLL^HVL^GLSFL^LQTRR.P
3. 3 / 2 28550 2190.32661 0.0341 1.1118 250.7 14/54 gi|7706497|ref|NP_057392.1| R.CRSGL^L^HVLGL^SFLLQTRR.P
4. 4 / 1 20341 2191.02776 0.0342 1.1118 293.6 14/51 gi|48717241|ref|NP_001001661.1| K.LL^C#FDDEGTPRTKEEDCR.L
5. 4 / 1 20341 2191.02776 0.0342 1.1118 293.6 14/51 gi|48717241|ref|NP_001001661.1| K.L^LC#FDDEGTPRTKEEDCR.L
6. 5 / 29 28550 2190.32661 0.0715 1.0688 180.9 12/54 gi|7706497|ref|NP_057392.1| R.CRSGLL^HVL^GL^SFLLQTRR.P
7. 6 / 40 28550 2190.32661 0.0715 1.0688 174.6 12/54 gi|7706497|ref|NP_057392.1| R.CRSGL^LHVL^GL^SFLLQTRR.P
8. 7 /105 11631 2190.10497 0.0963 1.0402 151.9 12/54 gi|71773329|ref|NP_001146.2| +1 R.L^ILGLMMPPAHYDAK@QLK@K.A
11742 gi|71773415|ref|NP_004024.2| annex
9. 7 /105 11631 2190.10497 0.0963 1.0402 151.9 12/54 gi|71773329|ref|NP_001146.2| +1 R.LIL^GLMMPPAHYDAK@QLK@K.A
11742 gi|71773415|ref|NP_004024.2| annex
10. 8 /105 11631 2190.10497 0.0963 1.0402 151.9 12/54 gi|71773329|ref|NP_001146.2| +1 R.L^ILGLMMPPAHYDAK@QLKK@.A
11742 gi|71773415|ref|NP_004024.2| annex
1. 28550 gi|7706497|ref|NP_057392.1| cytidylate kinase [Homo sapiens
2. 28550 gi|7706497|ref|NP_057392.1| cytidylate kinase [Homo sapiens
3. 28550 gi|7706497|ref|NP_057392.1| cytidylate kinase [Homo sapiens
4. 20341 gi|48717241|ref|NP_001001661.1| zinc finger protein 425 [Homo sapiens]
5. 20341 gi|48717241|ref|NP_001001661.1| zinc finger protein 425 [Homo sapiens
Seq # b c y (+1)
--- -- --------- --------- --------- --
C 1 104.01646 121.04301 - 19
R 2 260.11757 277.14412 2087.31743 18
S 3 347.14960 364.17615 1931.21632 17
G 4 404.17106 421.19761 1844.18429 16
L 5 524.28276+ 541.30931 1787.16282+ 15
L 6 644.39445 661.42100 1667.05113 14
H 7 781.45336+ 798.47991 1546.93944 13
V 8 880.52178+ 897.54833 1409.88052 12
L 9 993.60584 1010.63239+ 1310.81211+ 11
G 10 1050.62731+ 1067.65385+ 1197.72805+ 10
L 11 1163.71137 1180.73792 1140.70658+ 9
S 12 1250.74340+ 1267.76995 1027.62252 8
F 13 1397.81181 1414.83836 940.59049+ 7
L 14 1517.92351+ 1534.95005 793.52208 6
L 15 1631.00757 1648.03412 673.41038 5
Q 16 1759.06615 1776.09270 560.32632 4
T 17 1860.11383 1877.14037 432.26774 3
R 18 2016.21494 2033.24149 331.22006 2
R 19 - - 175.11895 1
Original comment by delag...@gmail.com
on 31 Jul 2008 at 3:43
Just for reference, here's the proposal made by Matt Chambers:
For basic annotation, all I think is needed is the fragment type, series number, charge state, and possibly any modification like a neutral loss or radical. The array can be an attribute or text node. We can use a grammar for each term, where each term represents an ion and terms are space delimited. The grammar might look like:
<a|b|c|x|y|z><# between 1 and peptide_length>[<+|-><formula>][,<+|-><charge>]
We could make the charge part mandatory or, if it was optional, assume a +1 charge (or possibly allow the charge to be based on the polarity of the source scan?). I assume there is a standard chemical formula format that can be represented compactly in ASCII text, but I don't know it.
An example to show how compact it could be:
fragmentIons="b3 y7,+2 b4 y5 y4 b7-H2O y3 y2 b7-H2O,+2 y3 y2"
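A minimal sketch of a parser for the compact grammar above, written in Python with a regular expression. It assumes the optional-charge variant with a default of +1, and it assumes that a formula starts with an uppercase element symbol so that it cannot be confused with a charge. The names FragmentTerm and parse_fragment_ions are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class FragmentTerm(NamedTuple):
    series: str                  # one of a, b, c, x, y, z
    number: int                  # position in the ion series
    modification: Optional[str]  # e.g. "-H2O", or None
    charge: int                  # signed charge state


# <series><number>[<+|-><formula>][,<+|-><charge>]
TERM_RE = re.compile(
    r"^(?P<series>[abcxyz])"
    r"(?P<number>[1-9]\d*)"
    r"(?P<mod>[+-][A-Z][A-Za-z0-9]*)?"
    r"(?:,(?P<charge>[+-]\d+))?$"
)


def parse_fragment_ions(text, peptide_length=None, default_charge=1):
    """Parse a space-delimited list of ion terms into FragmentTerm tuples."""
    terms = []
    for token in text.split():
        match = TERM_RE.match(token)
        if match is None:
            raise ValueError("not a valid ion term: %r" % token)
        number = int(match.group("number"))
        if peptide_length is not None and number > peptide_length:
            raise ValueError("series number out of range: %r" % token)
        charge_text = match.group("charge")
        # Assumption: a missing charge means the default (+1).
        charge = int(charge_text) if charge_text else default_charge
        terms.append(
            FragmentTerm(match.group("series"), number, match.group("mod"), charge)
        )
    return terms


if __name__ == "__main__":
    example = "b3 y7,+2 b4 y5 y4 b7-H2O y3 y2 b7-H2O,+2 y3 y2"
    for term in parse_fragment_ions(example):
        print(term)
```

Keeping the modification as an opaque string sidesteps the open question of which chemical formula format to standardise on.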
Original comment by andrewro...@googlemail.com
on 1 Aug 2008 at 9:21
Here's the proposal discussed on the list by myself and Matt Chambers on 1st August:
First up, set up a FragmentationTable for the entire list of spectra, which declares the kinds of measures you're going to report lower down:
<SpectrumIdentificationList id="MASCOT_results">
  <FragmentationTable>
    <Measures>
      <Measure id="mz">
        <cvParam cvLabel="Waters" accession="PLGS:00024" name="product ion m/z"/>
      </Measure>
      <Measure id="intens">
        <cvParam cvLabel="Waters" accession="PLGS:00025" name="product ion intensity"/>
      </Measure>
      <Measure id="mz_error">
        <cvParam cvLabel="Waters" accession="PLGS:00026" name="product ion m/z error"/>
      </Measure>
      <Measure id="retent">
        <cvParam cvLabel="Waters" accession="PLGS:00027" name="product ion retention time error"/>
      </Measure>
    </Measures>
  </FragmentationTable>
Then for each SpectrumIdentificationItem, you reference back to these Measures:
<SpectrumIdentificationItem id="SEQ_spec1_pep1" Peptide_ref="prot1_pep1" chargeState="1">
  <PeptideEvidence id="PE1_SEQ_spec1_pep1" start="67" pre="-" end="79" isDecoy="false" />
  ...
  <Fragmentation>
    <IonType cvLabel="Waters" accession="PLGS:00035" name="y ion -H2O" index="3 8 10">
      <FragArray Measure_ref="mz" values="379.2215 457.1234 540.234"/>
      <FragArray Measure_ref="intens" values="1382.0 2055.5 340.0"/>
      <!-- and so on for other measures as defined in the FragmentationTable -->
    </IonType>
    <IonType cvLabel="Waters" accession="PLGS:00032" name="b ion" index="2 12 14">
      <FragArray Measure_ref="mz" values="560.153 859.111 945.653"/>
      <FragArray Measure_ref="intens" values="502.0 330.5 559.5"/>
      <!-- and so on for other measures as defined in the FragmentationTable -->
    </IonType>
  </Fragmentation>
The IonType element extends cvParam with an extra attribute, index, of type xsd:list. Alternatively, the index list could be put in the value field of cvParam (with no XSD data type checking); I don't have much preference either way.
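A minimal sketch of how a reader might expand the array-based layout above into one record per fragment ion, using Python's standard xml.etree.ElementTree. It assumes that FragArray elements are children of IonType and that every values list has exactly as many entries as the index list; the function name expand_fragmentation is hypothetical.

```python
import xml.etree.ElementTree as ET

# One IonType from the proposal, with FragArray as child elements.
SAMPLE = """
<Fragmentation>
  <IonType cvLabel="Waters" accession="PLGS:00032" name="b ion" index="2 12 14">
    <FragArray Measure_ref="mz" values="560.153 859.111 945.653"/>
    <FragArray Measure_ref="intens" values="502.0 330.5 559.5"/>
  </IonType>
</Fragmentation>
"""


def expand_fragmentation(xml_text):
    """Turn parallel index/values arrays into one dict per fragment ion."""
    root = ET.fromstring(xml_text)
    rows = []
    for ion_type in root.iter("IonType"):
        indices = [int(i) for i in ion_type.get("index").split()]
        columns = {}
        for array in ion_type.findall("FragArray"):
            values = [float(v) for v in array.get("values").split()]
            if len(values) != len(indices):
                # Assumption: arrays must line up with the index list.
                raise ValueError("values and index differ in length")
            columns[array.get("Measure_ref")] = values
        for position, index in enumerate(indices):
            row = {"ion": ion_type.get("name"), "index": index}
            for measure, values in columns.items():
                row[measure] = values[position]
            rows.append(row)
    return rows


if __name__ == "__main__":
    for row in expand_fragmentation(SAMPLE):
        print(row)
```

The Measure_ref keys ("mz", "intens") are only meaningful together with the FragmentationTable that defines them, so a real reader would resolve them against that table first.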
Original comment by andrewro...@googlemail.com
on 6 Aug 2008 at 9:20
Looks 'good' to me. (Although I still claim it's all misleading, unnecessary etc. etc.)
We should add "product ion m/z", "product ion intensity", "product ion m/z error" and the most common ion series to the PSI CV. The list at the bottom of:
http://www.matrixscience.com/help/fragmentation_help.html
may be helpful.
index is presumably 1 based, and for y type ions starts at the C terminus?
And I'd slightly prefer to use index as an xsd:list rather than the cv value.
Immonium ions would work OK with this format, although the m/z values won't be an ascending list, but the values listed in the page above. You'll be able to tell which immonium ion it is by using the index to look back into the peptide sequence which is probably OK even if not totally intuitive. The alternative is a different cv value for each immonium ion?
Presumably, we'll just say that internals aren't supported? Scroll part way down in:
http://www.matrixscience.com/cgi/peptide_view.pl?file=../data/FoGArrS.dat&query=1&hit=1&index=gi%7c229340&px=1&section=5&ave_thresh=40
to see a table of ya and yb internals...
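A minimal sketch of the index convention asked about above, in Python. It assumes the answer to the question is yes: indices are 1-based, a/b/c ions count from the N terminus and x/y/z ions from the C terminus. It also assumes an immonium ion's index is the 1-based residue position counted from the N terminus. The function names are hypothetical, and the peptide is the unmodified top-hit sequence from the SEQUEST output earlier in this thread.

```python
N_TERMINAL_SERIES = {"a", "b", "c"}
C_TERMINAL_SERIES = {"x", "y", "z"}


def fragment_residues(sequence, series, index):
    """Return the residues covered by a series ion under a 1-based index."""
    if not 1 <= index <= len(sequence):
        raise ValueError("index out of range")
    if series in N_TERMINAL_SERIES:
        return sequence[:index]   # first `index` residues
    if series in C_TERMINAL_SERIES:
        return sequence[-index:]  # last `index` residues
    raise ValueError("unknown ion series: %r" % series)


def immonium_residue(sequence, index):
    """Look an immonium ion's residue up by 1-based position."""
    if not 1 <= index <= len(sequence):
        raise ValueError("index out of range")
    return sequence[index - 1]


if __name__ == "__main__":
    peptide = "CRSGLLHVLGLSFLLQTRR"
    print(fragment_residues(peptide, "b", 3))
    print(fragment_residues(peptide, "y", 3))
    print(immonium_residue(peptide, 7))
```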
Original comment by dcre...@gmail.com
on 6 Aug 2008 at 10:21
Looks good to me, although no experience with it.
We think that the documentation of the (search engine's original) fragment ion
finding can be useful.
Is the saving of space by compressing the arrays significant (base64, ...)?
I added a "fragmentation information" term to the obo with the terms mentioned above and some ion types (just to have it in the obo; more documentation and discussion needed).
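A minimal sketch, in Python, of one way to look at the base64 question above. It assumes 64-bit little-endian doubles, which is only one possible encoding, and the helper names encode_array and decode_array are hypothetical. For the three-value m/z array from the proposal, the plain text form is 25 characters and the base64 form is 32, so base64 on its own does not save space for short arrays at this precision; whether compressing the bytes first changes that would need to be measured on real data.

```python
import base64
import struct


def encode_array(values):
    """Pack floats as little-endian 64-bit doubles and base64-encode them."""
    raw = struct.pack("<%dd" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text):
    """Inverse of encode_array."""
    raw = base64.b64decode(text)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


if __name__ == "__main__":
    as_text = "379.2215 457.1234 540.234"
    values = [float(v) for v in as_text.split()]
    encoded = encode_array(values)
    print(len(as_text), len(encoded))
    print(decode_array(encoded) == values)
```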
Original comment by eisena...@googlemail.com
on 7 Aug 2008 at 1:09
As Martin describes above in comment 13, he has already added fragmentation ion information to the OBO file. These existing additions are reproduced here (just the terms, showing the is-a hierarchy):
protein informatics cv
|_ search result details
   |_ peptide result information
      |_ fragmentation information
         |_ frag: y ion
         |_ frag: b ion
         |_ frag: b ion - H2O
         |_ frag: y ion - H2O
         |_ product ion m/z ?? Presumably this is the observed, rather than the calculated m/z ??
         |_ product ion intensity
         |_ product ion m/z error
         |_ frag: x ion
         |_ frag: a ion
         |_ frag: z ion
         |_ frag: c ion
Additional potential terms may include the following. Note that some of these
could be used in different contexts in the XML (i.e. annotations of the
entire peptide identification and not just a single fragment):
Please note that these are taken from the data published in PRIDE, originating
from Waters and are part of their controlled vocabulary for annotating
peptide identifications and fragment ions.
Some represent derived values that may not appear in analysisXML and so
can be disregarded.
[From Waters OBO file]:
b ion -NH3
y ion -NH3
number of product ions
average precursor RMS mass error
average product ion RMS retention time error
average product ions RMS mass error
precursor mass
precursor intensity
precursor error in ppm
precursor retention time in minutes
product ion mass RMS error
product ion retention time RMS error
weighted average charge state
product ion property
product ion retention time error
protein validation score
product ion type
in source ion
match with neutral loss
match with variable modification
match with missed cleavage
match with in source fragment
non-identified ion
co-eluting ion
[Other potential terms]:
Assuming "product ion m/z" is the observed m/z, a term for 'calculated m/z'?
ion charge
Original comment by philip.j...@gmail.com
on 18 Sep 2008 at 2:40
The other ion types worth adding are:
a ion -NH3
a ion -H2O
d ion
v ion
w ion
immonium ion
Original comment by dcre...@gmail.com
on 25 Sep 2008 at 2:14
>Immonium ions would work OK with this format, although the m/z values won't be an ascending list, but the values listed in the page above. You'll be able to tell which immonium ion it is by using the index to look back into the peptide sequence which is probably OK even if not totally intuitive. The alternative is a different cv value for each immonium ion?
Looking back in the peptide sequence is somewhat counter-intuitive, unlike the straightforward ion series. Perhaps the index list should not really be an xsd:list, but a context dependent string format. For ion series, the index list makes perfect sense. For immonium ions, isn't a format like "H F W" more intuitive?
>Presumably, we'll just say that internals aren't supported? Scroll part way down in:
>http://www.matrixscience.com/cgi/peptide_view.pl?file=../data/FoGArrS.dat&query=1&hit=1&index=gi%7c229340&px=1&section=5&ave_thresh=40
>to see a table of ya and yb internals...
It scares me to settle on a format which definitively lacks the capability to represent a generic concept like internal sequences. I think with the context dependent index list we could represent these, perhaps as a list of pairs where each pair contains the N and C terminus offsets marking the begin and end of an internal subsequence, i.e. "3,5 3,6 4,9". The IonType could determine which series is being used (i.e. how to calculate the mass for a subsequence).
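A minimal sketch of the context-dependent index list described above, in Python. The three kinds ("series", "immonium", "internal") and the function names are hypothetical labels for illustration. For internal fragments it assumes each pair holds 1-based, inclusive begin and end positions counted from the N terminus, which is only one reading of "N and C terminus offsets". The peptide is the unmodified top-hit sequence from the SEQUEST output earlier in this thread.

```python
def parse_index(ion_kind, text):
    """Interpret an index string according to the kind of ion it annotates."""
    tokens = text.split()
    if ion_kind == "series":
        # Plain ion series: "2 12 14" -> [2, 12, 14]
        return [int(token) for token in tokens]
    if ion_kind == "immonium":
        # Immonium ions: residue letters, e.g. "H F W"
        return tokens
    if ion_kind == "internal":
        # Internal fragments: "begin,end" pairs, e.g. "3,5 3,6 4,9"
        pairs = []
        for token in tokens:
            begin, end = token.split(",")
            pairs.append((int(begin), int(end)))
        return pairs
    raise ValueError("unknown ion kind: %r" % ion_kind)


def internal_subsequences(sequence, pairs):
    """Assumption: pairs are 1-based inclusive positions from the N terminus."""
    return [sequence[begin - 1:end] for begin, end in pairs]


if __name__ == "__main__":
    peptide = "CRSGLLHVLGLSFLLQTRR"
    pairs = parse_index("internal", "3,5 3,6 4,9")
    print(pairs)
    print(internal_subsequences(peptide, pairs))
    print(parse_index("immonium", "H F W"))
```

The cost of this flexibility is the one noted earlier in the thread: a context-dependent string cannot be type-checked by the XSD the way an xsd:list of integers can.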
Original comment by matthew....@vanderbilt.edu
on 2 Oct 2008 at 2:28
The following terms have been added to the OBO file:
b ion -NH3
y ion -NH3
a ion -NH3
a ion -H2O
d ion
v ion
w ion
immonium ion
non-identified ion
co-eluting ion
Any comments welcome.
Original comment by philip.j...@gmail.com
on 9 Oct 2008 at 2:13
Original comment by eisena...@googlemail.com
on 5 Nov 2008 at 4:18
Original issue reported on code.google.com by
eisena...@googlemail.com
on 12 Jun 2008 at 9:02