Closed GoogleCodeExporter closed 9 years ago
Sean will give both suggestions to an ABRF working group with deep experience in
protein ambiguity / protein inference. I uploaded the files (without grouping
solution, alternative1, alternative2, and both alternatives in one XML) to the
repository under "/svn/trunk/examples/" (named Use_case_Toledo...).
CAUTION: In both alternatives the "molecule_ref" in the
"ProteinDetectionHypothesis" seems to be redundant (it is given later in the
"PeptideEvidence" element as the "evidence" attribute). I suggest renaming
"evidence" to "peptide_ref".
Original comment by eisena...@googlemail.com
on 29 Apr 2008 at 12:12
Maybe I misunderstood Andy's proposal on the conference call yesterday. When I
try to write some example XML, I'm not sure that it looks better.
Andy proposed that we provide a separate 'lookup' for protein grouping rather
than use 'nesting'. So, rather than having:
<ProteinDetectionResult identifier="group1">
<ProteinDetectionHypothesis identifier="Accession1" ref="HSP7D_MANSE">
<PeptideEvidence
. . .
<ProteinDetectionHypothesis identifier="Accession2" ref="HSP7D_FROG">
<PeptideEvidence
We would have something like:
<Protein identifier="Accession1" ref="HSP7D_MANSE">
<PeptideEvidence... />
</Protein>
<Protein identifier="Accession2" ref="HSP7D_FROG">
<PeptideEvidence... />
</Protein>
<ProteinGroupingSet>
<ProteinDetectionResult identifier="group1">
<DetectionHypothesis Protein_ref="Accession1"/>
<DetectionHypothesis Protein_ref="Accession2"/>
</ProteinDetectionResult>
<ProteinDetectionResult identifier="group2">
<DetectionHypothesis Protein_ref="Accession3"/>
...
</ProteinDetectionResult>
</ProteinGroupingSet>
I'm not so sure it gains us much, because it means following more references
through the file.
Original comment by dcre...@gmail.com
on 9 May 2008 at 3:21
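To make the cost of "following more references" concrete, here is a minimal Python sketch that resolves the lookup layout quoted above. It is only an illustration: the `<Example>` root element, the self-closed elements, and the `peptide_ref` values are assumptions added to make the fragment well-formed, and `resolve_groups` is a hypothetical helper, not part of any schema tooling.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the lookup layout from the comment above.
# The <Example> root and the peptide_ref values are assumptions.
DOC = """
<Example>
  <Protein identifier="Accession1" ref="HSP7D_MANSE">
    <PeptideEvidence peptide_ref="pep1"/>
  </Protein>
  <Protein identifier="Accession2" ref="HSP7D_FROG">
    <PeptideEvidence peptide_ref="pep2"/>
  </Protein>
  <ProteinGroupingSet>
    <ProteinDetectionResult identifier="group1">
      <DetectionHypothesis Protein_ref="Accession1"/>
      <DetectionHypothesis Protein_ref="Accession2"/>
    </ProteinDetectionResult>
  </ProteinGroupingSet>
</Example>
"""


def resolve_groups(xml_text):
    """Map each group identifier to the 'ref' values of its member proteins."""
    root = ET.fromstring(xml_text)
    # First pass: build the lookup table keyed by protein identifier.
    proteins = {p.get("identifier"): p for p in root.iter("Protein")}
    groups = {}
    # Second pass: follow each Protein_ref back into the lookup table.
    for result in root.iter("ProteinDetectionResult"):
        members = []
        for hypothesis in result.iter("DetectionHypothesis"):
            protein = proteins[hypothesis.get("Protein_ref")]
            members.append(protein.get("ref"))
        groups[result.get("identifier")] = members
    return groups


groups = resolve_groups(DOC)
print(groups)
```

The extra work compared with the nested layout is the first pass that builds the identifier table. In exchange, each protein is written once and can be referenced from any number of groups.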
Response from andrewrobertjonesv (by email)
Hi David,
The XML you've produced matches pretty well what I was proposing.
I agree that the original proposal makes for simpler XML, but I was wondering
whether it correctly captures the semantics of how protein identification is
done in different search engines.
To me it depends on whether there may be separate processes here:
1) Identification of peptides
2) Matching peptides to proteins i.e. simple protein identification
3) Deciding which proteins are most likely to be correct according to
statistical thresholds, a requirement for unique peptides, etc.
If we keep the proposal as is, there is an implication from the software that
either protein A or protein B is correct but not both. It may also rule out more
complicated groupings without repeating the protein identifications.
Example
Prot A: Pep1, Pep11, Pep12
Prot B: Pep2, Pep11
Prot C: Pep2, Pep12
Prot D: Pep12,Pep16
In the example, A and B are conflicting, B and C are conflicting, A, B and D are
conflicting. Would these all be represented in the same group, even though there
is no common peptide across all of them?
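The question can be checked mechanically. The sketch below (an illustration only, not a grouping rule from any search engine) links two proteins whenever they share at least one peptide and then collects the connected components with a small union-find. The function name `shared_peptide_groups` is hypothetical.

```python
from itertools import combinations

# Protein-to-peptide assignments copied from the example above.
PROTEINS = {
    "A": {"Pep1", "Pep11", "Pep12"},
    "B": {"Pep2", "Pep11"},
    "C": {"Pep2", "Pep12"},
    "D": {"Pep12", "Pep16"},
}


def shared_peptide_groups(proteins):
    """Group proteins that are linked, directly or transitively, by a shared peptide."""
    parent = {name: name for name in proteins}

    def find(x):
        # Walk up to the representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(proteins, 2):
        if proteins[a] & proteins[b]:  # at least one peptide in common
            parent[find(a)] = find(b)

    groups = {}
    for name in proteins:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in groups.values())


groups = shared_peptide_groups(PROTEINS)
common = set.intersection(*PROTEINS.values())
print(groups)  # one group containing A, B, C and D
print(common)  # empty: no peptide is shared by all four
```

Under this linking rule all four proteins fall into one group even though the intersection of their peptide sets is empty, which is exactly the situation the question describes.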
My proposal was to separate processes out a bit to allow for more flexibility:
1) All peptide to spectrum matches are represented once.
2) All proteins are listed once (containing references to the peptide evidence)
3) Output of grouping, protein statistical scoring, and additional processing
is represented at least once.
Representing the data this way allows several different angles on which proteins
are "correct" within a single file. I suppose this comes down to the scope of
the format: are we intending that one file should only have one set of protein
identifications? If so, the original proposal will probably work fine, with the
caveat that it prevents a protein being assigned to more than one group. If one
file is allowed to contain several different processes to determine which
proteins are correct, I think the grouping structure enforces that the same
protein details would need to be repeated - although perhaps this would not be a
bad thing, since protein-level evidence could be used to determine which
peptides are correct...?
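As a rough data-structure sketch of the three-layer separation described above (peptide-spectrum matches once, proteins once, grouping output one or more times), assuming hypothetical class names, field names, and identifiers throughout:

```python
from dataclasses import dataclass, field


@dataclass
class PeptideSpectrumMatch:
    """Layer 1: each peptide-to-spectrum match is stored once."""
    identifier: str
    spectrum_id: str
    peptide: str


@dataclass
class Protein:
    """Layer 2: each protein is stored once and points at its evidence."""
    identifier: str
    psm_refs: list = field(default_factory=list)


@dataclass
class GroupingResult:
    """Layer 3: one grouping/scoring pass; a file may hold several."""
    source: str
    groups: dict = field(default_factory=dict)  # group id -> protein ids


def groups_containing(protein_id, results):
    """List (source, group id) pairs that reference the given protein."""
    return [
        (result.source, group_id)
        for result in results
        for group_id, members in result.groups.items()
        if protein_id in members
    ]


# Hypothetical data: two analyses over the same protein list.
proteins = [Protein("prot_A", ["psm1"]), Protein("prot_B", ["psm2"]),
            Protein("prot_C", ["psm3"])]
results = [
    GroupingResult("analysis_1", {"group1": ["prot_A", "prot_B"]}),
    GroupingResult("analysis_2", {"group1": ["prot_A", "prot_B"],
                                  "group2": ["prot_B", "prot_C"]}),
]
memberships = groups_containing("prot_B", results)
print(memberships)
```

Because the groups only hold references, `prot_B` can appear in more than one group, and in more than one analysis, without its details being repeated.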
A bit of a long-winded response, but perhaps I'm coming round to the original
grouping proposal. It would need to be documented very clearly what the
groupings mean, since the intended meaning is quite specific.
I think this is difficult to model because deciding whether a protein
identification is correct has two orthogonal types of evidence:
1) Quality of peptide identifications
2) Conflicting assignments of peptides to proteins
The proposed grouping structure may capture this okay, but I'd like to see a few
use cases coded up first before I finally agree!
Original comment by dcre...@gmail.com
on 9 May 2008 at 3:28
> In the example, A and B are conflicting, B and C are conflicting,
> A, B and D are conflicting. Would these all be represented in the
> same group, even though there is no common peptide across all of them?
Yes, this could be the case
> are we intending that one file should only have one set of protein
> identifications?
Not necessarily. E.g. we could have Mascot and ProteinProphet ids.
Original comment by dcre...@gmail.com
on 9 May 2008 at 3:34
I have to apologize that I can't really say whether or not I think the schema
works until I see an instance document. However, I'm concerned that some of the
simpler alternatives (if I understand them correctly) for how to capture protein
identifications are not adequate. I'm not even sure who wrote this originally,
but this concerns me:
> If we keep the proposal as is, there is an implication from the software that
> either protein A or protein B is correct but not both, it may also rule out
> more complicated groupings without repeating the protein identifications.
>
> Example
>
> Prot A: Pep1, Pep11, Pep12
> Prot B: Pep2, Pep11
> Prot C: Pep2, Pep12
> Prot D: Pep12,Pep16
>
> In the example, A and B are conflicting, B and C are conflicting, A, B and D
> are conflicting. Would these all be represented in the same group, even though
> there is no common peptide across all of them?
The problem is that at the root, the spectra are really the evidence, and these
can also have multiple peptides that cannot necessarily be distinguished. While
it is true that we establish the connection from a protein to the root spectral
evidence via a link to a specific peptide from that spectrum, competing, nearly
equivalent, or undifferentiable protein hypotheses (whatever you want to call
them - proteins in a 'group') may try to explain the same spectrum with a
different peptide. The most obvious case is alternate peptide answers differing
only by I/L. Thus, there is no problem with the lack of common peptides in the
example above, so long as there are common spectra at the root. This is another
reason why we cannot just list all peptides cited by proteins in a group - it
must be possible for the peptide links to be specific to a protein hypothesis
within a protein group.
The perception of a protein group corresponding to a detected analyte is more
than just philosophy here. It directly dictates how we must handle a situation
where we actually detect multiple related isoforms: as separate protein groups
that may have some peptides in common and will have some spectra in common. This
is easier to illustrate with an example. I'll do this if needed, but it might be
easier if someone made a proposed instance doc to start with. Note that in the
example above, whether there should be two protein groups or not cannot be
determined from what is given. It depends on the spectral intersection - e.g.
are Pep1 and Pep2 alternate answers to the same MS/MS spectrum?
Sean
Original comment by seanlsey...@gmail.com
on 9 May 2008 at 4:08
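The dependence on spectral intersection can be illustrated with a short Python sketch. The protein-to-peptide table is the one from the quoted example. The peptide-to-spectrum assignments (`s1` to `s5`) are invented for illustration, and the "spectra not explained by the other protein" check is an assumed stand-in for a real grouping rule, not a rule stated in this thread.

```python
# Protein-to-peptide assignments from the quoted example.
PROTEINS = {
    "A": {"Pep1", "Pep11", "Pep12"},
    "B": {"Pep2", "Pep11"},
    "C": {"Pep2", "Pep12"},
    "D": {"Pep12", "Pep16"},
}

# Hypothetical scenario 1: Pep1 and Pep2 are alternate answers to spectrum s1.
SAME_SPECTRUM = {
    "Pep1": {"s1"}, "Pep2": {"s1"}, "Pep11": {"s2"},
    "Pep12": {"s3"}, "Pep16": {"s4"},
}
# Hypothetical scenario 2: Pep2 comes from its own spectrum s5.
DIFFERENT_SPECTRA = dict(SAME_SPECTRUM, Pep2={"s5"})


def spectra_for(protein, peptide_spectra):
    """Collect every spectrum cited by a protein through its peptides."""
    spectra = set()
    for peptide in PROTEINS[protein]:
        spectra |= peptide_spectra[peptide]
    return spectra


def unexplained_by(protein, other, peptide_spectra):
    """Spectra cited by `protein` that `other` does not explain."""
    return spectra_for(protein, peptide_spectra) - spectra_for(other, peptide_spectra)


same_case = unexplained_by("B", "A", SAME_SPECTRUM)
different_case = unexplained_by("B", "A", DIFFERENT_SPECTRA)
print(same_case)       # empty: every spectrum cited by B is also cited by A
print(different_case)  # B cites a spectrum that A cannot account for
```

With identical peptide lists, the two scenarios give different answers about whether B adds any spectral evidence beyond A, which is why the peptide table alone cannot settle the number of groups.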
Sometimes we just go round in circles...
It turns out that Andy's proposal is exactly the same as Alex's proposal in
Toledo. I obviously didn't explain it well enough for Sean to see the
difference. Compare:
http://code.google.com/p/psi-pi/source/browse/trunk/examples/Use_case_Toledo_grouping_alt1_Sean.xml
http://code.google.com/p/psi-pi/source/browse/trunk/examples/Use_case_Toledo_grouping_alt2_Alexandre.xml
I think that we have agreed to go with Sean's proposal rather than Alex's.
Original comment by dcre...@gmail.com
on 13 May 2008 at 10:51
We have a ProteinAmbiguityGroup element for that. Sean will revisit an example
Martin assembled for ProGroup.
Original comment by eisena...@googlemail.com
on 17 Jun 2008 at 11:47
Original issue reported on code.google.com by
dcre...@gmail.com
on 28 Apr 2008 at 1:19