Support protein grouping

GoogleCodeExporter commented 8 years ago

In the current schema, DeterminationResultSet does not support protein
grouping. There are 2 suggestions (Sean, Alexandre), not TOO different, but
we need to agree on one.

There is also a linked discussion on having one general type replace both
DeterminationResultSet and PolypeptideResultSet.

Original issue reported on code.google.com by dcre...@gmail.com on 28 Apr 2008 at 1:19

GoogleCodeExporter commented 8 years ago

Sean will give both suggestions to an ABRF working group with deep experience in
protein ambiguity / protein inference. I uploaded the files (without grouping
solution, alternative1, alternative2 and both alternatives in one XML) into the
repository to "/svn/trunk/examples/" (named Use_case_Toledo...).

CAUTION: In both alternatives the "molecule_ref" in the "ProteinDetectionHypothesis"
seems to be redundant (it is given later in the "PeptideEvidence" element as the
"evidence" attribute). I suggest renaming "evidence" to "peptide_ref".

Original comment by eisena...@googlemail.com on 29 Apr 2008 at 12:12

GoogleCodeExporter commented 8 years ago

Maybe I misunderstood Andy's proposal on the conference call yesterday. When I try
to write some example XML, I'm not sure that it looks better.
Andy proposed that we provide a separate 'lookup' for protein grouping rather than
use 'nesting'. So, rather than having:

<ProteinDetectionResult  identifier="group1">
  <ProteinDetectionHypothesis identifier="Accession1" ref="HSP7D_MANSE">
    <PeptideEvidence
. . .
  <ProteinDetectionHypothesis identifier="Accession2" ref="HSP7D_FROG">
    <PeptideEvidence

We would have something like:
<Protein identifier="Accession1" ref="HSP7D_MANSE">
  <PeptideEvidence... />
</Protein>
<Protein identifier="Accession2" ref="HSP7D_FROG">
  <PeptideEvidence... />
</Protein>

<ProteinGroupingSet>
  <ProteinDetectionResult  identifier="group1">
     <DetectionHypothesis Protein_ref="Accession1"/>
     <DetectionHypothesis Protein_ref="Accession2"/>
  </ProteinDetectionResult>
  <ProteinDetectionResult  identifier="group2">
     <DetectionHypothesis Protein_ref="Accession3"/>
      ...
  </ProteinDetectionResult>
</ProteinGroupingSet>

I'm not sure it gains us much, since it means following more references through
the file.
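
To make the cost of "following more references" concrete, here is a minimal Python sketch that resolves the 'lookup' layout above. It is an illustration only: the <Root> wrapper, the peptide_ref attribute and the pep1/pep2 values are assumptions added so the fragment parses, and are not part of any agreed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document in the 'lookup' style.
# <Root>, peptide_ref and the pep1/pep2 values are assumptions for illustration.
DOC = """
<Root>
  <Protein identifier="Accession1" ref="HSP7D_MANSE">
    <PeptideEvidence peptide_ref="pep1"/>
  </Protein>
  <Protein identifier="Accession2" ref="HSP7D_FROG">
    <PeptideEvidence peptide_ref="pep2"/>
  </Protein>
  <ProteinGroupingSet>
    <ProteinDetectionResult identifier="group1">
      <DetectionHypothesis Protein_ref="Accession1"/>
      <DetectionHypothesis Protein_ref="Accession2"/>
    </ProteinDetectionResult>
  </ProteinGroupingSet>
</Root>
"""


def resolve_groups(xml_text):
    """Return {group identifier: [protein 'ref' values]} by following Protein_ref."""
    root = ET.fromstring(xml_text)
    # First pass: index every Protein element by its identifier.
    proteins = {p.get("identifier"): p for p in root.findall("Protein")}
    groups = {}
    # Second pass: each hypothesis is a reference that must be looked up.
    for result in root.iter("ProteinDetectionResult"):
        members = []
        for hypothesis in result.findall("DetectionHypothesis"):
            protein = proteins[hypothesis.get("Protein_ref")]
            members.append(protein.get("ref"))
        groups[result.get("identifier")] = members
    return groups


GROUPS = resolve_groups(DOC)
print(GROUPS)
```

A reader of the nested layout gets the group membership in one pass; the lookup layout needs the identifier index built first, which is the extra indirection in question.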

Original comment by dcre...@gmail.com on 9 May 2008 at 3:21

GoogleCodeExporter commented 8 years ago

Response from andrewrobertjonesv (by email)
Hi David,

The XML you've produced matches pretty well what I was proposing.

I agree that the original proposal makes for simpler XML but I was wondering whether
it correctly captures the semantics of how protein identification is done in
different search engines.

To me it depends on whether there may be separate processes here:

1) Identification of peptides
2) Matching peptides to proteins i.e. simple protein identification
3) Deciding which proteins are most likely to be correct according to statistical
thresholds / requirement for unique peptides etc.

If we keep the proposal as is, there is an implication from the software that either
protein A or protein B is correct but not both; it may also rule out more complicated
groupings without repeating the protein identifications.

Example

Prot A: Pep1, Pep11, Pep12
Prot B: Pep2, Pep11
Prot C: Pep2, Pep12
Prot D: Pep12, Pep16

In the example, A and B are conflicting, B and C are conflicting, A, B and D are
conflicting. Would these all be represented in the same group, even though there is
no common peptide across all of them?
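
The overlaps in this example can be checked mechanically. The Python sketch below is only an illustration: it computes which pairs of proteins share peptides, then merges proteins transitively into groups. Transitive merging is an assumption about what "group" might mean, not the behaviour of any particular search engine.

```python
from itertools import combinations

# Peptide lists copied from the example above.
PROTEINS = {
    "A": {"Pep1", "Pep11", "Pep12"},
    "B": {"Pep2", "Pep11"},
    "C": {"Pep2", "Pep12"},
    "D": {"Pep12", "Pep16"},
}


def shared_peptides(proteins):
    """Return {(p, q): shared peptides} for every pair with at least one in common."""
    shared = {}
    for p, q in combinations(sorted(proteins), 2):
        common = proteins[p] & proteins[q]
        if common:
            shared[(p, q)] = common
    return shared


def connected_groups(proteins):
    """Merge proteins that are linked, directly or indirectly, by shared peptides."""
    parent = {p: p for p in proteins}

    def find(p):
        # Walk up to the representative, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in shared_peptides(proteins):
        parent[find(p)] = find(q)

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(group) for group in groups.values())


SHARED = shared_peptides(PROTEINS)
GROUPS = connected_groups(PROTEINS)
COMMON_TO_ALL = set.intersection(*PROTEINS.values())

for pair, peptides in SHARED.items():
    print(pair, sorted(peptides))
print(GROUPS, COMMON_TO_ALL)
```

With these peptide lists the pairwise overlaps are Pep11 (A, B), Pep2 (B, C) and Pep12 (A, C, D); B and D share nothing directly. Under transitive merging all four proteins still end up in one group, while the set of peptides common to all four is empty.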

My proposal was to separate processes out a bit to allow for more flexibility:

1) All peptide to spectrum matches are represented once. 
2) All proteins are listed once (containing references to the peptide evidence)
3) Output of grouping, protein statistical scoring and additional processing is
represented at least once.
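
As a rough illustration of these three layers, here is a Python sketch of one possible in-memory model. All class, field and analysis names (PeptideSpectrumMatch, GroupingResult, analysis_1, ...) are hypothetical and are not element names from the schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PeptideSpectrumMatch:
    """Layer 1: each peptide-to-spectrum match is stored once."""
    identifier: str
    spectrum: str
    peptide: str


@dataclass
class Protein:
    """Layer 2: each protein is listed once and only references its evidence."""
    accession: str
    psm_refs: list = field(default_factory=list)


@dataclass
class GroupingResult:
    """Layer 3: output of one grouping/scoring process; there may be several."""
    analysis: str
    groups: dict = field(default_factory=dict)  # group id -> protein accessions


def groups_containing(accession, results):
    """List (analysis, group id) pairs in which a protein appears."""
    return [
        (result.analysis, group_id)
        for result in results
        for group_id, members in result.groups.items()
        if accession in members
    ]


# Hypothetical data: two grouping analyses over the same proteins.
PSMS = [
    PeptideSpectrumMatch("psm1", "scan_1", "Pep11"),
    PeptideSpectrumMatch("psm2", "scan_2", "Pep2"),
]
PROTEINS = [Protein("A", ["psm1"]), Protein("B", ["psm1", "psm2"])]
RESULTS = [
    GroupingResult("analysis_1", {"g1": ["A", "B"]}),
    GroupingResult("analysis_2", {"g1": ["A"], "g2": ["A", "B"]}),
]

print(groups_containing("A", RESULTS))
```

Because groups hold references rather than nested protein records, protein A can sit in several groups and several analyses without its details being repeated.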

Representing the data this way allows several different angles on which proteins are
"correct" within a single file. I suppose this comes down to the scope of the format:
are we intending that one file should only have one set of protein identifications?
If so, the original proposal will probably work fine, with the caveat that it
prevents a protein from being assigned to more than one group. If one file is allowed
to contain several different processes to determine which proteins are correct, I
think the grouping structure forces the same protein details to be
repeated - although perhaps this would not be a bad thing, since protein level
evidence could be used to determine which peptides are correct...?

A bit of a long-winded response, but perhaps I'm coming round to the original
grouping proposal. What the groupings mean would need to be documented very clearly,
since the intended meaning is quite specific.

I think this is difficult to model because deciding whether a protein identification
is correct relies on two orthogonal types of evidence:
1) Quality of peptide identifications
2) Conflicting assignments of peptides to proteins

The proposed grouping structure may capture this okay but I'd like to see a few use
cases coded up first before I finally agree!

Original comment by dcre...@gmail.com on 9 May 2008 at 3:28

GoogleCodeExporter commented 8 years ago

> In the example, A and B are conflicting, B and C are conflicting, 
> A, B and D are conflicting. Would these all be represented in the 
> same group, even though there is no common peptide across all of them?
Yes, this could be the case

> are we intending that one file should only have one set of protein identifications?
Not necessarily. E.g. one file could have both Mascot and Protein Prophet ids

Original comment by dcre...@gmail.com on 9 May 2008 at 3:34

GoogleCodeExporter commented 8 years ago

I have to apologize that I can't really say whether or not I think the schema works
until I see an instance document. However, I'm concerned that some of the simpler
alternatives (if I understand them correctly) for how to capture protein
identifications are not adequate. I'm not even sure who wrote this originally, but
this concerns me:

> If we keep the proposal as is, there is an implication from the software that
> either protein A or protein B is correct but not both; it may also rule out more
> complicated groupings without repeating the protein identifications.
> 
> Example
> 
> Prot A: Pep1, Pep11, Pep12
> Prot B: Pep2, Pep11
> Prot C: Pep2, Pep12
> Prot D: Pep12, Pep16
> 
> In the example, A and B are conflicting, B and C are conflicting, A, B and D are
> conflicting. Would these all be represented in the same group, even though there is
> no common peptide across all of them?

The problem is that at the root, the spectra are really the evidence, and these can
also have multiple peptides that cannot necessarily be distinguished. While it is
true that we establish the connection from a protein to the root spectral evidence
via a link to a specific peptide from that spectrum, competing/nearly
equivalent/undifferentiable protein hypotheses (whatever you want to call
them - proteins in a 'group') may try to explain the same spectrum with a different
peptide. The most obvious case is alternate peptide answers differing only by I/L.
Thus, there is no problem with the lack of common peptides in the example above, so
long as there are common spectra at the root. This is another reason why we cannot
just list all peptides cited by proteins in a group - it must be possible for the
peptide links to be specific to a protein hypothesis within a protein group.

The perception of a protein group corresponding to a detected analyte is more than
just philosophy here. It directly dictates how we must handle a situation where we
actually detect multiple related isoforms: as separate protein groups that may have
some peptides in common and will have some spectra in common. This is easier to
illustrate in an example. I'll do this if needed, but it might be easier if someone
made a proposed instance doc to start with. Note that in the example above, whether
there should be two protein groups or not cannot be determined from what is given. It
depends on the spectral intersection - e.g. are pep1 and pep2 alternate answers to
the same MSMS spectrum?
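
A small Python sketch of the spectral-intersection point, under stated assumptions: the proteins P and Q, the peptide sequences AILK/ALLK/GGSR and the scan identifiers are all invented for illustration. It shows two protein hypotheses with no peptide in common that still compete for the same spectrum.

```python
# Hypothetical data: AILK and ALLK differ only by I/L and are alternate
# answers to the same spectrum (scan_1). None of these names come from real files.
PROTEIN_PEPTIDES = {
    "P": {"AILK", "GGSR"},
    "Q": {"ALLK"},
}
PEPTIDE_SPECTRA = {
    "AILK": {"scan_1"},
    "ALLK": {"scan_1"},
    "GGSR": {"scan_2"},
}


def spectra_for(protein_peptides, peptide_spectra):
    """Map each protein to the spectra reachable through its peptide links."""
    return {
        protein: set().union(*(peptide_spectra[pep] for pep in peptides))
        for protein, peptides in protein_peptides.items()
    }


def shared_evidence(a, b, protein_peptides, peptide_spectra):
    """Compare two proteins at the peptide level and at the spectrum level."""
    by_spectrum = spectra_for(protein_peptides, peptide_spectra)
    return {
        "peptides": protein_peptides[a] & protein_peptides[b],
        "spectra": by_spectrum[a] & by_spectrum[b],
    }


OVERLAP = shared_evidence("P", "Q", PROTEIN_PEPTIDES, PEPTIDE_SPECTRA)
print(OVERLAP)
```

Here the peptide intersection is empty while the spectrum intersection contains scan_1, so a grouping rule based only on shared peptides would miss the relationship between P and Q.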

Sean

Original comment by seanlsey...@gmail.com on 9 May 2008 at 4:08

GoogleCodeExporter commented 8 years ago

Sometimes we just go round in circles...
It turns out that Andy's proposal is exactly the same as Alex's proposal in Toledo. I
obviously didn't explain it well enough for Sean to see the difference. Compare:

http://code.google.com/p/psi-pi/source/browse/trunk/examples/Use_case_Toledo_grouping_alt1_Sean.xml
http://code.google.com/p/psi-pi/source/browse/trunk/examples/Use_case_Toledo_grouping_alt2_Alexandre.xml

I think that we have agreed to go with Sean's proposal rather than Alex's

Original comment by dcre...@gmail.com on 13 May 2008 at 10:51

GoogleCodeExporter commented 8 years ago

We have a ProteinAmbiguityGroup element for that. Sean will revisit an example Martin
assembled for ProGroup.

Original comment by eisena...@googlemail.com on 17 Jun 2008 at 11:47

GoogleCodeExporter commented 8 years ago

Original comment by eisena...@googlemail.com on 17 Jun 2008 at 11:47

Changed state: Fixed

vogelwk / psi-pi

Support protein grouping #6