PeptideEvidence in mzIdentML

GoogleCodeExporter commented 9 years ago

This issue focuses on the proposal to move the PeptideEvidence (PE)
object from being a child object of SpectrumIdentificationItem (SII) to
the same level as DBSequence and Peptide.

The current schema forces software and databases to create a vast number
of PE objects, which creates significant computational overhead and
makes working with the file format extremely difficult.

The idea of this proposal is to use PE to represent the link between
Peptides and Protein Sequences. Basically, to represent the phrase "a
certain peptide in a certain protein at a certain position". Therefore,
a Peptide_ref was added to PE and PE was moved to be a child element of 
SequenceCollection.
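To illustrate the overhead argument, here is a minimal Python sketch of PE as a shared "peptide in protein at position" link. The class layout, the field names (`peptide_ref`, `dbsequence_ref`, `start`, `end`) and the example ids are assumptions made for illustration; they are not taken from the schema file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideEvidence:
    """One 'peptide in a protein at a position' link (field names are illustrative)."""
    id: str
    peptide_ref: str
    dbsequence_ref: str
    start: int
    end: int


# Hypothetical SequenceCollection content: one peptide that occurs in two proteins.
sequence_collection = [
    PeptideEvidence("PE_1", "pep_1", "prot_A", 15, 23),
    PeptideEvidence("PE_2", "pep_1", "prot_B", 102, 110),
]


def pe_count_as_sii_child(n_sii: int, evidences_per_peptide: int) -> int:
    """PE objects needed when every SII carries its own copies of the PEs."""
    return n_sii * evidences_per_peptide


def pe_count_in_sequence_collection(evidences_per_peptide: int) -> int:
    """PE objects needed when PEs are stored once and only referenced."""
    return evidences_per_peptide


# 1000 spectra matching the same peptide, which maps to two proteins.
old_count = pe_count_as_sii_child(1000, len(sequence_collection))
new_count = pe_count_in_sequence_collection(len(sequence_collection))
```

In this toy case the number of PE objects no longer grows with the number of spectra, which is the point of moving PE next to DBSequence and Peptide.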

As PE is no longer enzyme-specific, the enzyme-specific attribute
"missedCleavages" was removed. It remains to be discussed whether this
information should be present in the schema and, if so, where it should be put.

The SII was adapted to now hold a number of 0 .. n references to PE.
These optional references should represent all "peptides in proteins"
that could be inferred from the spectrum without any regard to protein
inference. It is also valid to just provide a Peptide_ref without any PE
refs in a SII. In such a case, the API would generate a list of all
possible PEs associated with this SII.
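A rough sketch of how an API might resolve the PEs for an SII under this rule is given below. The dictionary layout and the names `index_by_peptide`, `resolve_pe_refs` and `pe_refs` are hypothetical; no existing library is implied.

```python
from collections import defaultdict


def index_by_peptide(evidences):
    """Build a peptide_ref -> [pe_id, ...] lookup from (pe_id, peptide_ref) pairs."""
    index = defaultdict(list)
    for pe_id, peptide_ref in evidences:
        index[peptide_ref].append(pe_id)
    return index


def resolve_pe_refs(sii, index):
    """Return the explicit PE refs of an SII, or infer them from its Peptide_ref."""
    explicit = sii.get("pe_refs")
    if explicit:
        return list(explicit)
    # No PE refs given: fall back to every PE that links to the same Peptide.
    return list(index.get(sii["peptide_ref"], []))


# Hypothetical data: pep_1 occurs in two proteins, pep_2 in one.
evidences = [("PE_1", "pep_1"), ("PE_2", "pep_1"), ("PE_3", "pep_2")]
index = index_by_peptide(evidences)

sii_without_refs = {"id": "SII_1", "peptide_ref": "pep_1"}
sii_with_refs = {"id": "SII_2", "peptide_ref": "pep_1", "pe_refs": ["PE_2"]}

inferred = resolve_pe_refs(sii_without_refs, index)
explicit = resolve_pe_refs(sii_with_refs, index)
```

The fallback only works if every PE for the peptide is present in the file, since the lookup has nothing else to draw on.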

Protein inference should now be handled completely at the
ProteinDetectionHypothesis (PDH) level. Therefore, the
PeptideDetectionHypothesis was adapted to hold a number of 1..n SII_refs
(together with the PE ref as attribute). This would resemble the
statement that the PDH is backed up by this PE identified through the
following SIIs.

Original issue reported on code.google.com by johannes...@gmail.com on 11 Feb 2011 at 3:33

Attachments:

mzIdentML1.1.0.xsd

GoogleCodeExporter commented 9 years ago

I like the look of PE being in the sequence collection.

It appears to me that if we want to keep missedCleavages (and I would vote that
we do), it makes sense to put it on SII, either as an attribute, a new
element or a cvParam.

For all simple cases, I think this model holds up. For complex cases with
multiple enzymes, we may need to model which enzyme this refers to, although
this is probably not covered in the 1.0 schema either.

"It is also valid to just provide a Peptide_ref without any PE
refs in a SII. In such a case, the API would generate a list of all
possible PEs associated with this SII."

Intuitively I don't like the sound of this, some file writers would produce 
PEs, others would not. If all this can be inferred by an API, the argument goes 
that PE is not needed at all. However, we do not enforce that the protein 
sequence be reported (since for some output formats this is not always possible 
without the searched database) so an API would not be able to infer pre, post 
or position. I would prefer that PE must be reported for all valid peptide to 
protein matches by the file writer

"Protein inference should now be handled completely at the
ProteinDetectionHypothesis (PDH) level. Therefore, the
PeptideDetectionHypothesis was adapted to hold a number of 1..n SII_refs
(together with the PE ref as attribute). This would resemble the
statement that the PDH is backed up by this PE identified through the
following SIIs."

Generally I agree with linking to SIIs from PDH. I'm coming round to the idea 
of also including the PE_ref, as a quick link to get to non redundant peptides 
identified without going via all SIIs. It makes a bit more work for writers but 
for some use cases, saves work for file readers. If we stick with this though, 
again I think PE cannot be optional.

Original comment by a...@cuckundoorecords.com on 14 Feb 2011 at 11:14

GoogleCodeExporter commented 9 years ago

"Intuitively I don't like the sound of this, some file writers would produce 
PEs, others would not. If all this can be inferred by an API, the argument goes 
that PE is not needed at all. However, we do not enforce that the protein 
sequence be reported (since for some output formats this is not always possible 
without the searched database) so an API would not be able to infer pre, post 
or position. I would prefer that PE must be reported for all valid peptide to 
protein matches by the file writer"

This was meant differently. It is not the PEs that are optional but the
references from SII to PE (the PE_refs). As every PE links to a Peptide, the PEs
that an SII refers to can be inferred from its Peptide_ref. It is still
mandatory to provide all possible PEs.

"Generally I agree with linking to SIIs from PDH. I'm coming round to the idea 
of also including the PE_ref, as a quick link to get to non redundant peptides 
identified without going via all SIIs. It makes a bit more work for writers but 
for some use cases, saves work for file readers. If we stick with this though, 
again I think PE cannot be optional."

This is exactly our current proposal. The PeptideHypothesis contains the PE_ref
as an attribute and a list of SII_refs as child elements (since several SIIs can
link to the same PE). Only the SIIs that were used for scoring should be
included in this list.
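As a sketch of the structure described here, the following Python snippet builds a PeptideHypothesis with the PE reference as an attribute and the SII references as child elements. The XML names `PeptideEvidence_ref` and `SpectrumIdentificationItem_ref` and all ids are placeholders chosen for illustration; the names in the proposed schema file may differ.

```python
import xml.etree.ElementTree as ET


def build_peptide_hypothesis(pe_ref, sii_refs):
    """Create a PeptideHypothesis element: PE ref as attribute, SII refs as children."""
    if not sii_refs:
        # The proposal asks for 1..n SII references per hypothesis.
        raise ValueError("at least one SII reference is required")
    hypothesis = ET.Element("PeptideHypothesis", {"PeptideEvidence_ref": pe_ref})
    for sii_ref in sii_refs:
        ET.SubElement(hypothesis, "SpectrumIdentificationItem_ref").text = sii_ref
    return hypothesis


# Hypothetical PDH backed by one PE that was identified through two SIIs.
pdh = ET.Element("ProteinDetectionHypothesis", {"id": "PDH_1"})
pdh.append(build_peptide_hypothesis("PE_1", ["SII_1_1", "SII_7_1"]))
xml_text = ET.tostring(pdh, encoding="unicode")
```

A reader can then collect the non-redundant peptides of a PDH from the PE references alone and only follow the SII references when spectrum-level detail is needed.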

If we want to keep enzyme-specific information, I would not want to put it
into SII. Even though this might be convenient at the moment, it does not
reflect the nature of the information. Basically, it is a property of a Peptide
with respect to a certain protein and thus belongs at the PE level. In
my opinion, parameters or sub-elements with a reference to the respective
enzymes seem better suited.

Original comment by johannes...@gmail.com on 14 Feb 2011 at 12:32

GoogleCodeExporter commented 9 years ago

As discussed in the previous mzIdentML conference, we updated the schema
proposal to solve the problem of enzyme-specific information at the
PeptideEvidence (PE) level. A new element called "PeptideEvidenceList"
(PEList) was created under "SequenceCollection". Each of these 1:n PELists
contains 1:n PEs plus 0:n enzyme references (and optional cv / user
parameters). Furthermore, EnzymeType was changed to be an extension of
"IdentifiableType".

If a protocol with two enzymes (A and B) is used, PEs can now be
grouped according to the enzyme(s) they come from. For example: one PEList
for all PEs from enzyme A, one for all PEs from enzyme B, and a third PEList
for all PEs where it is unclear whether they come from A or B (this list then
contains two enzyme references).
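The grouping could be sketched in Python as follows, assuming a writer already knows the candidate enzyme(s) for each PE. The function name `group_into_pe_lists` and the ids are hypothetical.

```python
from collections import defaultdict


def group_into_pe_lists(evidence_enzymes):
    """Map each distinct set of candidate enzymes to the PEs that share it."""
    groups = defaultdict(list)
    for pe_id, enzymes in evidence_enzymes.items():
        # A frozenset is hashable, so it can serve as the key of one PEList.
        groups[frozenset(enzymes)].append(pe_id)
    return dict(groups)


# Hypothetical two-enzyme protocol: PE_3 cannot be assigned to A or B alone.
evidence_enzymes = {
    "PE_1": {"ENZ_A"},
    "PE_2": {"ENZ_B"},
    "PE_3": {"ENZ_A", "ENZ_B"},
    "PE_4": {"ENZ_A"},
}
pe_lists = group_into_pe_lists(evidence_enzymes)
```

Each key corresponds to the enzyme references of one PEList, so the ambiguous group is the one whose key holds both enzymes.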

As enzyme-specific information should no longer be a problem at the
PE level, the previously removed attribute "missedCleavages" was added
again.

Additionally, we simplified the names of several elements, for example by
removing the "PSI....." part from the beginning of the name. Finally, we changed
"SearchModificationType" as proposed in the last call: ModParam was
removed and its attributes as well as the cvParam were added to
"SearchModificationType". The multiplicity of cvParam was furthermore
changed to 1:n.

The proposed schema was added to the repository:
http://code.google.com/p/psi-pi/source/browse/trunk/schema/mzIdentML1.1.0.xsd

Original comment by johannes...@gmail.com on 28 Feb 2011 at 4:08

GoogleCodeExporter commented 9 years ago

Original comment by eisena...@googlemail.com on 3 Apr 2011 at 2:53

Added labels: Milestone-Release1.1
Removed labels: Version1.1

GoogleCodeExporter commented 9 years ago

agreed at Heidelberg

Original comment by eisena...@googlemail.com on 12 Apr 2011 at 9:14

Changed state: Fixed

mwalzer / psi-pi

PeptideEvidence in mzIdentML #56