Unknown start and end indices in PeptideEvidence #79

GoogleCodeExporter commented 8 years ago

Copied to the issues list, so we don't lose track - see detail below.

I agree this is not ideal, although I don't know if using a zero code is more 
useful than just leaving the attribute out. I guess zero has the advantage of 
making it explicit that you are reporting an unknown position, rather than the 
attribute being missing because of a bad exporter. It would have been better if 
we had made the attribute mandatory and had a code for unknown. A zero might 
crash a version 1.1.0-aware file reader that expects values >=1, but I suppose 
we are making other changes for version 1.1.1, so we could include this update 
there.

Andy

From: Seth Just [mailto:seth.just@proteomesoftware.com] 
Sent: 02 July 2013 18:51
To: psidev-pi-dev@lists.sourceforge.net
Subject: [Psidev-pi-dev] Unknown start and end indices in PeptideEvidence

Dear all,

We are running into a small issue with reporting peptide start and stop indices 
for Scaffold's mzIdentML export. The specification states that these attributes 
are optional, but "Must be provided unless this is a de novo search." However, 
in some situations (e.g. when Scaffold cannot align an identified peptide with 
a protein sequence because the sequence is absent or does not match what was 
searched) we do not have any meaningful indices to report.

At the moment we are reporting such peptide evidence elements with start and 
end set to zero:

<PeptideEvidence id="Pep_LZZC+57C+57BZPLLZZ_DBSeq_ALBU_CONTR" start="0" end="0" 
pre="?" post="?" isDecoy="false" dBSequence_ref="DBSeq_ALBU_CONTR" 
peptide_ref="LZZC+57C+57BZPLLZZ"/>

However, I feel that there is room for improvement here, as the current 
specification causes difficulty for both producers and consumers of the format. 
Particularly, there is an asymmetry in the way start/end and pre/post are 
reported: pre and post are also optional attributes, but the specification 
provides a value to indicate uncertainty ("If for any reason it is unknown 
(e.g.  denovo), pre="?" should be used.")

I feel that we should add a sentence to the specification document addressing 
how to report unknown start and end indices; something along the lines of "If 
for any reason it is unknown, start="0" should be used." This would not require 
schema changes, but would still allow consumers to clearly interpret files, 
even if start and end indices are not known.

Thoughts or concerns? If nobody objects, I feel that this should be put into 
the next revision of the specification so that we can unambiguously cover this 
use case.

Thanks!
-Seth

Original issue reported on code.google.com by andrewro...@googlemail.com on 4 Jul 2013 at 3:37
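
For illustration, the producer-side convention proposed in the email above could look roughly like the following Python sketch. It is not Scaffold's actual exporter code: the helper name `peptide_evidence`, its parameters, and the two constants are hypothetical, and the zero sentinel is only the value proposed at this point in the thread. The `pre="?"` / `post="?"` handling mirrors the wording quoted from the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical constants for this sketch.
UNKNOWN_INDEX = "0"    # sentinel proposed in the email above (not yet in the spec)
UNKNOWN_RESIDUE = "?"  # sentinel the specification already defines for pre/post


def peptide_evidence(pe_id, peptide_ref, db_ref, start=None, end=None,
                     pre=None, post=None, is_decoy=False):
    """Build a PeptideEvidence element, writing sentinels for unknown values."""
    attrs = {
        "id": pe_id,
        # Write the sentinel instead of omitting the attribute, so a consumer
        # can tell "unknown" apart from "exporter forgot to write it".
        "start": UNKNOWN_INDEX if start is None else str(start),
        "end": UNKNOWN_INDEX if end is None else str(end),
        "pre": pre or UNKNOWN_RESIDUE,
        "post": post or UNKNOWN_RESIDUE,
        "isDecoy": "true" if is_decoy else "false",
        "dBSequence_ref": db_ref,
        "peptide_ref": peptide_ref,
    }
    return ET.Element("PeptideEvidence", attrs)


# The peptide from the example above, which could not be aligned to a protein.
unaligned = peptide_evidence(
    "Pep_LZZC+57C+57BZPLLZZ_DBSeq_ALBU_CONTR",
    "LZZC+57C+57BZPLLZZ",
    "DBSeq_ALBU_CONTR",
)
print(ET.tostring(unaligned, encoding="unicode"))
```

Writing the sentinel explicitly, rather than leaving the attribute out, keeps start/end symmetric with pre/post, which is the asymmetry the email points out.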

GoogleCodeExporter commented 8 years ago

I have made a proposal to the 1.2-candidate schema to add the following to the 
documentation for start and end:

"If the search method identifies a peptide that cannot be unambiguously mapped to 
a start position in the referenced protein, a code of -1 MUST be used."

If we start accepting a few minor schema changes in mzid 1.2, I could also be 
persuaded to make these attributes mandatory - there is no good reason for them 
to be optional AND there be error codes for unknown values.

I have temporarily removed the documentation about de novo searches, since we 
will likely change this section anyway, following issue 82.

Original comment by andrewro...@googlemail.com on 24 Jan 2014 at 9:36

GoogleCodeExporter commented 8 years ago

Discussed today - decision to allow error codes of -1 for start and end. Keep 
them as optional for now (since we don't want to make any schema changes that 
could break existing examples), but strongly encourage their use.

Original comment by andrewro...@googlemail.com on 31 Jan 2014 at 4:20
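
On the consumer side, the decision above could be handled with something like the Python sketch below. It is only an illustration under stated assumptions: the function name `read_position` is hypothetical, and treating `0` as "unknown" is an assumption made here for files written before this decision (the thread only says that Scaffold was writing zeros); it is not part of the agreed wording. Absent attributes are also mapped to "unknown", since start and end remain optional.

```python
import xml.etree.ElementTree as ET

UNKNOWN = -1  # error code agreed in this thread for start and end


def read_position(element, name):
    """Return a 1-based index, or None when the position is unknown."""
    raw = element.get(name)
    if raw is None:
        # start/end are still optional, so the attribute may simply be absent.
        return None
    value = int(raw)
    if value == UNKNOWN:
        return None
    if value == 0:
        # Assumption: tolerate the zero sentinel some exporters already wrote.
        return None
    if value < 1:
        raise ValueError(f"invalid {name}={raw!r}: expected -1 or a value >= 1")
    return value


sample = ET.fromstring(
    '<PeptideEvidence id="PE_1" start="-1" end="-1" pre="?" post="?" '
    'isDecoy="false" dBSequence_ref="DBSeq_1" peptide_ref="Pep_1"/>'
)
print(read_position(sample, "start"), read_position(sample, "end"))
```

Returning `None` for every "unknown" case means downstream code has a single condition to check, and a reader that previously assumed values >=1 never sees the sentinel itself.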
