Both the peptide sequence and protein sequence (DBSequence) have the following
data type:
<xsd:simpleType name="sequence">
<xsd:restriction base="xsd:string">
<xsd:pattern value="[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*"/>
</xsd:restriction>
</xsd:simpleType>
i.e. only upper-case letters are allowed.
The pre/post attributes have the following data type:
<xsd:restriction base="xsd:string">
<xsd:pattern value="[ABCDEFGHIJKLMNOPQRSTUVWXYZ?\-]{1}"/>
</xsd:restriction>
Here '?' is meant for de novo and '-' is meant for a terminus. We have spotted
some problems in DBSequence and in pre/post where the search database contains
pseudo-genes with stop codons ('*'). These can get passed through following
conversion (e.g. from X!Tandem), causing file validation errors.
Obviously real pseudo-genes should not produce proteins, but one may well
search for evidence of protein in a proteogenomics context.
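To make the failure concrete, here is a minimal Python sketch that mirrors the two patterns above using `re.fullmatch` (an XSD pattern has to match the whole value). The function names are hypothetical and this is not the validator's actual code; it only shows that a '*' fails both checks.

```python
import re

# Character set copied from the schema patterns quoted above.
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Mirrors the "sequence" simpleType: zero or more upper-case letters.
SEQUENCE_RE = re.compile("[" + UPPER + "]*")

# Mirrors the pre/post restriction: exactly one letter, '?' or '-'.
FLANK_RE = re.compile("[" + UPPER + r"?\-]{1}")


def is_valid_sequence(seq: str) -> bool:
    """True if seq would satisfy the 'sequence' pattern (hypothetical helper)."""
    return SEQUENCE_RE.fullmatch(seq) is not None


def is_valid_flank(residue: str) -> bool:
    """True if residue would satisfy the pre/post pattern (hypothetical helper)."""
    return FLANK_RE.fullmatch(residue) is not None
```

With these helpers, `is_valid_sequence("MKT*LLA")` and `is_valid_flank("*")` both return `False`, which is the situation described above.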
Resolution:
- I don't think there is a case for Peptide to contain '*' - we should probably
report this as an error and just reject the mzid at validation
- It seems reasonable for a DBSequence to contain '*' characters for stop
codons. Shall we make a special-case exception for this in mzid 1.2?
- For pre/post, should we also allow '*' with a defined meaning (stop codon),
convert it to '-' (meaning terminus), or allow some other special
character?
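A rough Python sketch of how the options above might look in a converter or pre-validation step. None of this is decided: the relaxed DBSequence pattern, the function names and the `stop_as_terminus` switch are all assumptions made for illustration only.

```python
import re
from typing import Optional

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Current schema pattern for Peptide sequences (no '*').
PEPTIDE_RE = re.compile("[" + UPPER + "]*")

# Hypothetical relaxed pattern: DBSequence may also carry '*' for stop codons.
DB_SEQUENCE_RELAXED_RE = re.compile("[" + UPPER + "*]*")


def check_peptide(seq: str) -> Optional[str]:
    """Return an error message if the peptide should be rejected, else None."""
    if "*" in seq:
        return "Peptide sequence contains a stop codon ('*')"
    if PEPTIDE_RE.fullmatch(seq) is None:
        return "Peptide sequence contains characters outside A-Z"
    return None


def check_db_sequence(seq: str) -> bool:
    """True if seq fits the hypothetical relaxed DBSequence pattern."""
    return DB_SEQUENCE_RELAXED_RE.fullmatch(seq) is not None


def normalise_flank(residue: str, stop_as_terminus: bool = True) -> str:
    """Handle '*' in pre/post: map it to '-' or keep it, depending on the option."""
    if residue == "*":
        return "-" if stop_as_terminus else "*"
    return residue
```

Whether `normalise_flank` should convert or keep '*' is exactly the open question in the last bullet, so the sketch leaves it as a switch.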
We have observed that some search engines (X!Tandem in this case) allow '*'
within the peptide or protein sequence, causing a validation error after
conversion.
Original issue reported on code.google.com by andrewro...@googlemail.com on 4 Mar 2015 at 10:37