mwalzer / psi-pi

Automatically exported from code.google.com/p/psi-pi

Unusual characters in peptides, proteins or pre/post attributes #84

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Both the peptide sequence and protein sequence (DBSequence) have the following 
data type:

    <xsd:simpleType name="sequence">
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*"/>
        </xsd:restriction>
    </xsd:simpleType>

i.e. only upper-case letters are allowed.

The pre/post attributes have the following data type:

    <xsd:restriction base="xsd:string">
        <xsd:pattern value="[ABCDEFGHIJKLMNOPQRSTUVWXYZ?\-]{1}"/>
    </xsd:restriction>

Here '?' is meant for de-novo and '-' is meant for a terminus. We have spotted 
some problems in DBSequence and in pre/post when the search database contains 
pseudogenes with stop codons ('*'). These characters can get passed through 
unchanged during conversion (e.g. from X!Tandem), causing file validation 
errors. Obviously real pseudogenes should not produce proteins, but one may 
well search for evidence of protein in a proteogenomics context.
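
To make the failure concrete, here is a minimal Python sketch that mirrors the two schema patterns quoted above. The function names are made up for illustration and are not part of any mzIdentML tool. XSD patterns are implicitly anchored to the whole value, so `re.fullmatch` is used as the closest equivalent.

```python
import re

# Patterns copied from the schema fragments quoted in this issue.
SEQUENCE_RE = re.compile(r"[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*")
PRE_POST_RE = re.compile(r"[ABCDEFGHIJKLMNOPQRSTUVWXYZ?\-]{1}")


def is_valid_sequence(value: str) -> bool:
    """Hypothetical helper: does value satisfy the 'sequence' simple type?"""
    return SEQUENCE_RE.fullmatch(value) is not None


def is_valid_pre_post(value: str) -> bool:
    """Hypothetical helper: does value satisfy the pre/post pattern?"""
    return PRE_POST_RE.fullmatch(value) is not None


if __name__ == "__main__":
    # A made-up sequence with an internal stop codon, as might come from a
    # pseudogene entry in the search database.
    for seq in ("PEPTIDEK", "MKT*LLV"):
        print(seq, is_valid_sequence(seq))
    for flank in ("K", "-", "?", "*"):
        print(flank, is_valid_pre_post(flank))
```

With the current patterns, any value containing '*' fails both checks, which matches the validation errors described above.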

Resolution:
- I don't think there is a case for Peptide to contain '*'. We should probably 
report this as an error and simply reject the mzid at validation.
- It seems reasonable for a DBSequence to contain '*' characters for stop 
codons. Shall we make a special-case exception for this in mzid 1.2?
- For pre/post, should we also allow '*' with a defined meaning (stop codon), 
should it be converted to '-' (meaning terminal), or should some other special 
characters be allowed?
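
The options in the list above could be sketched roughly as follows. This is only an illustration of the open questions, not a decided behaviour: the relaxed DBSequence pattern, the `policy` parameter and all function names are assumptions made for this example.

```python
import re

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Peptide: unchanged, '*' stays invalid.
PEPTIDE_RE = re.compile("[" + LETTERS + "]*")
# DBSequence: hypothetical relaxation that also accepts '*' (stop codon).
DBSEQUENCE_RE = re.compile("[" + LETTERS + "*]*")
# pre/post: the current single-character rule.
PRE_POST_RE = re.compile("[" + LETTERS + r"?\-]")


def is_peptide_ok(seq: str) -> bool:
    """A Peptide containing '*' would be rejected at validation."""
    return PEPTIDE_RE.fullmatch(seq) is not None


def is_dbsequence_ok(seq: str) -> bool:
    """A DBSequence may contain '*' under the assumed special case."""
    return DBSEQUENCE_RE.fullmatch(seq) is not None


def normalize_pre_post(value: str, policy: str = "convert") -> str:
    """Handle '*' in pre/post according to a hypothetical policy.

    policy="convert": map '*' to '-' (treat the stop codon as a terminus).
    policy="allow":   keep '*' with the meaning 'stop codon'.
    """
    if value == "*":
        if policy == "convert":
            return "-"
        if policy == "allow":
            return "*"
        raise ValueError("unknown policy: " + policy)
    if PRE_POST_RE.fullmatch(value) is None:
        raise ValueError("invalid pre/post value: " + repr(value))
    return value
```

Converting '*' to '-' keeps existing validators working but loses the distinction between a true terminus and a stop codon, whereas allowing '*' preserves it at the cost of a schema change.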

We have observed that some search engines (X!Tandem in this case) allow '*' 
within the peptide or protein sequence, causing a validation error after 
conversion.

Original issue reported on code.google.com by andrewro...@googlemail.com on 4 Mar 2015 at 10:37