topdownproteomics / ProteoformNomenclatureStandard

ProForma, a Proteoform Notation Standard
https://topdownproteomics.github.io/ProteoformNomenclatureStandard/

Minor comments - AA ambiguity #13

Closed jscottrell closed 6 years ago

jscottrell commented 7 years ago

Section 3 Rule 1: Maybe J should be allowed since I/L ambiguity is expected in any sequence characterised solely by MS.

Section 3 Rule 5: (Also Section 5) The Unimod 'PSI-MS Name' is the preferred name. The 'Interim name' should only be used when the PSI-MS Name is empty.

stefanks commented 7 years ago

jscottrell, we added these two items to the agenda for our next meeting, and will make the appropriate changes once we discuss them. Thank you!

trishorts commented 7 years ago

We haven't really touched ambiguity in version 1.0. For round one we discussed the requirement that when we call a proteoform, we know the amino acid sequence. We definitely have to tackle this at some point, because almost no one ever knows the end-to-end sequence unequivocally. The amino acid codes B and Z in UniProt have caused us numerous headaches in bottom-up. I guess 'X' is also there. Here is the UniProt table. I don't see J, but I understand the case for it.

                6.1  Composition in percent for the complete database

                Ala (A) 8.17   Gln (Q) 3.95   Leu (L) 9.67   Ser (S) 6.62
                Arg (R) 5.50   Glu (E) 6.74   Lys (K) 5.87   Thr (T) 5.34
                Asn (N) 4.07   Gly (G) 7.04   Met (M) 2.41   Trp (W) 1.09
                Asp (D) 5.42   His (H) 2.28   Phe (F) 3.88   Tyr (Y) 2.93
                Cys (C) 1.40   Ile (I) 5.94   Pro (P) 4.74   Val (V) 6.82

                Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.00
jscottrell commented 7 years ago

Although J isn't in the IUPAC standard, NCBI nr currently contains 35,630 J residues:

Residue  Frequency
A        4192301557
B        51330
C        545892333
D        2478417080
E        2788235552
F        1731414549
G        3313686996
H        1003343003
I        2485386623
J        35630
K        2209498580
L        4482982267
M        1043588980
N        1732651536
O        261
P        2236526114
Q        1762729851
R        2608941046
S        3045137327
T        2527536348
U        14385
V        3108487093
W        586662197
X        10248364
Y        1305860474
Z        18233
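To put the J count in proportion, the frequencies above can be turned into shares of the whole database. This is only an illustrative sketch: the numbers are copied from the counts quoted in this comment, and the names `NR_COUNTS` and `share` are hypothetical helpers, not part of any NCBI tool.

```python
# Residue counts for NCBI nr as quoted in this comment (not re-queried).
NR_COUNTS = {
    "A": 4192301557, "B": 51330, "C": 545892333, "D": 2478417080,
    "E": 2788235552, "F": 1731414549, "G": 3313686996, "H": 1003343003,
    "I": 2485386623, "J": 35630, "K": 2209498580, "L": 4482982267,
    "M": 1043588980, "N": 1732651536, "O": 261, "P": 2236526114,
    "Q": 1762729851, "R": 2608941046, "S": 3045137327, "T": 2527536348,
    "U": 14385, "V": 3108487093, "W": 586662197, "X": 10248364,
    "Y": 1305860474, "Z": 18233,
}

def share(codes):
    """Fraction of all residues in NR_COUNTS that use any of the given codes."""
    total = sum(NR_COUNTS.values())
    return sum(NR_COUNTS[code] for code in codes) / total

if __name__ == "__main__":
    print(f"J alone:         {share('J'):.3e}")
    print(f"B, J, X and Z:   {share('BJXZ'):.3e}")
    print(f"I and L (exact): {share('IL'):.3e}")
```

Comparing the J share with the combined I and L share shows how rarely the ambiguity is actually recorded in the database, even though it is expected for any sequence characterised solely by MS.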



hollenstein commented 7 years ago

Hi all,

I've been wondering whether this could be solved by using preceding tags, as introduced in Rule 6. In this case one could define a tag that assigns a meaning to one of the remaining single letters, e.g. "J is ambiguous for I and L". However, I'm not sure whether this is what you have in mind with regard to simplicity.

trishorts commented 7 years ago

moved discussion of rule 3 to new issue

stefanks commented 7 years ago

Ambiguity is not currently part of the standard, but I believe it is a good idea to transparently allow AA-level ambiguity without compromising the standard and without introducing new notation. For ambiguous AA, see https://en.wikipedia.org/wiki/Proteinogenic_amino_acid

B: Asparagine or aspartic acid
J: Leucine or isoleucine
X: Unknown
Z: Glutamic acid or glutamine
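As a rough illustration of what these codes mean for software reading a sequence, the sketch below expands B, J and Z into every concrete sequence they could stand for. The names `AMBIGUOUS` and `expand_ambiguous` are hypothetical, and rejecting X rather than expanding it is an assumption made here for simplicity; none of this is taken from the standard itself.

```python
from itertools import product

# Ambiguity codes as listed above. X (unknown) is deliberately left out.
AMBIGUOUS = {
    "B": "ND",  # asparagine or aspartic acid
    "J": "LI",  # leucine or isoleucine
    "Z": "EQ",  # glutamic acid or glutamine
}

def expand_ambiguous(sequence):
    """Return every concrete sequence consistent with the B/J/Z codes."""
    if "X" in sequence:
        # Assumption: an unknown residue is treated as an error, not expanded.
        raise ValueError("X (unknown) is not expanded in this sketch")
    # Each position contributes either its own letter or its two alternatives.
    choices = [AMBIGUOUS.get(residue, residue) for residue in sequence]
    return ["".join(combo) for combo in product(*choices)]

if __name__ == "__main__":
    print(expand_ambiguous("PEJ"))
    print(expand_ambiguous("BJ"))
```

The number of expansions doubles with each ambiguous position, which is one reason to keep the single-letter code in the notation rather than enumerating alternatives.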

acesnik commented 6 years ago

Thank you for this discussion, all. I am going to close this issue, since it was addressed in Rule 1 of the ProForma standard (published here).

Namely, we allowed J, B, and Z to be used. We also allowed U to denote selenocysteine and O to denote pyrrolysine. We forbade X because it is used for undetermined amino acids, whereas ProForma is intended to annotate nearly or fully characterized proteoforms.
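A minimal sketch of how a parser might check a bare amino acid sequence against the letters described in this comment. It assumes the allowed set is the 20 standard residues plus J, B, Z, U and O, with X rejected; it is based only on the summary above, ignores modification tags entirely, and the name `invalid_residues` is hypothetical.

```python
# The 20 standard amino acid codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
# Extra codes this comment says are allowed: ambiguity codes J, B, Z,
# plus U (selenocysteine) and O (pyrrolysine). X is intentionally absent.
EXTRA = set("BJZUO")
ALLOWED = STANDARD | EXTRA

def invalid_residues(sequence):
    """Return (1-based position, character) for every disallowed character."""
    return [
        (position, char)
        for position, char in enumerate(sequence, start=1)
        if char not in ALLOWED
    ]

if __name__ == "__main__":
    print(invalid_residues("PEPTJDE"))  # J is accepted
    print(invalid_residues("PEPXIDE"))  # X is reported
```

Returning positions instead of a plain True/False makes it easier to tell the user which residue needs to be resolved before the sequence can be written down.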