topdownproteomics / ProteoformNomenclatureStandard

ProForma, a Proteoform Notation Standard
https://topdownproteomics.github.io/ProteoformNomenclatureStandard/

Notating Localization Ambiguity in ProForma #30

Open acesnik opened 6 years ago

acesnik commented 6 years ago

Overview

We should build a way to specify ambiguity of localization in ProForma:

  1. Modifications localized to one of several amino acid options (e.g., a phosphorylation on a T, S, or Y in a proteoform)
  2. Regions of ambiguity, such as an unidentified mass on a fragment
  3. Ambiguous localization along a whole proteoform sequence (see also #21)

Proposal 1: Four new keys

Add four new keys to specify ambiguity:

  1. #, noting one of several sites that may be assigned a single modification
  2. ->, noting the left boundary of a range of the sequence to which a modification may be localized
  3. <-, noting the right boundary of such a range
  4. <->, noting a modification has ambiguous localization along a whole proteoform sequence, used before the first amino acid of the sequence

The value of each key-value pair is a unique string that groups the ambiguous localization sites.

Examples:

  1. PROT[Phospho|#:eg]EOSFORMS[#:eg]

    • This sequence has one phosphorylation with ambiguous localization on either T4 or S12.
    • Note S7 is excluded from this group, e.g., by an identified internal fragment, so it carries no tag.
  2. PROT[mass:19|->:A]EOSFORMS[<-:A]

    • This sequence has a modification with ambiguous localization across a range.
    • Note S7 is included in this group.
    • The value A shared by the two key-value pairs groups the tags. This allows overlapping regions to be disambiguated.
  3. [mass:19|Phospho|<->:]PROTEOSFORMS

    • Some number of modifications are completely unlocalized, e.g., when observed by MS1 only.
    • The value of descriptors with the key "<->" can be any string, since the groupings are not important. (The trailing colon is kind of ugly and is addressed in Proposal 3.)
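As a rough illustration of how a parser might consume the Proposal 1 keys, here is a minimal Python sketch. The names `parse_tags` and `ambiguity_groups` are hypothetical and not part of ProForma or any existing library. The sketch assumes single-letter uppercase residues and tags without nested brackets, and it numbers residues from 1 (position 0 means a tag placed before the first residue).

```python
import re
from collections import defaultdict

# Either a bracketed tag or a single uppercase residue letter.
TOKEN = re.compile(r"\[([^\]]*)\]|([A-Z])")

# The four keys suggested in Proposal 1.
AMBIGUITY_KEYS = ("#", "->", "<-", "<->")


def parse_tags(proforma):
    """Return (bare_sequence, {position: [descriptor, ...]})."""
    residues = []
    tags = defaultdict(list)
    for match in TOKEN.finditer(proforma):
        tag, residue = match.group(1), match.group(2)
        if residue is not None:
            residues.append(residue)
        else:
            # A tag applies to the residue just before it (0 = prefix tag).
            tags[len(residues)].extend(tag.split("|"))
    return "".join(residues), dict(tags)


def ambiguity_groups(proforma):
    """Collect ambiguity descriptors into {(key, group_id): [positions]}."""
    _, tags = parse_tags(proforma)
    groups = defaultdict(list)
    for position in sorted(tags):
        for descriptor in tags[position]:
            key, sep, value = descriptor.partition(":")
            if sep and key in AMBIGUITY_KEYS:
                groups[(key, value)].append(position)
    return dict(groups)


print(ambiguity_groups("PROT[mass:19|->:A]EOSFORMS[<-:A]"))
# {('->', 'A'): [4], ('<-', 'A'): [12]}
```

Because the group value is part of the dictionary key, two overlapping ranges with different values (say A and B) would stay separate in the result.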

Proposal 2: Special prefixes and suffixes

This proposal places more emphasis on human readability for annotating localization ambiguity and less emphasis on continuing the key:value structure from the first version of ProForma.

Add four special strings to group ambiguity. These are followed or preceded by a unique string grouping the ambiguous localization sites.

  1. "#" as a prefix
  2. "->" as a suffix
  3. "<-" as a prefix
  4. "<->" alone

Examples:

  1. PROT[Phospho|#eg]EOSFORMS[#eg]
  2. PROT[mass:19|A->]EOSFORMS[<-A]
  3. [mass:19|Phospho|<->]PROTEOSFORMS

Note: There are currently no Unimod entries that contain these special prefixes, suffixes, or standalone strings, but if one were introduced, it would cause a collision.
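To make the prefix/suffix rules and the collision concern concrete, here is a minimal Python sketch of how a single Proposal 2 descriptor might be classified. The function name `classify_descriptor` and the returned labels are hypothetical, chosen only for illustration.

```python
def classify_descriptor(descriptor):
    """Map one Proposal 2 descriptor to (kind, group_id)."""
    # "<->" must be checked first: it also starts with "<-" and ends with "->".
    if descriptor == "<->":
        return ("unlocalized", None)
    if descriptor.startswith("<-"):
        return ("range_end", descriptor[2:])
    if descriptor.endswith("->"):
        return ("range_start", descriptor[:-2])
    if descriptor.startswith("#"):
        return ("site", descriptor[1:])
    # Anything else is treated as an ordinary descriptor (name, mass, ...).
    return ("other", descriptor)


print([classify_descriptor(d) for d in "mass:19|A->".split("|")])
# [('other', 'mass:19'), ('range_start', 'A')]
```

Under these rules a modification name that happened to begin with "#" or "<-", or end with "->", would be read as an ambiguity marker rather than as a name, which is the collision described in the note.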

Proposal 3: New keys and one special string

This proposal is a compromise between the two previous proposals: it keeps the key:value structure of Proposal 1 for site and range ambiguity, but uses the standalone special string of Proposal 2 for annotating unlocalized (global) modifications.

Add three new keys to specify ambiguity:

  1. #, noting one of several sites that may be assigned a single modification
  2. ->, noting the left boundary of a range
  3. <-, noting the right boundary of a range

Add one special string to specify unlocalized modifications:

  1. <->, noting a modification has ambiguous localization along a whole proteoform sequence, used before the first amino acid of the sequence

The value of each key-value pair is a unique string that groups the ambiguous localization sites.

Examples:

  1. PROT[Phospho|#:eg]EOSFORMS[#:eg]
  2. PROT[mass:19|->:A]EOSFORMS[<-:A]
  3. [mass:19|Phospho|<->]PROTEOSFORMS

Example 3 differs from Proposal 1 by dropping the colon character.
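Since the only syntactic difference is the colon (and any value after it) on the "<->" descriptor, a Proposal 1 string could be rewritten into Proposal 3 form mechanically. The following Python sketch shows one way to do that; the function name `proposal1_to_proposal3` is hypothetical, and the pattern assumes descriptors are separated by "|" inside square brackets.

```python
import re

# "<->:" plus any value, but only at the start of a descriptor
# (right after "[" or "|"); the value runs up to the next "|" or "]".
_GLOBAL_KEY = re.compile(r"(?<=[\[|])<->:[^|\]]*")


def proposal1_to_proposal3(proforma):
    """Replace the '<->:value' key-value pair with the bare '<->' string."""
    return _GLOBAL_KEY.sub("<->", proforma)


print(proposal1_to_proposal3("[mass:19|Phospho|<->:]PROTEOSFORMS"))
# [mass:19|Phospho|<->]PROTEOSFORMS
```

The "#", "->", and "<-" descriptors are left untouched, since they are written the same way in both proposals.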

acesnik commented 6 years ago

Maybe that was a bit much to start discussion. Here are the main questions:

  1. Do you like using "#" and the arrows to specify ambiguity?
  2. Do you want to keep building on the key:value structure (rules are easier to amend; Proposals 1 and 3), or do you want to use special prefixes and suffixes (better human readability; Proposal 2)?