How to create own Modifications with specific location?

library("unimod") umodName("Acetyl") # or umodId(1) # - General: # Modification version : 0.0.1 # Accession number/id : 1 # PSI-MS/Interim Name : Acetyl # Description : Acetylation # Composition : H(2) C(2) O(1) # Delta Average Mass : 42.0367 # Delta Monoisotopic Mass: 42.010565 # Approved : TRUE # - Specificity: # site position classification hidden group # 1 K Anywhere Multiple FALSE 1 # 2 N-term Any N-term Multiple FALSE 2 # 3 C Anywhere Post-translational TRUE 3 # 4 S Anywhere Post-translational TRUE 4 # 5 N-term Protein N-term Post-translational FALSE 5 # 6 T Anywhere Post-translational TRUE 6 # 7 Y Anywhere Chemical derivative TRUE 7 # 8 H Anywhere Chemical derivative TRUE 8 # - References: use 'references(object)'

There are attempts to create a nomenclature for proteoforms https://github.com/topdownproteomics/proteoform-nomenclature-standard (tl;dr the modification is written in square brackets behind the amino acid).

e.g. Carbamidomethyl at the first C:

AC[Unimod:4]E

an unspecific (not in a database) mass shift at the second amino acid:

AC[mass:+16]E

modification on n-term:

[mass:+16]-ACE

In our Modification class or for calculateFragments we have to handle at least these five conditions:

global modifications, like: Carbamidomethyl, which replaces all C with C + 57.02146
local modifications, like: Phosphorylation on the second Serin in the sequence
modification on the n/c-term (protein)
modification on the n/c-term (any)
neutral loss

2 and 3 are handled by the current version of the standard.

For 1. and 5. there are some issues (https://github.com/topdownproteomics/proteoform-nomenclature-standard/issues/21, https://github.com/topdownproteomics/proteoform-nomenclature-standard/issues/18, but I don't expect that any of them will be integrated because it is a little bit out of scope).

We could extend the mentioned standard, e.g.

for global modifications: use a prefix, e.g. C[Unimod:4]>ACE
use same notation as for 3. (would be only interesting for result presentation of calculateFragments)
use * for ammonia and _ for water loss (would be only interesting for result presentation of calculateFragments), e.g. ACE*

A user could call calculateFragments in the following way:

calculateFragments("[Unimod]+C[4]>CS_E_QU[mass:+16]E_NCE_]")
# the unimod package would collect the mass for C

Which would be equivalent to the current

calculateFragments("CSEQUENCE", modifications=c(C=57.02146, U=16), 
                   neutraLoss=list(water=c("Cterm", "D", "E", "S", "T"),
                                   ammonia=c("K", "N", "Q", "R")))

For convenience we could keep the neutralLoss argument (otherwise the sequence would be really destroyed by many _, * for possible neutral loss positions).

While I think that would a great interface for the user if he only wants to calculate fragments for just a single protein sequence but it would be nearly impossible to do this for batch processing of mzML + mzID files.

In summary it seems not suitable for our intention.

rformassspectrometry / unimod

How to create own Modifications with specific location? #2