rformassspectrometry / unimod

Amino acid modifications for mass spectrometry
6 stars 0 forks source link

How to create own Modifications with specific location? #2

Open sgibb opened 7 years ago

sgibb commented 7 years ago

This issue is a follow-up to https://github.com/lgatto/MSnbase/issues/167.

@adder asked for user-defined locations of the Modification:

You also have to be able to specify where the variable modificiation is positioned. Eg. if you have a sequence with 2 serines. You can have a Phospo group on the first but not on the second serine.

unimod.org doesn't support specific locations. Instead they have a position argument that could be: "Anywhere", "Any N-term", "Any C-term", "N-term", "C-term", "Protein N-term", "Protein C-term".

library("unimod")
umodName("Acetyl") # or umodId(1)
# - General:
#   Modification version   :          0.0.1
#   Accession number/id    :              1
#   PSI-MS/Interim Name    :         Acetyl
#   Description            :    Acetylation
#   Composition            : H(2) C(2) O(1)
#   Delta Average Mass     :        42.0367
#   Delta Monoisotopic Mass:      42.010565
#   Approved               :           TRUE 
# - Specificity:
#     site       position      classification hidden group
# 1      K       Anywhere            Multiple  FALSE     1
# 2 N-term     Any N-term            Multiple  FALSE     2
# 3      C       Anywhere  Post-translational   TRUE     3
# 4      S       Anywhere  Post-translational   TRUE     4
# 5 N-term Protein N-term  Post-translational  FALSE     5
# 6      T       Anywhere  Post-translational   TRUE     6
# 7      Y       Anywhere Chemical derivative   TRUE     7
# 8      H       Anywhere Chemical derivative   TRUE     8
# - References: use 'references(object)'

I am currently not sure how to fulfill @adder's feature request best. The specificity slot is a data.frame that has a column position of type character. We could add another column, e.g. index as numeric or the user has to supply the position number as character (or we cast it).

I am planning to add a seq2mass(sequence, modificitions) or calculateMass(sequence, modifications) or just mass(sequence, modifications) function that would do the following:

  1. Split the amino acid sequence into single letter character.
  2. Look up the mass of these amino acids in the aminoacid data.frame (see #1).
  3. Check the specificity slot (site and position column) if the mass has to be modified.
  4. Sum the mass up

This function should replace the mass calculation in MSnbase::calculateFragments.

Any suggestions regarding the user-defined modification positions or something else?

sgibb commented 7 years ago

There are attempts to create a nomenclature for proteoforms https://github.com/topdownproteomics/proteoform-nomenclature-standard (tl;dr the modification is written in square brackets behind the amino acid).

e.g. Carbamidomethyl at the first C:

AC[Unimod:4]E

an unspecific (not in a database) mass shift at the second amino acid:

AC[mass:+16]E

modification on n-term:

[mass:+16]-ACE

In our Modification class or for calculateFragments we have to handle at least these five conditions:

  1. global modifications, like: Carbamidomethyl, which replaces all C with C + 57.02146
  2. local modifications, like: Phosphorylation on the second Serin in the sequence
  3. modification on the n/c-term (protein)
  4. modification on the n/c-term (any)
  5. neutral loss

2 and 3 are handled by the current version of the standard.

For 1. and 5. there are some issues (https://github.com/topdownproteomics/proteoform-nomenclature-standard/issues/21, https://github.com/topdownproteomics/proteoform-nomenclature-standard/issues/18, but I don't expect that any of them will be integrated because it is a little bit out of scope).

We could extend the mentioned standard, e.g.

  1. for global modifications: use a prefix, e.g. C[Unimod:4]>ACE
  2. use same notation as for 3. (would be only interesting for result presentation of calculateFragments)
  3. use * for ammonia and _ for water loss (would be only interesting for result presentation of calculateFragments), e.g. ACE*

A user could call calculateFragments in the following way:

calculateFragments("[Unimod]+C[4]>CS_E_QU[mass:+16]E_NCE_]")
# the unimod package would collect the mass for C

Which would be equivalent to the current

calculateFragments("CSEQUENCE", modifications=c(C=57.02146, U=16), 
                   neutraLoss=list(water=c("Cterm", "D", "E", "S", "T"),
                                   ammonia=c("K", "N", "Q", "R")))

For convenience we could keep the neutralLoss argument (otherwise the sequence would be really destroyed by many _, * for possible neutral loss positions).

While I think that would a great interface for the user if he only wants to calculate fragments for just a single protein sequence but it would be nearly impossible to do this for batch processing of mzML + mzID files.

In summary it seems not suitable for our intention.