Open sgibb opened 7 years ago
There are attempts to create a nomenclature for proteoforms https://github.com/topdownproteomics/proteoform-nomenclature-standard (tl;dr the modification is written in square brackets behind the amino acid).
e.g. Carbamidomethyl at the first C:
AC[Unimod:4]E
an unspecific (not in a database) mass shift at the second amino acid:
AC[mass:+16]E
a modification on the N-terminus:
[mass:+16]-ACE
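To make the three examples above concrete, here is a minimal base-R sketch of how such strings could be split into a plain sequence and a table of tags. The function name `parseProteoform` and the returned layout are assumptions made for illustration; they are not part of the nomenclature standard or of MSnbase, and the sketch only covers the three forms shown above.

```r
# Hypothetical helper: split a bracket-annotated string into its parts.
parseProteoform <- function(x) {
  nterm <- NA_character_
  # An N-terminal modification is written as "[tag]-" in front of the sequence.
  if (grepl("^\\[[^]]+\\]-", x)) {
    nterm <- sub("^\\[([^]]+)\\]-.*$", "\\1", x)
    x <- sub("^\\[[^]]+\\]-", "", x)
  }
  # Each token is one residue letter, optionally followed by "[tag]".
  tokens <- regmatches(x, gregexpr("[A-Z](\\[[^]]+\\])?", x))[[1]]
  residues <- substr(tokens, 1, 1)
  # Strip the residue letter and the enclosing brackets to get the tag.
  tags <- ifelse(nchar(tokens) > 1,
                 substr(tokens, 3, nchar(tokens) - 1),
                 NA_character_)
  modified <- !is.na(tags)
  list(sequence = paste(residues, collapse = ""),
       nterm = nterm,
       modifications = data.frame(index = which(modified),
                                  residue = residues[modified],
                                  tag = tags[modified],
                                  stringsAsFactors = FALSE))
}

parseProteoform("AC[Unimod:4]E")
parseProteoform("[mass:+16]-ACE")
```

The numeric `index` column records which residue carries the tag, which is the piece of information the standard expresses only through placement in the string.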
In our `Modification` class or for `calculateFragments` we have to handle at least these five conditions:
2 and 3 are handled by the current version of the standard.
For 1 and 5 there are open issues (https://github.com/topdownproteomics/proteoform-nomenclature-standard/issues/21, https://github.com/topdownproteomics/proteoform-nomenclature-standard/issues/18), but I don't expect either of them to be integrated because they are a little bit out of scope.
We could extend the mentioned standard, e.g.
C[Unimod:4]>ACE
`calculateFragments`) `*` for ammonia and `_` for water loss (would be only interesting for result presentation of `calculateFragments`), e.g. `ACE*`
A user could call `calculateFragments` in the following way:
calculateFragments("[Unimod]+C[4]>CS_E_QU[mass:+16]E_NCE_]")
# the unimod package would collect the mass for C
This would be equivalent to the current call:
calculateFragments("CSEQUENCE", modifications=c(C=57.02146, U=16),
                   neutralLoss=list(water=c("Cterm", "D", "E", "S", "T"),
                                    ammonia=c("K", "N", "Q", "R")))
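As a rough illustration of how bracket tags could be turned into the `modifications` vector used in the call above, the following sketch resolves `Unimod:<accession>` and `mass:<shift>` tags to numeric mass shifts. The lookup table `unimodMass` and the helper `tagToMass` are hypothetical stand-ins; a real implementation would query the unimod package instead, and the single table entry simply reuses the value from the call above.

```r
# Hypothetical stand-in for a lookup in the unimod package:
# accession -> mass shift (value taken from the call above).
unimodMass <- c("4" = 57.02146)

# Hypothetical helper: resolve one tag such as "Unimod:4" or "mass:+16".
tagToMass <- function(tag) {
  parts <- strsplit(tag, ":", fixed = TRUE)[[1]]
  switch(parts[1],
         Unimod = unname(unimodMass[parts[2]]),  # database lookup
         mass = as.numeric(parts[2]),            # explicit mass shift
         stop("unknown tag type: ", parts[1]))
}

# Tags keyed by the residue they are attached to.
tags <- c(C = "Unimod:4", U = "mass:+16")
modifications <- vapply(tags, tagToMass, numeric(1))
modifications
```

The resulting named vector has the same shape as the `modifications` argument in the call above, so it could be passed on unchanged.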
For convenience we could keep the `neutralLoss` argument (otherwise the sequence would be really destroyed by many `_`, `*` for possible neutral loss positions).
While I think that would be a great interface for a user who only wants to calculate fragments for a single protein sequence, it would be nearly impossible to use for batch processing of mzML + mzID files.
In summary, it does not seem suitable for our purposes.
This issue is a follow-up to https://github.com/lgatto/MSnbase/issues/167.
@adder asked for user-defined locations of the Modification:
unimod.org doesn't support specific locations. Instead they have a position argument that could be:
`"Anywhere", "Any N-term", "Any C-term", "N-term", "C-term", "Protein N-term", "Protein C-term"`.
I am currently not sure how best to fulfill @adder's feature request. The `specificity` slot is a `data.frame` that has a column `position` of type `character`. We could add another column, e.g. `index` as `numeric`, or the user has to supply the position number as character (or we cast it).
I am planning to add a `seq2mass(sequence, modifications)` or `calculateMass(sequence, modifications)` or just `mass(sequence, modifications)` function that would do the following:
`character`.
`aminoacid` `data.frame` (see #1).
`specificity` slot (`site` and `position` column) if the mass has to be modified.
This function should replace the mass calculation in `MSnbase::calculateFragments`.
Any suggestions regarding the user-defined modification positions or something else?
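To make the idea of an extra `index` column more tangible, here is a minimal sketch of what such a mass function might look like. Everything in it is an assumption for illustration: the three-residue `aminoacid` table, the `delta` column holding the mass shift, and the rule that an `NA` index means "every matching site". It only handles the `"Anywhere"` position and is not the planned implementation.

```r
# Hypothetical residue-mass table (monoisotopic, in Da); three residues only.
aminoacid <- data.frame(AA = c("A", "C", "E"),
                        mass = c(71.03711, 103.00919, 129.04259),
                        stringsAsFactors = FALSE)

# Hypothetical sketch: `modifications` mirrors the specificity layout
# (site, position) plus the proposed numeric `index` and a `delta` mass shift.
calculateMass <- function(sequence, modifications) {
  residues <- strsplit(sequence, "", fixed = TRUE)[[1]]
  # Residues missing from the table yield NA and propagate to the result.
  mass <- aminoacid$mass[match(residues, aminoacid$AA)]
  for (i in seq_len(nrow(modifications))) {
    m <- modifications[i, ]
    hit <- residues == m$site
    # A non-NA index restricts the modification to that residue number.
    if (!is.na(m$index)) {
      hit <- hit & seq_along(residues) == m$index
    }
    mass[hit] <- mass[hit] + m$delta
  }
  sum(mass) + 18.01056  # add one water for the free termini
}

anywhere <- data.frame(site = "C", position = "Anywhere",
                       index = NA_real_, delta = 57.02146,
                       stringsAsFactors = FALSE)
indexed <- transform(anywhere, index = 3)

calculateMass("ACCE", anywhere)  # both C residues are modified
calculateMass("ACCE", indexed)   # only the C at position 3 is modified
```

Keeping `index` numeric avoids casting inside the function; storing the position number as character in the existing `position` column would work as well but needs a conversion step before the comparison.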