Open sgibb opened 7 years ago
I think it depends a bit on the use cases you envision. I am not convinced that several classes are really necessary here, and a data.frame
with additional rows, similarly to the table above, seems like it could do the trick.
There could be a helper function to create new modification as new rows in the data.frame
, so that the user doesn't need to create them manually - that function could do some checks, fill out values that can be calculated automatically, ...
Maybe you are right and we should not overcomplicate things. We could avoid a class completely.
I converted all the unimod entries into a data.frame
(currently without the NeutralLoss and reference information) and beside a lot of data duplication it takes just around 1 MB of memory:
head(d)
# id name description lastModified approved avgMass monoMass
# 1 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.1 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.2 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.3 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.4 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# 1.5 1 Acetyl Acetylation 2008-02-15 05:20:02 1 42.0367 42.010565
# composition site position classification hidden group
# 1 H(2)C(2)O(1) K Anywhere Multiple FALSE 1
# 1.1 H(2)C(2)O(1) N-term Any N-term Multiple FALSE 2
# 1.2 H(2)C(2)O(1) C Anywhere Post-translational TRUE 3
# 1.3 H(2)C(2)O(1) S Anywhere Post-translational TRUE 4
# 1.4 H(2)C(2)O(1) N-term Protein N-term Post-translational FALSE 5
# 1.5 H(2)C(2)O(1) T Anywhere Post-translational TRUE 6
> dim(d)
# [1] 2370 13
> print(object.size(d), units="Kb")
# 908.1 Kb
By converting some of the columns into factor
and Rle
, removing some useless columns (lastModification
, group
) and adding around 1000 rows because of incorporating NeutralLoss information that would change a bit. Nevertheless I think it would be acceptable to store the whole unimod.xml
in a data.frame
(and keep it with the amino acid and element information in data
). In that case we could also move xml2
from Depends
to Suggests
. The reference information is IMHO negligible. If anybody wants to know where a modification was described/published he could look it up at http://unimod.org.
Instead of a Modification
class there could be a simple function that creates a modification data.frame
for calculateFragments
, etc. This function could look up the unimod information in the unimod
data.frame
.
I think it's good to keep things as simple as possible, at least in a first stage. If necessary, it's possible to encapsulate the data in a class of the need becomes clear.
There are three data.frame
s in the /data
directory now (containing all information from uniprot except the references and notes):
library("unimod")
data("elements")
head(elements)
Name | FullName | AvgMass | MonoMass | |
---|---|---|---|---|
H | H | Hydrogen | 1.007940 | 1.007825 |
2H | 2H | Deuterium | 2.014102 | 2.014102 |
Li | Li | Lithium | 6.941000 | 7.016003 |
C | C | Carbon | 12.010700 | 12.000000 |
13C | 13C | Carbon13 | 13.003355 | 13.003355 |
N | N | Nitrogen | 14.006700 | 14.003074 |
data("aminoacids")
head(aminoacids)
OneLetter | ThreeLetter | FullName | AvgMass | MonoMass | H | C | N | O | S | Se | |
---|---|---|---|---|---|---|---|---|---|---|---|
- | - | 0.0000 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | ||
A | A | Ala | Alanine | 71.0779 | 71.03711 | 5 | 3 | 1 | 1 | 0 | 0 |
R | R | Arg | Arginine | 156.1857 | 156.10111 | 12 | 6 | 4 | 1 | 0 | 0 |
N | N | Asn | Asparagine | 114.1026 | 114.04293 | 6 | 4 | 2 | 2 | 0 | 0 |
D | D | Asp | Aspartic acid | 115.0874 | 115.02694 | 5 | 4 | 1 | 3 | 0 | 0 |
C | C | Cys | Cysteine | 103.1429 | 103.00919 | 5 | 3 | 1 | 1 | 1 | 0 |
data("modifications")
head(modifications)
Id | Name | Description | Composition | AvgMass | MonoMass | Site | Position | Classification | SpecGroup | LastModified | Approved | Hidden |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | K | Anywhere | Multiple | 1 | 2017-11-08 16:08:56 | TRUE | FALSE |
1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | N-term | Any N-term | Multiple | 2 | 2017-11-08 16:08:56 | TRUE | FALSE |
1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | C | Anywhere | Post-translational | 3 | 2017-11-08 16:08:56 | TRUE | TRUE |
1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | S | Anywhere | Post-translational | 4 | 2017-11-08 16:08:56 | TRUE | TRUE |
1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | N-term | Protein N-term | Post-translational | 5 | 2017-11-08 16:08:56 | TRUE | FALSE |
1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | T | Anywhere | Post-translational | 6 | 2017-11-08 16:08:56 | TRUE | TRUE |
We could turn each of these data.frame
s into a DataFrame
(with Rle
or similar) or into a tibble
but I don't think it is necessary because the whole database is small:
print(object.size(modifications), units="KB")
# 725.4 Kb
Currently unimod
is a very small package providing just these 3 data.frame
s and has no dependencies (the hidden functions to create the data.frame
s need the xml2
package that's why it is in Suggests:
).
The aminoacids
and elements
data.frame
could replace MSnbase
's amino.acids
data.frame
and atomic.mass
vector (in R/environments.R
).
Do you like these data.frame
s and their format or should we provide something different?
For anything that is like a data.frame
, tidy tools are superior when it comes to data wrangling. Still, I don't think we need to depend on tibble
, as the conversion can be done by the user, if required. Unless, of course, we envision some sort or direct analysis ourselves where tibble
s would be a better fit.
Yes, I would suggest to use the data in MSnbase
and make use of unimod
. The latter would probably have to be submitted to Bioconductor first, though.
While the elements
and aminoacids
data.frame
s are very useful now. The
modification
data.frame
is more or less useless. E.g. in topdownr
we
support 3 modifications (Carbamidomethyl, Acetyl, Met-loss; unimod id 4, 1,
765).
library("unimod")
data("modifications")
subset(modifications, Id %in% c(1, 4, 765) & Classification != "Artefact")
Id | Name | Description | Composition | AvgMass | MonoMass | Site | Position | Classification | SpecGroup | LastModified | Approved | Hidden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | K | Anywhere | Multiple | 1 | 2017-11-08 16:08:56 | TRUE | FALSE |
2 | 1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | N-term | Any N-term | Multiple | 2 | 2017-11-08 16:08:56 | TRUE | FALSE |
3 | 1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | C | Anywhere | Post-translational | 3 | 2017-11-08 16:08:56 | TRUE | TRUE |
4 | 1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | S | Anywhere | Post-translational | 4 | 2017-11-08 16:08:56 | TRUE | TRUE |
5 | 1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | N-term | Protein N-term | Post-translational | 5 | 2017-11-08 16:08:56 | TRUE | FALSE |
6 | 1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | T | Anywhere | Post-translational | 6 | 2017-11-08 16:08:56 | TRUE | TRUE |
7 | 1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | Y | Anywhere | Chemical derivative | 7 | 2017-11-08 16:08:56 | TRUE | TRUE |
8 | 1 | Acetyl | Acetylation | H(2) C(2) O | 42.0367 | 42.01056 | H | Anywhere | Chemical derivative | 8 | 2017-11-08 16:08:56 | TRUE | TRUE |
14 | 4 | Carbamidomethyl | Iodoacetamide derivative | H(3) C(2) N O | 57.0513 | 57.02146 | C | Anywhere | Chemical derivative | 1 | 2017-10-09 10:27:10 | TRUE | FALSE |
23 | 4 | Carbamidomethyl | Iodoacetamide derivative | H(3) C(2) N O | 57.0513 | 57.02146 | U | Anywhere | Chemical derivative | 10 | 2017-10-09 10:27:10 | TRUE | TRUE |
24 | 4 | Carbamidomethyl | Iodoacetamide derivative | H(3) C(2) N O | 57.0513 | 57.02146 | M | Anywhere | Chemical derivative | 11 | 2017-10-09 10:27:10 | TRUE | TRUE |
25 | 4 | Carbamidomethyl | Iodoacetamide derivative | H(7) C(3) N O S | 105.1588 | 105.02483 | M | Anywhere | Chemical derivative | 11 | 2017-10-09 10:27:10 | TRUE | TRUE |
1170 | 765 | Met-loss | Removal of initiator methionine from protein N-terminus | H(-9) C(-5) N(-1) O(-1) S(-1) | -131.1961 | -131.04048 | M | Protein N-term | Co-translational | 1 | 2007-07-15 20:01:35 | FALSE | TRUE |
While we could use the modifications
data.frame
to find the modification and
its mass difference we also need the "modification rule", e.g. for Acetylation
the "Any N-term" rule (so we add the mass if our fragment starts with the
N-terminal end of the sequence), for Carbamidomethyl the "C|U" rule
(we add the mass if "C" or "U" is present in the sequence) and for Met-loss the
"remove M at the beginning" rule (we remove "M" from the start of the sequence
and substract 131 from the peptide mass).
Here you could see the implementation in topdownr
:
.unimod1 <- function(x, s) {
i <- startsWith(s, x$seq)
x$mz[i] <- x$mz[i] + 42.010565
x
}
.unimod4 <- function(x) {
iCU <- grep("C|U", x$seq)
x$mz[iCU] <- x$mz[iCU] + 57.021464
x
}
.unimod765 <- function(x) {
gsub("^M([ACGPSTV])", "\\1", x)
}
All these rules could be written in regular expressions (one for finding the
pattern and in some cases a second one for replacement). But there is no way for
me to write nrow(modifications) == 3458
regular expressions.
Unfortunately sometimes the rule could not be predicted from the Site
and
Position
column because the details are written in special notes, e.g. for
Met-loss:
N-terminal initiator methionine is removed by a methionine aminopeptidase from proteins where the residue following the methionine is Ala, Cys, Gly, Pro, Ser, Thr or Val. This is generally the final N-terminal state for proteins where the following residue was a Cys, Pro or Val.
But often the notes just contain less useful information (that's why the notes
are not included in the data.frame
):
"observed in monoclonal antibodies" "Covalently bound structure in Manglik et al., Fig. 1b-Fig1c. Chemical formula % Sigma catalog entry." "Triton X-114" "GEE (glycine ethyl ester) is a substrate for the enzyme Factor XIII for cross-linking to fibrinogen" ...
I would like to have an interface like calculatePeptideMass(peptideSequence, fixedModifications=unimodIds, variableModifications=unimodIds, neutralLoss=TRUE)
that could be used in MSnbase::calculateFragements
and similar functions.
(Would be great to have impact from mass spec users here for a better interface regarding the fixed/variable modifications.)
What we could do: Writing the regular expressions for 3-10 often used
modifications and set everything else as NA
. If somebody wants to use a
modification with NA
pattern he would get a message to open an issue on github
for implementing this rule.
Alternatively we just provide the modification data.frame
with the delta mass
(and remove the other columns) and let the user implement the rule himself (as I
did in topdownr
). This would at least reduce the need for hardcoding the
delta mass.
What we could do: Writing the regular expressions for 3-10 often used modifications and set everything else as
NA
. If somebody wants to use a modification withNA
pattern he would get a message to open an issue on github for implementing this rule.
I like this approach because it makes the package useful for what you need right now without overwhelming you with tons of unnecessary stuff, but allows users to extend or asks for useful extensions.
I implemented the first prototype of a function to calculate the mass for peptides and allow fixed custom and unimod modifications (the unimod modifications are used by their short names, colon, site):
library("unimod")
unimod:::.mass("MACE",
fixedModifications=c("Acetyl:N-term",
"Carbamidomethyl:C"))
# [1] 533.1614
# attr(,"sequence")
# [1] "MACE"
unimod:::.mass("MACE",
fixedModifications=c("Met-loss:P-M",
"Acetyl:N-term",
"Carbamidomethyl:C"))
# [1] 402.1209
# attr(,"sequence")
# [1] "ACE"
unimod:::.mass(c("ACE", "MACE", "CDE"),
fixedModifications=c("Met-loss:P-M",
"Acetyl:N-term",
"Carbamidomethyl:C"))
# [1] 402.1209 402.1209 446.1107
# attr(,"sequence")
# [1] "ACE" "ACE" "CDE"
unimod:::.mass(c("ACE", "MACE", "CDE"), fixedModifications="Unknown:420:N-term")
# [1] 723.1397 854.1802 767.1296
# attr(,"sequence")
# [1] "ACE" "MACE" "CDE"
#
# Applying the default rule for the modification: Unknown:420:N-term
# Please create an issue on: https://github.com/ComputationalProteomicsUnit/unimod/issues/new
# to let us implement the correct rule or if the default one is already correct we could remove
# this message.
unimod:::.mass(c("ACE", "MACE", "CDE"),
fixedModifications=data.frame(
Id=c("MyModification1",
"MyModification2"),
Site=c("C", "D"),
MonoMass=c(57, 58),
stringsAsFactors=FALSE))
# [1] 360.0889 491.1294 462.0787
# attr(,"sequence")
# [1] "ACE" "MACE" "CDE"
I am going to implement the variable modifications next. @pavel-shliaha, @adder, @yafeng any suggestion for the interface?
Currently I am thinking an additional argument named variableModifications
that takes a data.frame
with the columns Id, Site (Aminoacid), Location (Position in the peptide chain), DeltaMass would be sufficient.
Does anyone have a good suggestion for a name for this function? I don't like names that contain more or less useless verbs calculateMass
, determineMass
, getMass
(I know we have MSnbase::calculateFragments
; I am not sure why we not simply used fragments
that time?!). That's why I vote for mass
(but this is very generic).
If you want mass
, you might need to consider a method mass,character
and use the generic in ProtGenerics
. Otherwise, what about pepmass
, to get the mass of a peptides (with optional fixed or variable modifications passed as arguments).
pepmass
is much more specific. Thanks. I guess the bioc reviewer will "force" me to provide a method for AAString
, AAStringSet
and AAStringSetList
anyway (they did so for the cleave
method in cleaver
). So pepmass,character
, pepmass,AAString
etc. would be fine.
Hey, Dataframes sound ok for me. I also usually represent these type of objects as dataframes with in my code, it's sufficiently flexible. It fits nicely with my mainly dplyr/tidyverse oriented workflow :)
Mosty tricky thing is probably specyfing terminal modifiations. Maybe position 0 for N-terminus and length(pep)+1 for C-terminus?
Regarding the function name. If mass
is to general, you could also call it peptide_mass
.
Ok, I was to slow with my comments :) Sorry
Let's not get into the CamelCam vs snake_case vs alllowsercase debate ;-)
Surely we all agree not to use ALLUPPERCASE.
I like pepmass
because it calculates the mass for peptides (ok it could be a protein as well). I assume we need to create methods because we should support character
and AAString*
.
I actually wondering what should happen if a variable modification and a fixed modification hit the same site:
I'm not an expert in these matters and I can't supply real biological examples right now but I would say that both should be applied by default. In the case that both can be applied, you allow for this. If the fixed modification blocks the variable modification, it's up to the user to correctly specify the variable modification by not allowing it on the same site that the fixed is on.
A difficult one is the case that the variable blocks a fixed modification, I guess an option variable_only = TRUE
could help here. (the default of this option would be FALSE then)
I'm not sure if this is a problem in a real example but what happens if you have 2 variable modifications that can be on the same site?
I suppose it depends whether the modifications can co-occur or whether they compete. I would say that it is the user's responsibility to make sure that sites undergo only a single modification (whether variable of fixed); if > 1 modifications are provided, I think we should consider all possibilities: mod1 only, mod2 only, mod1 and 2, or none, if both are variable.
I could ask in the lab if this is an issue in practice.
I think that 2 modifications can co-exist in principle (chemically), but I have never seen 2 modifications reported on the same residue. I think if you want them both then just create a new modification that contains both of them. Maybe make a function that combines them. And fixed modification should beat variable modification. This is just my opinion of course
The only real example I can think of is trimethylation of lysine. There are 3 modifications.
1) K methyl 14.015650 2) K dimethyl 28.031300 3) K trimethyl 42.046950
the mass of Kme3 modification is exactly identical to Kme2 + Kme1, however I have never seen an identification with 2 modifications Kme2 + Kme1. All modifications Kme3 are reported as Kme3.
I agree you should store NL as a column of a dataframe.
see below my email exchange with people who work on simultaneous modifications in Mascot. They say mascot does not put 2 modifications on the same reisude
Dear Pavel,
No, Mascot will never suggest two simultaneous modifications on the same residue. In such a case it would try to allocate one of the modifications to a different residue if another possible target is present in the peptide. If you expect to see this, you should specify it as a separate modification.
Best, Tina
From: Pavel V. Shliaha [mailto:pavels@bmb.sdu.dk] Sent: 25. januar 2018 13:42 To: Tina Nybo; Adelina Rogowska-Wrzesinska Subject: 2 modifications on the same residue
Dear Tina and Adelina,
I know you work with some very weird modifications in oxidation field and hence I wanted to ask if you have ever come across residues that could be modified in 2 places, e.g. oxidation + chlorination. If so how do you handle that in a database search. Do you specify modifications separately as dynamic and mascot knows a combination on a single residue is possible or do you create a new modification that contains both the (say chlorination + oxidation?)
Pavel
So, bottom line is that search engine don't seem to support multiple modification. I suggest that if such a cases arises, to calculate the masses for
I don't like the idea that a user has to create a new virtual modification composed of two individual ones. As search engines won't support this, a warning or message should then inform the user.
And to follow up from Pavel's example, trimethylation would be a single modification, of course.
@sgibb and @lgatto just a quick opinion from a more top-down perspective
1) @sgibb could you please provide an output you imagine your function will give when you submit a seqeunce and a variable modification.
unimod:::.mass("KKK", varModifications=data.frame( Id=c("acetyl"), Site=c("K"), MonoMass=c(42), stringsAsFactors=FALSE))
will it be a vector of all possible permutations of K modification masses, i.e. singly, doubly and triply acetylated? I can suggest 3 different proteoforms with identical mass KacKK, KKacK and KKKac. How will this be reflected if the output is just monoisotopic mass? (please let me know if you are open to suggestions on these points)
2) As a top-down person can I suggest you create an additional column which by default is all. I sometimes want a fixed/varible modification but not on all K, but only on a particular. E.g. I know first K is acetylated but 2nd one can only be trimethylated.
3) Just to let you know: the vast majority of modifications cannot co-exist with others. The reason for this is simple: a modification needs certain physico-chemical properties to be attached to an amino acid. However other modifications to this residue destroy these properties. Given co-existance is extremely rare I suggest not to not calculate co-existing modifications by default, but perhaps to provide an interface that allows user to say which modifications can co-exist.
Lets assume there are 5 K (KKKKK) residues, each of which can be mono-, di, trimethylated and acetylated. MS1 mass tells us there are 3 methylations and 1 acetylations Even without co-existing modifications we already have a huge space of possibilities of proteoform combinations. E.g. KmeKmeKmeKacK and Kme3KacKKK and so on. If you do consider all can co-exist the number of combinations becomes almost infinite.
@adder, @lgatto and @pavel-shliaha thanks for your great input and sorry for the delayed answer.
First I have to admit that my understanding of fixed/variable modifications was quite different. So to have everybody on the same page I would define the terms now as follows:
fixed: modification that is always present, could have two characteristics:
KmeKmeKme
KmeKK
, or metyhl at K1 and 2: KmeKmeK
variable: modification could happen at none, one, multiple or all residues without knowing the position a priori.
Currently unimod
just supports fixed/all. I am going to implement fixed/specific next.
@sgibb could you please provide an output you imagine your function will give when you submit a seqeunce and a variable modification.
unimod:::.mass("KKK", varModifications=data.frame( Id=c("acetyl"), Site=c("K"), MonoMass=c(42), stringsAsFactors=FALSE))
Current output would be:
mass | sequence | modifications |
---|---|---|
510.3166 | KKK | Acetyl:K |
(because it is KacKacKac)
will it be a vector of all possible permutations of K modification masses, i.e. singly, doubly and triply acetylated? I can suggest 3 different proteoforms with identical mass KacKK, KKacK and KKKac. How will this be reflected if the output is just monoisotopic mass? (please let me know if you are open to suggestions on these points)
As you assumed currently I just return the monoisotopic mass. So if fixed/specific modifications are available the output would be 426
for KacKK
, KKacK
and KKKac
.
Of course I am open for suggestions and discussions.
As a top-down person can I suggest you create an additional column which by default is all. I sometimes want a fixed/varible modification but not on all K, but only on a particular. E.g. I know first K is acetylated but 2nd one can only be trimethylated.
I think that is what I want to provide with the fixed/specific method.
Just to let you know: the vast majority of modifications cannot co-exist with others. The reason for this is simple: a modification needs certain physico-chemical properties to be attached to an amino acid. However other modifications to this residue destroy these properties. Given co-existance is extremely rare I suggest not to not calculate co-existing modifications by default, but perhaps to provide an interface that allows user to say which modifications can co-exist.
Good suggestion. That would be easier to implement.
Lets assume there are 5 K (KKKKK) residues, each of which can be mono-, di, trimethylated and acetylated. MS1 mass tells us there are 3 methylations and 1 acetylations Even without co-existing modifications we already have a huge space of possibilities of proteoform combinations. E.g. KmeKmeKmeKacK and Kme3KacKKK and so on. If you do consider all can co-exist the number of combinations becomes almost infinite.
I see your point. With the current implementation it doesn't matter because it will only return the monoisotopic mass. But I guess that is not the information you are interested in, or?
The current
Modification
object is a simple class that contains mainly the unimod ID, the composition and the avg/mono mass of the modification as integer/double vectors of length 1. Additionally it contains adata.frame
named specificity that stores information about the site and the position of the modification, e.g.:Some specificities have additional entries in the unimod database for neutral loss (#3). These entries have their own avg/mono mass (sometimes different from the general modification mass, e.g. Phosphorylation, id=21).
We could create a new class
NeutralLoss
that stores these information and could be attached to a specificity (which maybe should be also a class, so that we could handle different user-defined locations easier; see #2). But before creating two new classes I like to ask whether anyone has a better idea? Maybe we overcomplicate things. Maybe adata.frame
(with some duplicated entries in some columns) would fit and a complicated class hierarchy is just overkill.Class hierarchy would be:
vs. a
data.frame
where all these slots would be columns.There is the
mzID
package that has a complex class hierarchy and many classes but in fact just turns a mzIdentML file into adata.frame
(nearly identical use case). I don't want to create classes just because it is possible. The user should benefit from them and should be allowed to create modifications forcalculateFragments
and other functions.@lgatto do you have a better idea for the data structure?