Open lgatto opened 5 years ago
A quick input from my collegues (not sure how this can be implemented):
1) There are instances where modification is fixed but only on one residue, i.e. when you know a location of a PTM 2) there are many instances whereby the exact position of a modification is unknown, but it is known there is a certain amount of these modifications is present, e.g. you might know there are 2 Met oxidations on some of the 5 Met
While I didn't work on this since the EuroBioc 2018 I now pushed my draft with the classes into the ModificationClass branch.
There I have a complex class hierarchy implemented:
Modification (VIRTUAL)
|- ModificationFixed # slot: @site, fixed modification, that is always applied on each occurence
| |- ModificationFixedLocalized # slot: @globalIndex: apply mod if on the `globalIndex`th amino acid in the sequence, @siteIndex: apply mod if on the `siteIndex` occurence of the specific amino acid/pattern
| | `- ModificationVariable # slot: @globalMaxN, @siteMaxN: maximal occurrence in the whole peptide sequence or at a specific amino acid
| |
| `- ModificationFixedRegExp # slot @replacement: apply mod and replace amino acid
|
`- ModificationTerm (VIRTUAL)
|- ModificationNterm # apply N-term modification (e.g. neutral loss)
`- ModificationCterm # apply C-term modification
The user just needs to call the universal constructor UnimodModification("UnimodDescription")
and the corresponding class is generated. Some modifications, e.g. *Met-loss+Acetyl:P-M" are actually combined modifications. That's why UnimodModification
always returns a list
.
A simple example:
x <- "GKHNICHHGFBEOGMFHHLBRRQPQARHKSKFRMIJAGCSDGDQOHGRSKAMTSMBLDAIMGFAGENRGSPB"
m <- c(
UnimodModification("Met-loss+Acetyl:P-M"),
UnimodModification("Acetyl:K"),
UnimodModification("Acetyl:N-term"),
UnimodModification("Acetyl:S", fixed=FALSE, globalIndex=30, siteIndex=3:5)
# 30 40 49 54 70
)
m
sapply(m, class)
# [1] "ModificationFixedRegExp" "ModificationFixedRegExp"
# [3] "ModificationFixed" "ModificationNterm"
# [5] "ModificationVariable"
lapply(m , .countSites, pepseq=x)
# [[1]]
# [1] 0
#
# [[2]]
# [1] 0
#
# [[3]]
# [1] 4
#
# [[4]]
# [1] 1
#
# [[5]]
# [1] 0 1 2 3 4 5
.pepseq(m[[1]], x)
# [1] "GKHNICHHGFBEOGMFHHLBRRQPQARHKSKFRMIJAGCSDGDQOHGRSKAMTSMBLDAIMGFAGENRGSPB"
deltaMass(m, x)
# [[1]]
# [1] 0
#
# [[2]]
# [1] 0
#
# [[3]]
# [1] 168.0423
#
# [[4]]
# [1] 42.01056
#
# [[5]]
# [1] 0.00000 42.01056 84.02113 126.03169 168.04226 210.05282
@tfkillian: Feel free to completely remove this draft or build upon it. Your are welcome to ask any question. I am happy to help and review your code if you wish.
(I wrote this at the train journey back home from the EuroBioc2018. I didn't touch it since then. Unfortunately there is no documentation and no unit tests in the ModificationClass branch.)
Some notes I had taken at the EuroBioc 2018 (some may be outdated or do not correspond to the draft in the ModificationClass
branch.
2 use cases: modify mass and modify sequence (somehow coupled)
unimod:::.mass("MACE",
fixedModifications="Carbamidomethyl:C")
## [1] 491.1508
## attr(,"sequence")
## [1] "MACE"
unimod:::.mass(c("ACE", "MACE", "CDE"),
fixedModifications=c("Acetyl:N-term",
"Carbamidomethyl:C"))
## [1] 402.1209 533.1614 446.1107
## attr(,"sequence")
## [1] "ACE" "MACE" "CDE"
Fixed modification on a specific residual (e.g. just modify the 2nd K).
Variable modification (modification on the 1st and/or 2nd, and/or 3rd … K or none at all, becomes more complicated for multiple modifications, e.g. up to 3 modifications like methylation and/or acetylation on 1st/2nd/3rd/… K ).
K methyl 14.015650
K dimethyl 28.031300
K trimethyl 42.046950
Some modifications like Met-loss modify the sequence, e.g. remove M:
unimod:::.mass("MACE", fixedModifications="Met-loss:P-M")
## [1] 303.0889
## attr(,"sequence")
## [1] "ACE"
Currently all unimod modifications are in a data.frame
:
head(unimod::modifications)
## Id UnimodId Name Description Composition
## Acetyl:K Acetyl:K 1 Acetyl Acetylation H(2) C(2) O
## Acetyl:N-term Acetyl:N-term 1 Acetyl Acetylation H(2) C(2) O
## Acetyl:C Acetyl:C 1 Acetyl Acetylation H(2) C(2) O
## Acetyl:S Acetyl:S 1 Acetyl Acetylation H(2) C(2) O
## Acetyl:P-N-term Acetyl:P-N-term 1 Acetyl Acetylation H(2) C(2) O
## Acetyl:T Acetyl:T 1 Acetyl Acetylation H(2) C(2) O
## AvgMass MonoMass Site Position Classification
## Acetyl:K 42.0367 42.01056 K Anywhere Multiple
## Acetyl:N-term 42.0367 42.01056 N-term Any N-term Multiple
## Acetyl:C 42.0367 42.01056 C Anywhere Post-translational
## Acetyl:S 42.0367 42.01056 S Anywhere Post-translational
## Acetyl:P-N-term 42.0367 42.01056 N-term Protein N-term Post-translational
## Acetyl:T 42.0367 42.01056 T Anywhere Post-translational
## SpecGroup NeutralLoss LastModified Approved Hidden
## Acetyl:K 1 FALSE 2017-11-08 16:08:56 TRUE FALSE
## Acetyl:N-term 2 FALSE 2017-11-08 16:08:56 TRUE FALSE
## Acetyl:C 3 FALSE 2017-11-08 16:08:56 TRUE TRUE
## Acetyl:S 4 FALSE 2017-11-08 16:08:56 TRUE TRUE
## Acetyl:P-N-term 5 FALSE 2017-11-08 16:08:56 TRUE FALSE
## Acetyl:T 6 FALSE 2017-11-08 16:08:56 TRUE TRUE
… and the user just needs to specify the Id:
unimod:::.mass("KKK", fixedModifications="Acetyl:K")
## [1] 510.3166
## attr(,"sequence")
## [1] "KKK"
To add custom modifications the user just need to supply a site (which
aminoacid should be modified) and the average and/or monoisotopic
mass of his modification (currently not supported but would easily fit
in the data.frame
structure):
For fixed modification on a specific site we would need an additional
column in that data.frame
, e.g. location that stores the index in the
peptide sequence that should be modified.
For variable modifications more columns would be useful: maximal number of allowed modifications (e.g. 3), maximal number of same modifications on one specific site (e.g. trimethylated K)
While fixed all/fixed specific modifications produce just one sequence and one mass result the output of variable modifications could have multiple sequences/mass.
While for the mass calculation it doesn’t matter if the 1st, 2nd, 3rd, …
K is modified (if there is just one modification) it matters for the
sequence (especially if you are interested in fragments like in
MSnbase
or topdownr
).
@pavel-shliaha these are use cases we are indeed considering. Thanks for the input.
@pavel-shliaha sorry for ignoring you input in my first comment:
- There are instances where modification is fixed but only on one residue, i.e. when you know a location of a PTM
That's what I called a fixed and localized modification and is supported by the ModificationFixedLocalized
class in the current draft implementation.
- there are many instances whereby the exact position of a modification is unknown, but it is known there is a certain amount of these modifications is present, e.g. you might know there are 2 Met oxidations on some of the 5 Met
That's what I called a variable modification and should be supported by the ModificationVariable
class.
Let me start by providing a few more implementation details:
setClassUnion("CharOrInt", c("character", "integer"))
.Modifiction <- setClass("Modification",
slots = c(modification = "character",
n = "integer",
position = "CharOrInt"))
## Constructor, setting defaults and checking inputs
Modification <- function(mod, n = NA_integer_, pos = NA_integer_) {
if (missing(mod))
stop("Please defined your modification")
## check that mod is a valid modification
if (grepl("N-term", mod)) pos <- "N-term"
if (grepl("C-term", mod)) pos <- "C-term"
if (anyNA(n)) {
n <- NA_integer_
} else {
if (length(n) == 1 && n <= 0)
stop("Can't define an absent")
if (any(n < 0))
stop("Can't define a negative number of modifications")
}
.Modifiction(modification = mod, n = n, position = pos)
}
And now some definitions:
A variable modification is a modification can be present. n
must contain a 0.
A fixed modification must be present a defined number of time. n
must be of length 1.
I suggest we call a modification that can appear a variable number of times (including 0 time) a multiple modification. A variable modification is essentially a special case of a multiple modification, where the multiple is 0 or 1.
Note that a fixed modification can't be multiple, but it could carry a fixed number > 1 modification (i.e. n >= 1
below but must length(n) == 1
).
A modification can be anywhere or at a defined position, and I suggest we call the latter positional modification. A positional modification can be either fixed or variable or multiple. If defined, the position is either "N-term"
, "C-term"
or i
, where 1 <= i <= nchar(pepseq)
, when the peptide sequence is available.
modType <- function(object) {
type <- rep(FALSE, 4)
names(type) <- c("variable", "fixed", "multiple", "positional")
if (!is.na(object@position)) type["positional"] <- TRUE
if (!anyNA(object@n)) {
if (length(object@n) == 2 && identical(object@n, 0:1)) type["variable"] <- TRUE
if (length(object@n) == 1 && object@n > 0) type["fixed"] <- TRUE
if (length(object@n) > 1 && all(object@n > 0)) type["multiple"] <- TRUE
}
type
}
setMethod("show", "Modification",
function(object) {
type <- modType(object)
cat(object@modification, " is ",
paste(names(type[type]), collapse = ", "),
".\n", sep = "")
})
Demonstration:
> Modification("Acetyl:N-term")
Acetyl:N-term is positional.
> Modification("Acetyl:C-term", n = 2L)
Acetyl:C-term is fixed, positional.
> Modification("Acetyl:K", n = 0:1)
Acetyl:K is variable.
> Modification("Acetyl:C-term", n = 0:1)
Acetyl:C-term is variable, positional.
> Modification("Methyl:S", pos = 2L, n = 0:3)
Methyl:S is positional.
> Modification("Methyl:S", pos = 2L, n = 1L)
Methyl:S is fixed, positional.
> Modification("Methyl:S", pos = 2L, n = 0:1)
Methyl:S is variable, positional.
The examples I consider here are a simplification compared to what is defined in unimod
, hence I don't have all the if/else
cases as in the ModificationClass
branch classes. What I like is that we define the different types more formally, and we avoid a class hierarchy that I'm not convinced is needed (possibly a bit over-engineered). But again, I don't consider all cases, so may be this won't scale.
@sgibb what do you think?
a class hierarchy that I'm not convinced is needed (possibly a bit over-engineered).
That's possible true.
While your current approach looks very simple it would work for most of the cases I guess. And I really like your fixed, variable, multiple, positional-approach.
But what if the user wants a modification of the second K? (your @position
is my @globalIndex
but what about the @siteIndex
?)
I don't really like the CharOrInt
definition for position
because you will always have to use if (is.numeric(...))
or if (is.character(x@position) && x@position == "N-term")
in all functions.
(OK you could argue that it doesn't matter if we frequently call if/else
or dispatching methods).
Indeed, I don't considered a siteIndex
type of input. I imagine it could be added.
But is this a real use case/requested feature?
Re the int/char slot, we could use 0 or 1 and Inf. What do you think of that?
Thinking back at this, I think we could drop the siteIndex
slot and replace it with a findIndex(pepseq, aa, i)
function (also findFirstIndex
and findLastIndex
) that returns that global/absolute position.
But is this a real use case/requested feature?
I more or less assumed it because of this comment: https://github.com/lgatto/MSnbase/issues/167#issuecomment-256629619
Re the int/char slot, we could use 0 or 1 and Inf.
Mh, that would just change if(is.character(x@position) && x@position == "C-term")
into if(!is.finite(x@position))
. So your first approach (char/int) is much easier to read and to interpret. So keep it like that.
I agree with your:
setClass("Modification",
slots = c(modification = "character",
position = "integer",
n = "integer"))
It is much simpler than my class hierarchy and currently I can't think of an example where it wouldn't work.
Hi @sgibb - following up with our conversation about
unimod
at the EuroBioc2018 conference, I would suggest the following class (see also #6).where
modification
refers to a specific modification inunimod::modifications
.position
is an optionalinteger
defining where in a peptide sequence a modification is expected. Default would beNA
, for a modification to be present on any of the amino acids of that peptiden
is aninteger(0 - n)
where0:1
defines a variable modification,1
a fixed modification, and1 - n
the case where 1 to n modifications can be present (@pavel-shliaha's example of methyl, dimethyl, trimethyl)From there on we could build a `ModificationList. What do you think?
@tfkillian, who's starting to work with me, will have a go once we agree that this is the way forward.