rformassspectrometry / unimod

Amino acid modifications for mass spectrometry
6 stars 0 forks source link

New unimod developments #7

Open lgatto opened 5 years ago

lgatto commented 5 years ago

Hi @sgibb - following up with our conversation about unimod at the EuroBioc2018 conference, I would suggest the following class (see also #6).

setClass("Modification",
         slots = c(modification = "character",
                   position = "integer",
                   n = "integer"))

where

From there on we could build a `ModificationList. What do you think?

@tfkillian, who's starting to work with me, will have a go once we agree that this is the way forward.

pavel-shliaha commented 5 years ago

A quick input from my collegues (not sure how this can be implemented):

1) There are instances where modification is fixed but only on one residue, i.e. when you know a location of a PTM 2) there are many instances whereby the exact position of a modification is unknown, but it is known there is a certain amount of these modifications is present, e.g. you might know there are 2 Met oxidations on some of the 5 Met

sgibb commented 5 years ago

While I didn't work on this since the EuroBioc 2018 I now pushed my draft with the classes into the ModificationClass branch.

There I have a complex class hierarchy implemented:

Modification (VIRTUAL)
|- ModificationFixed # slot: @site, fixed modification, that is always applied on each occurence
|   |- ModificationFixedLocalized # slot: @globalIndex: apply mod if on the `globalIndex`th amino acid in the sequence, @siteIndex: apply mod if on the `siteIndex` occurence of the specific amino acid/pattern
|   |    `- ModificationVariable # slot: @globalMaxN, @siteMaxN: maximal occurrence in the whole peptide sequence or at a specific amino acid
|   |
|    `- ModificationFixedRegExp # slot @replacement: apply mod and replace amino acid
|
`- ModificationTerm (VIRTUAL)
     |- ModificationNterm # apply N-term modification (e.g. neutral loss)
     `- ModificationCterm # apply C-term modification

The user just needs to call the universal constructor UnimodModification("UnimodDescription") and the corresponding class is generated. Some modifications, e.g. *Met-loss+Acetyl:P-M" are actually combined modifications. That's why UnimodModification always returns a list. A simple example:

x <- "GKHNICHHGFBEOGMFHHLBRRQPQARHKSKFRMIJAGCSDGDQOHGRSKAMTSMBLDAIMGFAGENRGSPB"
m <- c(
    UnimodModification("Met-loss+Acetyl:P-M"),
    UnimodModification("Acetyl:K"),
    UnimodModification("Acetyl:N-term"),
    UnimodModification("Acetyl:S", fixed=FALSE, globalIndex=30, siteIndex=3:5) 
    # 30 40 49 54 70
)
m

sapply(m, class)
# [1] "ModificationFixedRegExp" "ModificationFixedRegExp"
# [3] "ModificationFixed"       "ModificationNterm"      
# [5] "ModificationVariable"   

lapply(m , .countSites, pepseq=x)
# [[1]]
# [1] 0
# 
# [[2]]
# [1] 0
# 
# [[3]]
# [1] 4
# 
# [[4]]
# [1] 1
# 
# [[5]]
# [1] 0 1 2 3 4 5

.pepseq(m[[1]], x)
# [1] "GKHNICHHGFBEOGMFHHLBRRQPQARHKSKFRMIJAGCSDGDQOHGRSKAMTSMBLDAIMGFAGENRGSPB"

deltaMass(m, x)
# [[1]]
# [1] 0
# 
# [[2]]
# [1] 0
# 
# [[3]]
# [1] 168.0423
# 
# [[4]]
# [1] 42.01056
# 
# [[5]]
# [1]   0.00000  42.01056  84.02113 126.03169 168.04226 210.05282

@tfkillian: Feel free to completely remove this draft or build upon it. Your are welcome to ask any question. I am happy to help and review your code if you wish.

(I wrote this at the train journey back home from the EuroBioc2018. I didn't touch it since then. Unfortunately there is no documentation and no unit tests in the ModificationClass branch.)


Some notes I had taken at the EuroBioc 2018 (some may be outdated or do not correspond to the draft in the ModificationClass branch.

2 use cases: modify mass and modify sequence (somehow coupled)

modify mass

unimod:::.mass("MACE",
               fixedModifications="Carbamidomethyl:C")

## [1] 491.1508
## attr(,"sequence")
## [1] "MACE"

unimod:::.mass(c("ACE", "MACE", "CDE"),
               fixedModifications=c("Acetyl:N-term",
                                    "Carbamidomethyl:C"))

## [1] 402.1209 533.1614 446.1107
## attr(,"sequence")
## [1] "ACE"  "MACE" "CDE"
K methyl 14.015650
K dimethyl 28.031300
K trimethyl 42.046950

modify sequence

Some modifications like Met-loss modify the sequence, e.g. remove M:

unimod:::.mass("MACE", fixedModifications="Met-loss:P-M")

## [1] 303.0889
## attr(,"sequence")
## [1] "ACE"

Problems

User Interface for Modification Input

Currently all unimod modifications are in a data.frame:

head(unimod::modifications)

##                              Id UnimodId   Name Description Composition
## Acetyl:K               Acetyl:K        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:N-term     Acetyl:N-term        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:C               Acetyl:C        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:S               Acetyl:S        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:P-N-term Acetyl:P-N-term        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:T               Acetyl:T        1 Acetyl Acetylation H(2) C(2) O
##                 AvgMass MonoMass   Site       Position     Classification
## Acetyl:K        42.0367 42.01056      K       Anywhere           Multiple
## Acetyl:N-term   42.0367 42.01056 N-term     Any N-term           Multiple
## Acetyl:C        42.0367 42.01056      C       Anywhere Post-translational
## Acetyl:S        42.0367 42.01056      S       Anywhere Post-translational
## Acetyl:P-N-term 42.0367 42.01056 N-term Protein N-term Post-translational
## Acetyl:T        42.0367 42.01056      T       Anywhere Post-translational
##                 SpecGroup NeutralLoss        LastModified Approved Hidden
## Acetyl:K                1       FALSE 2017-11-08 16:08:56     TRUE  FALSE
## Acetyl:N-term           2       FALSE 2017-11-08 16:08:56     TRUE  FALSE
## Acetyl:C                3       FALSE 2017-11-08 16:08:56     TRUE   TRUE
## Acetyl:S                4       FALSE 2017-11-08 16:08:56     TRUE   TRUE
## Acetyl:P-N-term         5       FALSE 2017-11-08 16:08:56     TRUE  FALSE
## Acetyl:T                6       FALSE 2017-11-08 16:08:56     TRUE   TRUE

… and the user just needs to specify the Id:

unimod:::.mass("KKK", fixedModifications="Acetyl:K")

## [1] 510.3166
## attr(,"sequence")
## [1] "KKK"

To add custom modifications the user just need to supply a site (which aminoacid should be modified) and the average and/or monoisotopic mass of his modification (currently not supported but would easily fit in the data.frame structure):


For fixed modification on a specific site we would need an additional column in that data.frame, e.g. location that stores the index in the peptide sequence that should be modified.

For variable modifications more columns would be useful: maximal number of allowed modifications (e.g. 3), maximal number of same modifications on one specific site (e.g. trimethylated K)

User Interface for output

While fixed all/fixed specific modifications produce just one sequence and one mass result the output of variable modifications could have multiple sequences/mass.

While for the mass calculation it doesn’t matter if the 1st, 2nd, 3rd, … K is modified (if there is just one modification) it matters for the sequence (especially if you are interested in fragments like in MSnbase or topdownr).

lgatto commented 5 years ago

@pavel-shliaha these are use cases we are indeed considering. Thanks for the input.

sgibb commented 5 years ago

@pavel-shliaha sorry for ignoring you input in my first comment:

  1. There are instances where modification is fixed but only on one residue, i.e. when you know a location of a PTM

That's what I called a fixed and localized modification and is supported by the ModificationFixedLocalized class in the current draft implementation.

  1. there are many instances whereby the exact position of a modification is unknown, but it is known there is a certain amount of these modifications is present, e.g. you might know there are 2 Met oxidations on some of the 5 Met

That's what I called a variable modification and should be supported by the ModificationVariable class.

lgatto commented 5 years ago

Let me start by providing a few more implementation details:

setClassUnion("CharOrInt", c("character", "integer"))

.Modifiction <- setClass("Modification",
                         slots = c(modification = "character",
                                   n = "integer",
                                   position = "CharOrInt"))

## Constructor, setting defaults and checking inputs
Modification <- function(mod, n = NA_integer_, pos = NA_integer_) {
    if (missing(mod))
        stop("Please defined your modification")
    ## check that mod is a valid modification
    if (grepl("N-term", mod)) pos <- "N-term"
    if (grepl("C-term", mod)) pos <- "C-term"
    if (anyNA(n)) {
        n <- NA_integer_
    } else {
        if (length(n) == 1 && n <= 0)
            stop("Can't define an absent")
        if (any(n < 0))
            stop("Can't define a negative number of modifications")
    }
    .Modifiction(modification = mod, n = n, position = pos)
}

And now some definitions:

modType <- function(object) {
    type <- rep(FALSE, 4)
    names(type) <- c("variable", "fixed", "multiple", "positional")
    if (!is.na(object@position)) type["positional"] <- TRUE
    if (!anyNA(object@n)) {
        if (length(object@n) == 2 && identical(object@n, 0:1)) type["variable"] <- TRUE
        if (length(object@n) == 1 && object@n > 0) type["fixed"] <- TRUE
        if (length(object@n) > 1 && all(object@n > 0)) type["multiple"] <- TRUE
    }
    type
}

setMethod("show", "Modification",
          function(object) {
              type <- modType(object)
              cat(object@modification, " is ",
                  paste(names(type[type]), collapse = ", "),
                  ".\n", sep = "")
          })

Demonstration:

> Modification("Acetyl:N-term")
Acetyl:N-term is positional.
> Modification("Acetyl:C-term", n = 2L)
Acetyl:C-term is fixed, positional.
> Modification("Acetyl:K", n = 0:1)
Acetyl:K is variable.
> Modification("Acetyl:C-term", n = 0:1)
Acetyl:C-term is variable, positional.
> Modification("Methyl:S", pos = 2L, n = 0:3)
Methyl:S is positional.
> Modification("Methyl:S", pos = 2L, n = 1L)
Methyl:S is fixed, positional.
> Modification("Methyl:S", pos = 2L, n = 0:1)
Methyl:S is variable, positional.

The examples I consider here are a simplification compared to what is defined in unimod, hence I don't have all the if/else cases as in the ModificationClass branch classes. What I like is that we define the different types more formally, and we avoid a class hierarchy that I'm not convinced is needed (possibly a bit over-engineered). But again, I don't consider all cases, so may be this won't scale.

@sgibb what do you think?

sgibb commented 5 years ago

a class hierarchy that I'm not convinced is needed (possibly a bit over-engineered).

That's possible true.

While your current approach looks very simple it would work for most of the cases I guess. And I really like your fixed, variable, multiple, positional-approach.

But what if the user wants a modification of the second K? (your @position is my @globalIndex but what about the @siteIndex?)

I don't really like the CharOrInt definition for position because you will always have to use if (is.numeric(...)) or if (is.character(x@position) && x@position == "N-term") in all functions. (OK you could argue that it doesn't matter if we frequently call if/else or dispatching methods).

lgatto commented 5 years ago

Indeed, I don't considered a siteIndex type of input. I imagine it could be added.

But is this a real use case/requested feature?

Re the int/char slot, we could use 0 or 1 and Inf. What do you think of that?

lgatto commented 5 years ago

Thinking back at this, I think we could drop the siteIndex slot and replace it with a findIndex(pepseq, aa, i) function (also findFirstIndex and findLastIndex) that returns that global/absolute position.

sgibb commented 5 years ago

But is this a real use case/requested feature?

I more or less assumed it because of this comment: https://github.com/lgatto/MSnbase/issues/167#issuecomment-256629619

Re the int/char slot, we could use 0 or 1 and Inf.

Mh, that would just change if(is.character(x@position) && x@position == "C-term") into if(!is.finite(x@position)). So your first approach (char/int) is much easier to read and to interpret. So keep it like that.

I agree with your:

setClass("Modification",
         slots = c(modification = "character",
                   position = "integer",
                   n = "integer"))

It is much simpler than my class hierarchy and currently I can't think of an example where it wouldn't work.