introducing PTMs and internal fragments

pavel-shliaha commented 6 years ago

It seems we already have functionality to predict internal fragments, but is it implemented in topdownr? If not please implement it and I will start testing on our datasets.

About PTMs. We need to be able to do:

1) specify PTMs on particular residues

a) PTM on a particular residue (e.g. trimethylation of Lys 4)
b) PTM on all residues of a particular type (e.g. trimethylation of all Lys)

b) is a special case of a). Thus we need a solution that allows any Lys to be specified as carrying this modification.

2) Specify PTMs as either fixed or variable. Variable means that we should be able to get fragments with and without the modification. So fixed K4me3 means that N-terminal fragments 4, 5, 6, etc. all have this modification. Variable K4me3 means that two sets of N-terminal fragments have to be generated: 4, 5, 6, etc. that have this modification and 4, 5, 6, etc. that don't.

3) PTMs should have a cumulative effect: if I specify 2 PTMs as variable, then certain fragments will have 4 variants:

-- -+ +- ++ where + designates PTM presence
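The presence/absence combinations for a set of variable PTMs can be enumerated mechanically. The following is a minimal sketch in base R; the PTM labels `K4me3` and `K9ac` are made-up examples and not part of any package.

```r
# Hypothetical labels for two variable PTMs; any names would do.
ptms <- c("K4me3", "K9ac")

# One logical column per PTM; each row is one presence/absence combination.
variants <- expand.grid(
    setNames(rep(list(c(FALSE, TRUE)), length(ptms)), ptms)
)

# Label each row with "+" (present) or "-" (absent), e.g. "-+".
variants$label <- apply(
    variants[ptms], 1L,
    function(present) paste(ifelse(present, "+", "-"), collapse = "")
)

variants
```

Each additional variable PTM doubles the number of rows, which is the growth discussed further down in the thread.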

sgibb commented 6 years ago

> It seems we already have functionality to predict internal fragments, but is it implemented in topdownr? If not please implement it and I will start testing on our datasets.

It is partly implemented in MSnbase (but we were waiting for your review, so it is not in the official MSnbase version). Could you please reread the following issue: https://github.com/lgatto/MSnbase/issues/82 and answer there?

The PTM part should go into the yet-to-be-written unimod package; we partly discussed it in this issue: https://github.com/ComputationalProteomicsUnit/unimod/issues/6

> specify PTMs on particular residues
>
> a) PTM on a particular residue (e.g. trimethylation of Lys 4)
> b) PTM on all residues of a particular type (e.g. trimethylation of all Lys)

OK, I think this is quite easy. (I am not yet sure what a good user interface would look like.)

> Specify PTMs as either fixed or variable. Variable means that we should be able to get fragments with and without the modification. So fixed K4me3 means that N-terminal fragments 4, 5, 6, etc. all have this modification. Variable K4me3 means that two sets of N-terminal fragments have to be generated: 4, 5, 6, etc. that have this modification and 4, 5, 6, etc. that don't.

Just to recap: fixed means all fragments have this modification, and variable means we should calculate each fragment with and without the modification.

> PTMs should have a cumulative effect: if I specify 2 PTMs as variable, then certain fragments will have 4 variants

That's fine for a few PTMs, but because the number of variants grows with 2^n it would not be feasible for a large number of modifications.

pavel-shliaha commented 6 years ago

I think it would be much simpler if internal fragments were implemented in topdownr. That way I can test them systematically on thousands of spectra, which will be more accurate than looking at spectra one at a time.

Yes, your definition of fixed and variable is correct.

About the growing number of PTMs: it's actually even worse. Lys can be modified by more than 5 PTMs, so if you specify them as variable you get 5^n.
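To get a feeling for how quickly this grows, here is a small back-of-the-envelope sketch. It assumes every modifiable site can independently carry one of `n_ptms` alternative PTMs or stay unmodified, which makes the base one larger than the number of PTMs; both the function and its argument names are invented for illustration.

```r
# n_sites: number of modifiable residues covered by a fragment (assumption)
# n_ptms:  number of alternative PTMs each site can carry (assumption)
# Each site is either unmodified or carries exactly one of the PTMs.
count_variants <- function(n_sites, n_ptms) {
    (n_ptms + 1)^n_sites
}

# Two variable PTMs on two different sites give the 4 variants listed earlier.
two_sites <- count_variants(n_sites = 2, n_ptms = 1)

# Five alternative PTMs per lysine, for 1 to 5 lysines in a fragment.
lysines <- count_variants(n_sites = 1:5, n_ptms = 5)
```

Under these assumptions five lysines already yield several thousand mass variants for a single fragment, so exhaustive enumeration is only practical for a handful of sites.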

mbabovic commented 3 years ago

I used topdownr to record datasets for several glycoproteins. Having more PTMs and allowing variable modifications would be very useful there, as it is possible to have backbone fragments that have lost part of the glycan or the entire glycan.

Are there any updates regarding this issue in MSnbase or the unimod package that could be implemented in topdownr? If there still isn't any elegant way of adding this functionality with other packages, could we implement something simple as an intermediate solution?

This seems to work for the proteins that I used (with up to 3 variable modifications):

1) Allowing the user to select "Custom" as modification in readTopDownFiles, in addition to the ones already defined.

2) If this option is selected, each modification can be defined by its mass, position, and as variable or fixed. So the user would need to enter something like this:

readTopDownFiles(path, modifications="Custom",
                 modificationMass=c(365.1322, 42.0106),
                 modificationLocation=c(10, "N-term"),
                 modificationVariable=c(TRUE, FALSE),
                 conditions="ScanDescription")

3) Modifying the calculateFragments function to loop over the custom modifications, add the mass, and rename the fragments that changed. Modified fragments would be renamed so that, for example, we can have b15, b15_m1, b15_m2, and b15_m1_m2. This would allow filtering for fragments that contain a specific modification later during data analysis. Renaming could be limited to variable modifications, because with only fixed modifications there are no alternative masses for b15 (not counting neutral losses).

# x   - data frame created by MSnbase::calculateFragments()
# m   - mass of the modification
# p   - position (residue number, "N-term" or "C-term")
# v   - is the modification variable?
# seq - protein sequence
# i   - order number of the modification
.modify <- function(x, m, p, v, seq, i) {

  x1 <- x
  l <- nchar(seq)

  # translate the position; it may arrive as a string such as "10" because
  # c(10, "N-term") is coerced to character
  p <- switch(as.character(p), "N-term" = 0, "C-term" = l + 1, as.numeric(p))

  # find the fragments that contain the modified position
  abc <- grepl("a|b|c", x$type) & (x$pos >= p)
  xyz <- grepl("x|y|z", x$type) & (l - x$pos) < p
  sel <- abc | xyz

  # add the modification mass (a plain addition is only right for z = 1)
  x$mz[sel] <- x$mz[sel] + m

  # modified fragments are renamed according to the modifications they carry
  x$ion[sel] <- paste0(x$ion[sel], "_m", i)
  x$type[sel] <- paste0(x$type[sel], "_m", i)

  # if the modification is variable, the unmodified fragments are kept as well
  if (v) dplyr::distinct(rbind(x, x1)) else x
}

4) Redefining allowed types for nterm and cterm fragments in the ncbMap function so that the modified ones are counted when the coverage is calculated.

sgibb commented 3 years ago

Sorry for the late reply.

> Are there any updates regarding this issue in MSnbase or the unimod package that could be implemented in topdownr?

Unfortunately not.

> If there still isn't any elegant way of adding this functionality with other packages, could we implement something simple as an intermediate solution?

That would be possible of course.

> This seems to work for the proteins that I used (with up to 3 variable modifications):

Thanks for your .modify suggestion. Are you running this multiple times (for each additional modification once)? Otherwise I don't understand how you get your _m1_m2 modifications.

Do you need many different modifications, or would it be easier to add one or two new ones to the current predefined modifications?

mbabovic commented 3 years ago

> That would be possible of course.

Thanks for the reply! This would be great.

> Thanks for your .modify suggestion. Are you running this multiple times (for each additional modification once)? Otherwise I don't understand how you get your _m1_m2 modifications.

Yes, running it for every modification.
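To illustrate how one pass per modification produces names like `_m1_m2`, here is a simplified, base-R-only sketch. It is not the `.modify` function itself: `modify_once`, the toy fragment table and all masses are invented for the example, and only N-terminal fragments are handled.

```r
# Toy stand-in for the output of MSnbase::calculateFragments(); the m/z values
# are made up and only the columns used below are included.
frags <- data.frame(
    mz = c(100, 200, 300),
    ion = c("b2", "b5", "b8"),
    type = "b",
    pos = c(2L, 5L, 8L),
    stringsAsFactors = FALSE
)

# Hypothetical helper: shift and rename every fragment that contains residue
# `pos`; for a variable modification keep the unmodified rows as well.
modify_once <- function(x, mass, pos, variable, i) {
    hit <- x$pos >= pos
    modified <- x
    modified$mz[hit] <- modified$mz[hit] + mass
    modified$ion[hit] <- paste0(modified$ion[hit], "_m", i)
    modified$type[hit] <- paste0(modified$type[hit], "_m", i)
    if (variable) unique(rbind(modified, x)) else modified
}

# Two variable modifications at residues 4 and 7 (made-up masses).
mods <- data.frame(mass = c(10, 20), pos = c(4L, 7L), variable = c(TRUE, TRUE))

# One pass per modification; each pass works on the result of the previous one.
result <- frags
for (i in seq_len(nrow(mods))) {
    result <- modify_once(result, mods$mass[i], mods$pos[i], mods$variable[i], i)
}

result[order(result$pos, result$ion), c("ion", "mz")]
```

Because the second pass sees both `b8` and `b8_m1`, it emits `b8_m2` and `b8_m1_m2` alongside them, while `b5` only ever gains the first modification.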

> Do you need many different modifications, or would it be easier to add one or two new ones to the current predefined modifications?

I have about 15 new ones to add. Some of them are easy to implement in the same way that the currently predefined modifications were added, but it becomes complicated for glycoproteins. Currently, the modification site is localized with a regular expression, and this is great for modifications where all residues that fit a certain pattern are modified. On the other hand, if only specific residues are modified, the modification has to be hardcoded to fit only that protein. For example, if a glycoprotein has 2 N-glycosylation sites with a different glycan attached to each, defining these glycans as PTMs becomes complicated. They cannot be added to every asparagine residue, nor even to every residue that fits the N-glycosylation motif. And if it is hardcoded for the specific site on a specific protein, this PTM definition is not useful for other proteins. That is why I think having this "custom" option would be useful in addition to the predefined modifications.

sgibb commented 3 years ago

Sorry for the huge delay. There is a new branch where I implemented your suggestion. You could install it via:

devtools::install_github("sgibb/topdownr@cusmod")

readTopDownFiles gains a new argument customModifications which should be a data.frame with the columns mass, name, location and variable:

readTopDownFiles(
    ...
    customModifications = data.frame(
        mass = c(365.1322, 42.0106),
        name = c("M1", "Acetyl"),
        location = c(10, "N-term"),
        variable = c(TRUE, FALSE)
    ),
    ...
)
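The data.frame can be built and inspected on its own before it is passed on. The sketch below uses only base R; the final two lines are a hypothetical sanity check, not something the package is known to do.

```r
customModifications <- data.frame(
    mass = c(365.1322, 42.0106),
    name = c("M1", "Acetyl"),
    location = c(10, "N-term"),
    variable = c(TRUE, FALSE),
    stringsAsFactors = FALSE
)

# Mixing a residue number with "N-term" coerces the whole location column to
# character, so 10 is stored as "10".
str(customModifications)

# Hypothetical check: separate terminal labels from residue numbers.
is_terminal <- customModifications$location %in% c("N-term", "C-term")
residue <- as.integer(customModifications$location[!is_terminal])
```

Whether labels other than the ones shown in the thread are accepted for `location` is not stated here, so `"C-term"` in the check is an assumption carried over from the earlier `.modify` proposal.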

@mbabovic It would be great if you could test this. If you are satisfied with this solution I will push it to Bioconductor.

> Redefining allowed types for nterm and cterm fragments in the ncbMap function so that the modified ones are counted when the coverage is calculated.

.ncbMap now uses grepl("^a|^b|^c", fragmentTypes(x)) instead of fragmentTypes(x) %in% c("a", "b", "c") (similar for C term) to count modifications and neutral losses.
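The difference between the two tests is easy to see on a plain character vector. In this sketch `types` stands in for the result of `fragmentTypes(x)`, and the suffixed names follow the `_m1` naming proposed earlier in the thread; the exact labels produced by the released version are an assumption.

```r
# Stand-in for fragmentTypes(x); suffixed entries mimic modified fragments.
types <- c("a", "b", "b_m1", "c", "x", "y", "y_m1_m2", "z")

# Exact matching only recognises the bare fragment types.
exact <- types %in% c("a", "b", "c")

# Prefix matching also counts types that carry a suffix.
prefix <- grepl("^a|^b|^c", types)

types[exact]
types[prefix]
```

With exact matching `b_m1` would be dropped from the N-terminal coverage, whereas the prefix pattern keeps it.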

mbabovic commented 3 years ago

@sgibb Thank you! I've tested it and it works as expected.

sgibb commented 3 years ago

@mbabovic great! I pushed the version to bioc. Should be available tomorrow.

sgibb / topdownr

introducing PTMs and internal fragments #75
