define variable modifications in calculateFragments()

adder commented 8 years ago

Hi, It seems you can only define fixed modifications with calculateFragments(). Am I right? If not, how to do it? I think a feature like this could be useful. Posible ways of implementing this are: 1) Provide a vector of masses/modifications with length = peptide sequence length. The position in the vector corresponds to the position in the peptide sequence

2) use lettercode in the peptide sequence and link the lettercode to a modification wit a names list. (much like fragmentpeptide() and PeptideSpectrum() in the OrgMassSpecR R package)

What do you guys think?

sgibb commented 8 years ago

Maybe I don't completely understand your request but at least your second suggestion is already possible:

library("MSnbase")

calculateFragments("ACE", modifications=c(C=57.02146))
# Modifications used: C=57.02146
#          mz ion type pos z seq
# 1  72.04439  b1    b   1 1   A
# 2 232.07504  b2    b   2 1  AC
# 3 361.11763  b3    b   3 1 ACE
# 4 148.06043  y1    y   1 1   E
# 5 308.09108  y2    y   2 1  CE
# 6 379.12819  y3    y   3 1 ACE
# 7 115.06278 y1_   y_   1 1   E
# 8 275.09343 y2_   y_   2 1  CE
# 9 346.13054 y3_   y_   3 1 ACE

calculateFragments("ACE", modifications=c(C=57.02146, A=-1))
# Modifications used: C=57.02146, A=-1
#          mz ion type pos z seq
# 1  71.04439  b1    b   1 1   A
# 2 231.07504  b2    b   2 1  AC
# 3 360.11763  b3    b   3 1 ACE
# 4 148.06043  y1    y   1 1   E
# 5 308.09108  y2    y   2 1  CE
# 6 378.12819  y3    y   3 1 ACE
# 7 115.06278 y1_   y_   1 1   E
# 8 275.09343 y2_   y_   2 1  CE
# 9 345.13054 y3_   y_   3 1 ACE

adder commented 8 years ago

If I get it correctly, calculateFragments("ACE", modifications=c(C=57.02146, A=-1)) still defines a fixed modification. (So all alanines in the peptide sequence (eg. ACAE) will have a -1 mass shift) Sometimes you want to define a variable modification (eg. only the second A is modified)

1) You could this by giving a modification vector with the modification giving at the postion in the vector matching the position in the sequence calculateFragments("ACAE", fixed.modifications=c(C=57.02146),variable.modifications=c(0,0,-1,0))

2) the OrgMassSpecR R package solved this by using a lettercode in the peptide sequence and link the lettercode to a modification with list. (see fragmentpeptide() and PeptideSpectrum() functions) You could this by giving a modification vector with the modification giving at the postion in the vector matching the position in the sequence eg. calculateFragments("ACAmE", fixed.modifications=c(C=57.02146),variable.modifications=c(m = -1))

A more biological relevant use case would be phosphorylation. Eg, serine can get phosporylated but not all serines in the protein will be phosporylated. Say that you have a spectra identified wiht a phosogroup at the 2nd serine in the sequence. It would be nice to be able to label the spectrum using the identified peptide sequence with calculateFragments() I hope I'm clear enough this time :)

sgibb commented 8 years ago

Ok, thanks for the clarification. Great idea. I like your second solution more than the first. For the first you have to know the length of the peptide sequence before (or count it manually).

sgibb commented 8 years ago

@lgatto we changed the way the modification argument was handled in MSnbase => 1.17.6. Currently the modification is added to the amino acid (before the modification replaced the amino acid).

In favor to handle the fixed.modifications and variable.modifications in the same way I would prefer to replace the original mass with the modification (I would use a single argument modification with upper letters for general/fixed amino acid modifications and lower letters for individual/variable modifications). I know this changes are not user-friendly. Or do you have another suggestion? I am sorry for breaking the API so often.

sgibb commented 8 years ago

Sry, we have to rethink the handling of modification once again. We changed to add instead of replace because we allow modifications to Cterm and Nterm: https://github.com/lgatto/MSnbase/issues/47#issuecomment-113902500

for Nterm/Cterm add would be useful (we don't know before which amino acid is at the N/C term, so replace would not be an option)
for variable modifications replace would be useful (otherwise we also need to know that e.g. p is the phosphorylated serine and have to calculate S + p which would require some preprocessing by the user)
for fixed modifications it doesn't really matter (currently add; we changed that in MSnbase 1.17.6; see comment above)

adder commented 8 years ago

Hi, Letters in the sequence to define modification are fine be me. :) Maybe some extra comments: 1) allow multi letter codes (so we are not restricted by the 24 letters in the alphabet to define PTMs) eg, SphosAMPSoxE to define phospo on S and oxidation on second S

2) Not sure what you mean by adding or replacing the original mass with modification mass. But it might be useful to know that modification masses are usually added to original masses. eg the modification defined in the unimod database are defined as mass differences. Not sure if it make a big difference on your code internally

3) Maybe something to consider, nterm en cterm modification might also be variable. If seen following notation in other software. (not sure if there is 'official nomenclature' for this) no modification H-SAMPLE-OH nterm acetylation ace-SAMPLE-OH But since you only have 1 nterm or cterm in peptide sequence, putting it as a fixed modification would work.

4) Finally, since you can work with mzid format in MSnBase. MZidentML also has an specific way to handle modified peptide sequence as defined in MZidentML guidelines. Actually you could use a PSM entry in the mzid format to calculate the fragments and label a spectra. But this is probably something less straightforward to add to the current functions short-term :)

lgatto commented 8 years ago

@sgibb - trying to catch up (still travelling, until next week).

A few thoughts

Following mzIdentML conventions, as mentioned by @adder in the post above could be a winning strategy in the long run. Something to keep in mind for the future, especially as calculateFragments seems to be a much used function, based on the feature requests we have.
I don't like the upper/lower case convention (but I could live with it). I don't think it is a clean interface (probably error prone) and it relies on messy parsing code to disentangle fixed from variable mods. What about having two independent arguments, fixed.modifications and variable.modifications? Or one that defines which mods are variable (fixed = TRUE would be the default, and a user would set something like fixed = c(TRUE, FALSE, TRUE, FALSE).
Finally, we could work on a new calculateFragments2 function, with a new API that supports everything we need, without worrying about backwards compatibility. When that one becomes stable, we can start a formal deprecation/replacement process.

adder commented 8 years ago

@lgatto A reply on your 2nd comment.

Or one that defines which mods are variable (fixed = TRUE would be the default, and a user would set something like fixed = c(TRUE, FALSE, TRUE, FALSE).

You also have to be able to specify where the variable modificiation is positioned. Eg. if you have a sequence with 2 serines. You can have a Phospo group on the first but not on the second serine.

lgatto commented 8 years ago

You also have to be able to specify where the variable modificiation is positioned. Eg. if you have a sequence with 2 serines. You can have a Phospo group on the first but not on the second serine.

Yes, of course. My point here is to handle variable and fixed modifications independently, rather than tangle them together using, for example upper and lower case. These two parameters could be could be lists with positions and masses, ... or anything that makes sense/is convenient in terms of user interface and coding.

@sgibb - what about a Modification class with slots aa, position, mass, ... We could then define a list of Modification (or a Modifications class) the user could choose from, or create their own, and then pass them to the calculateFragments function. I don't want us to over-engineer this, but it might be a reasonable possibility.

sgibb commented 8 years ago

Adding a new class would be surely the cleanest solution and even support unimod would be great but is that what the user really wants?

The mzIdentMl code is very similar to your class suggestion:

<Modification location="10" residues="K" monoisotopicMassDelta="138.0680796"> 
    <cvParam accession="XLMOD:02001" cvRef="XLMOD" name="DSS"/>
    <cvParam accession="MS:1002509" cvRef="PSI-MS" name="cross-link donor" value="11309529182388590588"/>
  </Modification>

Would be something like this pseudocode be more useable?

mod <- Modification(residues="C", location=2, avgMassDelta=57.02146)
calculateFragments("ACE", modifications=mod)

Or for the to be written unimod package:

library(unimod)
m <- unimod::query(id=4)
calculateFragments("ACE", modifications=mod)

sgibb commented 7 years ago

Sorry for being so slow and quiet regarding this topic. I think the best way would be to provide a Modification class (with unimod support). That's why I started (the long planned) unimod package. @adder and @lgatto you are welcome to contribute.

The planned class: https://github.com/ComputationalProteomicsUnit/unimod/blob/master/R/AllClasses.R (nothing implemented yet).

~BTW: @lgatto I wanted to develop it under the umbrella of the CPU but I don't have write access to the CPU/unimod repository. So I had to fork it. We could change this at any time.~

EDIT: urls updated to CPU

lgatto commented 7 years ago

Excellent initiative! I have now given you admin rights to the CPU repo.

yafeng commented 6 years ago

upvote +1 I am also interested to know if this pseudocode has been implemented already? mod <- Modification(residues="C", location=2, avgMassDelta=57.02146) calculateFragments("ACE", modifications=mod)

It can be very useful to define variable modifications when calculating fragment ions.

yafeng commented 6 years ago

For those people who wants to predict MS2 fragments with variable modifications, there is a temporary solution here. https://www.github.com/yafeng/SpectrumAI/ the function predict_MS2_spectrum is written in Spectra_functions.R script. it works like this: fragments = predict_ms2_spectrum("+229.163EGNNPAEN+0.985GDAK+229.163R", product_ion_charge = 1) "+229.163EGNNPAEN+0.985GDAK+229.163R" is the Peptide sequence output from MSGFplus. It only generates b, y ions (no neutral loss, up to charge state 2).

sgibb commented 6 years ago

@yafeng Unfortunately I had no time to implement it yet. I have to fix a few issues in MSnbase before. Next I would focus on unimod to provide a unique interface for modifications. Subsequently I am going to work on this issue. If you have interest in the unimod package help is very welcome.

higsch commented 5 years ago

Hi there, what is the current status of this issue? I recently had to implement mirror plotting of peptide spectra with variable modifications. E.g. only one out of several methionines oxidised. Is there any help needed?

sgibb commented 5 years ago

Unfortunately I have much to do with my real job. We started to implement the unimod modifications in the unimod package. Currently it can download and parse the data from https://unimod.org and can calculate the mass of given sequences with fixed modifications.

We would need this functionality in more than one package (MSnbase, Pbase, topdownr, ...). So your help would be very welcome. Please don't hesitate to create issues in the unimod repository if you want to help and/or need some explanation what we did.

lgatto / MSnbase

define variable modifications in calculateFragments() #167