levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Improve and document modX terminal group support #152

Open levitsky opened 2 weeks ago

levitsky commented 2 weeks ago

This PR is in response to user feedback on the mailing list: https://groups.google.com/g/pyteomics/c/X__Vjy_d6r8

The problem

  1. It is easy to copy a modification's composition from Unimod into aa_comp as a terminal modification. This produces incorrect compositions and masses, because terminal modifications must be specified by the full group composition, while normal side-chain modifications are reduced by a hydrogen (as they represent a difference in compositions).
  2. This potential error is not documented.
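The arithmetic behind item 1 can be sketched without Pyteomics. The example below uses plain `collections.Counter` objects and a small table of monoisotopic masses; the variable names are illustrative assumptions, not Pyteomics identifiers. It takes the Unimod acetyl delta (H(2) C(2) O) and the full acetyl group (C2H3O), and builds N-acetylglycine both ways.

```python
from collections import Counter

# Monoisotopic masses of the elements used here
MONO = {"H": 1.00782503207, "C": 12.0, "N": 14.0030740048, "O": 15.99491461956}

def mass(comp):
    """Sum the monoisotopic mass of an element -> count mapping."""
    return sum(MONO[element] * count for element, count in comp.items())

glycine_residue = Counter({"C": 2, "H": 3, "N": 1, "O": 1})
c_term_oh = Counter({"O": 1, "H": 1})

# Unimod lists acetylation as a *difference* in composition: H(2) C(2) O
unimod_acetyl_delta = Counter({"C": 2, "H": 2, "O": 1})
# A terminal group replaces the default "H-" and must be the full group
acetyl_group = Counter({"C": 2, "H": 3, "O": 1})

correct = acetyl_group + glycine_residue + c_term_oh       # C4H7NO3
wrong = unimod_acetyl_delta + glycine_residue + c_term_oh  # one H short

print(dict(correct), mass(correct))
print("mass error:", mass(correct) - mass(wrong))
```

The composition built from the Unimod delta is short by exactly one hydrogen, which is the error described above.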

The proposed solution

  1. Expand documentation and warn the user.
  2. Try to correct for this error: if a modification is specified with a mod label (not a term label), but used as a term label, a hydrogen is added automatically. Previously, this would just raise an exception because a term label would have to be present in aa_comp.
  3. Also (not directly related) add direct support for specifying a terminal group by its formula. This worked for fast_mass2, but now also works for Composition and calculate_mass.

Potential concerns

Implicitly adding a hydrogen does not make sense for all modifications on Unimod (not all of them are even modifications). But then, it probably doesn't make sense to specify them as terminal, either.

Can we shoot ourselves in the foot by applying this implicit correction?

mobiusklein commented 2 weeks ago

There was a similar discussion with ProForma recently. The N- and/or C-terminal losses are baked into many, if not all, of the terminal modifications in Unimod, but not into all modification definitions. This change appears to apply to the N-terminus, where you have -H. Does the C-terminal -OH need to be handled as well?

Since this doesn't alter behavior for programs that worked prior to the change barring abstract kwargs propagation, it adds new behavior, which isn't too dangerous. A little bit of testing suggests that terminal formulae won't have issues with trailing or leading - symbols either.

I think there should be a warnings.warn call when the implicit correction is applied, which should tell the user that their input is being altered so they know to specify the modification correctly in the future, and they can use the warnings filtering tools if they decide they need that auto-correction and don't want to see the warning anymore.
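A minimal sketch of the warn-and-filter pattern suggested here, using only the standard library. `TerminalGroupWarning` and `as_terminal_group` are hypothetical names made up for illustration; they are not part of Pyteomics, and the actual change may be structured differently.

```python
import warnings
from collections import Counter

class TerminalGroupWarning(UserWarning):
    """Hypothetical warning category; not part of Pyteomics."""

def as_terminal_group(mod_delta):
    """Convert a side-chain style delta into a full terminal group by adding one H."""
    warnings.warn(
        "Modification label used as a terminal group; adding one hydrogen. "
        "Define the terminal group by its full composition to avoid this warning.",
        TerminalGroupWarning,
        stacklevel=2,
    )
    return mod_delta + Counter({"H": 1})

acetyl_delta = Counter({"C": 2, "H": 2, "O": 1})

# Default behaviour: the user is told that the input was altered
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    group = as_terminal_group(acetyl_delta)

# Users who want the auto-correction can silence this one category
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("ignore", TerminalGroupWarning)
    as_terminal_group(acetyl_delta)

print(len(caught), len(silenced))
```

Using a dedicated warning category lets users filter this message without hiding unrelated warnings.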

levitsky commented 2 weeks ago

Thank you for chipping in @mobiusklein!

This change appears to apply to the N-terminus, where you have -H, does the C terminal -OH need to be handled as well?

If my understanding is correct, this would depend on the Unimod logic regarding this modification's composition and mass, not on where exactly in the sequence the user is applying the modification. Conceivably, if a modification is strictly C-terminal, it would have "OH" subtracted rather than "H". If it's annotated as a side chain mod and the user applies it as C-terminal, though, the correction to apply is still "H". Does that make sense? If it does, I should look at Unimod and try to understand if some mods there are C-terminal and require "OH" correction. We would not have access to this metadata in Composition constructor anyway, so we could just guess based on where the mod is applied, but that makes this whole idea way riskier.

levitsky commented 1 week ago

After trying to look for C-terminal mods in Unimod, I have not found examples with -OH subtracted (which probably only means my search was weird), but I have seen enough evidence that my generalization may not be useful. As a matter of fact, applying just about anything from Unimod as a "terminal group" instead of just a regular mod on a terminal residue is risky, and there is little we can do to fix it, other than change how the composition calculations work (always add H- and -OH on top). A warning is definitely justified when trying to use normal mod labels in a terminal context, or in fact we could just as well raise an exception. Neither of the two would have helped with the OP's issue, though, as they specifically assigned the composition for a terminal acetyl group to be that listed in Unimod, and intercepting that would be tricky.

levitsky commented 1 day ago

I rolled back item 2 in the proposed solution. Using a mod label as a terminal group now raises a PyteomicsError. The exception will later include a URL to the notice in the docs about the difference between terminal groups and mod labels (after the updated doc is deployed).
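The resulting behaviour (known terminal label, mod label rejected, bare formula accepted) could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: `resolve_terminal_group`, `parse_formula`, and `TerminalGroupError` are hypothetical stand-ins (the real exception is PyteomicsError), and the toy `aa_comp` dict only mimics the shape of the real one.

```python
import re
from collections import Counter

class TerminalGroupError(ValueError):
    """Stand-in for PyteomicsError so the sketch has no dependencies."""

def parse_formula(formula):
    """Parse a simple formula such as 'C2H3O' into element counts."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    if not tokens or "".join(el + n for el, n in tokens) != formula:
        raise ValueError(f"not a chemical formula: {formula!r}")
    comp = Counter()
    for element, count in tokens:
        comp[element] += int(count) if count else 1
    return comp

def resolve_terminal_group(label, aa_comp):
    """Return the composition for a terminal label such as 'H-' or 'C2H3O-'."""
    if label in aa_comp:              # a properly defined terminal group
        return aa_comp[label]
    bare = label.strip("-")
    if bare in aa_comp:               # a mod label misused as a terminal group
        raise TerminalGroupError(
            f"{bare!r} is a modification label, not a terminal group; "
            "terminal groups need the full group composition"
        )
    return parse_formula(bare)        # otherwise treat the label as a formula

# Toy composition table (assumption): two terminal groups and one mod label
aa_comp = {
    "H-": Counter({"H": 1}),
    "-OH": Counter({"O": 1, "H": 1}),
    "ac": Counter({"C": 2, "H": 2, "O": 1}),
}

print(resolve_terminal_group("C2H3O-", aa_comp))
```

Raising instead of silently correcting keeps the calculation honest when the label's intended meaning cannot be inferred.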

Also, @mobiusklein, apparently numpy 2.0 is now released and pynumpress doesn't import with it. Should it be addressed in pynumpress?

mobiusklein commented 1 day ago

I'll fix pynumpress. I'm guessing all the libraries that depend upon it at build/runtime are going to break similarly.