levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

Implement `charge_carrier` and `absolute` arguments for m/z calculation #61

Closed levitsky closed 2 years ago

levitsky commented 2 years ago

Closes #60. This is a suggested implementation of changes needed to support the discussed extensions. absolute is a new kwarg supported by Composition.mass() and calculate_mass transparently (this one is easy).

New charge_carrier argument, as well as charge, are now handled differently than charge used to be handled. They can stick to the Composition object or be passed to Composition.mass() (calculate_mass implements the second route). This creates quite a number of edge cases where checks and exceptions are needed; however, it allows reusing the same composition to calculate m/z of different charge states and/or with different charge carriers more easily.

Objections and critique are welcome. Is this approach (keeping charge and charge carrier as attributes and accepting them as kwargs for mass calculation) too complex and error-prone, or is it worth the convenience?

I will be following up with more test coverage for new behavior, but a quick demo for now:

In [4]: mass.calculate_mass(formula = 'C6H12O6', charge = -1, charge_carrier='Cl-')
Out[4]: -215.0322407822

In [5]: mass.calculate_mass(formula = 'C6H12O6', charge = -1, charge_carrier='Cl-', absolute=True)
Out[5]: 215.0322407822

In [6]: mass.calculate_mass(formula = 'C6H12O6', charge = -1, charge_carrier='HCOO-', absolute=True)
Out[6]: 225.06104237339002
levitsky commented 2 years ago

The problem seems to come from the fact that currently some of the ionization information is handled in Composition.__init__ (ion_type and charge), while some is also handled in Composition.mass (again, charge). Accepting ion_type in Composition.mass() is documented but apparently doesn't work.

With the implementation of charge_carrier the charge starts to affect the atomic composition in a more non-trivial way, which, to me, makes the case for doing it in Composition.__init__().

Here are the logical alternatives I can think of:

  1. Say that all Composition objects are "neutral", handle ions in mass() only. That's simple enough but affects the current __init__ signature because we would have to remove both charge and ion_type/ion_comp from accepted arguments.
  2. Say that charge and ion type are handled only during object creation and remove the corresponding arguments from mass(). This has the downside that we would have to create a nearly identical Composition object in order to calculate m/z for a different charge state.
  3. Try and keep things as flexible as possible, which this PR aims to do. Say that Composition can be charged or neutral. If charge is specified on creation, raise an error when attempting to calculate m/z with a different charge. If charge is not specified, calculate m/z for any charge. The downside here is that a lot of changes are needed: copy() and __reduce__() must retain charge information. Also, the fact that mass() without charge argument will return m/z for charged Composition might be unexpected and lead to misunderstanding.

I think I am more in favor of the third option, but I would appreciate input. @mobiusklein @mgleeming

P.S. This PR is not ready yet, at least one thing I know is that Composition does not preserve charge on pickling. P.P.S. All these variations don't really affect the behavior of calculate_mass() because it will encapsulate and mask the differences between the options.

mobiusklein commented 2 years ago

I'm inclined to say that everything is kept simpler by keeping charge and the charge carrier independent of the Composition type, as in your first proposal. As discussed in the issue, you can avoid complicating the infrastructure by instead treating the charge carrier as neutral (add or subtract however many charges' worth of protons to it) and then use the mass of a proton as the charge carrier and get the right answer. Also, less book keeping in all the copy/state saving methods.

This current approach also can't handle when you have a mix of charge carriers, e.g. two protons and a sodium cation, or worse, an iron cation adduct which may induce a +2 or +3 and a commensurate different change in mass and m/z (obligatory adduct calculator link: https://fiehnlab.ucdavis.edu/staff/kind/metabolomics/ms-adduct-calculator/). I've got a bunch in a single spectrum on the same molecule (4+ glycopeptide with four protons, three protons and ammonium, three protons and sodium, three protons and potassium, and an unknown number of protons and iron):

image

Does std_ion_comp have anything to do with charge? All the values are neutral losses, except perhaps the fancy c and z ions, but my lack of knowledge of the ExD intermediate radicals and the dissociation mechanism makes it hard for me to say. I think they're called "ion" compositions because we don't regularly refer to "b" and "y" fragments outside of their charged forms.

Currently, the Composition.mass implementation has to do extra book keeping when you pass it a charge and the composition already has H+ in it, which is extra work you may not want anyway. However, if you specify ion_comp and ion_type in the Composition.__init__, there is no "undo and overwrite" behavior in Composition.mass. I don't think we need to do away with these, but instead add an explicit charge_carrier param to Composition.mass, and have it determine:

  1. Do I have a charge and a charge carrier?
    • Do I already have an H+ count?
      • Yes: Throw an error or try to figure out how to fix
      • No: Directly compute correct m/z from neutral composition mass plus charge times charge carrier mass divided by abs(charge)
  2. Do I have an explicit charge carrier but no charge? If so, throw an error or assume singly charged
  3. Do I have an explicit charge but no charge carrier? If so, assume charge carrier is a proton and assume that the composition is neutral or already contains an H+ count and neutralize it before calculating m/z, current behavior but with abs(charge) in denominator instead of plain charge
  4. Do I have no charge related parameters? Do current behavior but with abs(charge) in denominator instead of plain charge.

If you're worried about breaking code that sign-fixes after the fact, I'd add the absolute parameter, and check whenever charge is negative and emit a warning about this behavior changing in a future version.

levitsky commented 2 years ago

Thank you for the input, @mobiusklein. You're most probably right that I don't have to tie ion_comp into this. As for mixing the charge carriers, I feel like neither implementation can easily support it and it's better to let the user deal with this.

The way I imagine this first approach is that we have a neutral Composition (with its neutral losses) and then we only add or subtract the charge carrier composition in mass(). Is that what you have in mind?

I'm not sure I understand what you mean by "add or subtract however many charges' worth of protons to it", "it" being the charge carrier, "and then use the mass of a proton as the charge carrier and get the right answer".

mobiusklein commented 2 years ago

Yes, I do think mass should be the only place we worry about charge explicitly. Until we work with a molecular graph, we don't need to deal with charge anywhere else.

As for the "adding or subtracting protons", suppose we've got a charge carrier like NH3H+, ammonium. It's ammonia (NH3) plus a proton (H+), alternatively written NH4+ but the parser doesn't understand that. If we want to compute the m/z of Alanine + NH4+, we start with:

# mass of alanine, C3H5N1O1
mass = 71.03711378470999
z = 1
# mass of ammonium, NH3H+
charge_carrier = 18.033825567780003
mz = (mass + z * charge_carrier) / abs(z)
print(mz)

mz is 89.07093935248999

Alternatively, we can get the same answer by moving the "uncharged" portion of the charge carrier into the base mass

# mass of alanine plus ammonia, ammonium - H+
mass = 71.03711378470999 + 17.02654910101
z = 1
# mass of a proton, H+
charge_carrier = 1.00727646677
mz = (mass + z * charge_carrier) / abs(z)
print(mz)

Again, mz 89.07093935248999.

The math gets a little more complicated when a single charge carrier carries multiple charges, but that's not the norm.

levitsky commented 2 years ago

I'm not sure I understand the benefit of this approach. What does this "neutral charged" composition represent? It seems to me that by subtracting hydrogens we get a composition that does not correspond to a real molecule. Since Composition can be used for things other than mass calculation, I expected that we'd want it to represent a specific substance/peptide in either charged or neutral form. In the neutral case, I thought we'd just build a composition without any knowledge of the charge carrier, then in mass() we'd get the charge and charge carrier information and apply them. What am I missing?

mobiusklein commented 2 years ago

I think I was trying to do something different from what you intended since I prefer to work entirely with neutral mass. Neutralizing the charge carrier is a work-around to allow you to use the simple formula for m/z, which makes fewer changes when going from neutral mass to composition than going from composition to m/z.

If you were to create a composition that is explicitly charged, whereby you can enumerate the number of charges and guarantee that the charge carriers are already factored into the formula, you could just do m / z without needing to add the mass of z protons before the division, and then it would avoid the problem of heterogenous charge carriers, at the cost of requiring greater specificity. This is difficult because it requires knowing if an element is contributing one or more charges while computing mass, and might make querying compositions trickier because "Na" in comp now needs to check for "Na", "Na+", "Na-", and worse for charge carriers like Ca which induce +2, or Fe which can induce +2 or +3. I think this overcomplicates the behaviors Composition is supposed to have beyond mass calculation, and for limited gain.

Your current implementation can handle non-H+ charge carriers correctly, but it assumes that all charges are from the same source. If you instead specified the charge carrier as a formula with a specific number of + or - in it, you would be able to circumvent this issue, but it means the charge isn't specified as an integer. Alternatively, you can assume that the charge is specified as either an integer and a homogenous charge carrier, or a charge carrier formula specifying the total charge, you could then support both use-cases. This would be by just omitting the multiplication step when setting the number of charge carriers if the charge carrier fully specifies the multiplicity of the charge. What neither of these support is changing the charge or the charge carrier after the fact without going through some other route than __setitem__.

Both the new implementation and the one I just described are more cumbersome, and require you to do extra work during all state transfers (copy, pickle), and lose benefits of inheriting those implementations directly from dict.

levitsky commented 2 years ago

Rewritten in accordance with the discussion above. Charge-related parameters are only accepted in mass(). Charge carrier can be specified as a formula or composition. The group's own charge can be specified separately or inside the string spec. The implementation only works when all charges are homogeneous (no different charge carriers), so the total charge must be a multiple of the adduct's charge.

P.S. The test failure seems unrelated and comes from usi.

mobiusklein commented 2 years ago

It looks like www.peptideatlas.org has either dropped that dataset or generally is re-arranging its website. Changing the backend seems to resolve the issue.

diff --git a/tests/test_usi.py b/tests/test_usi.py
index c3a8391..367e4ed 100644
--- a/tests/test_usi.py
+++ b/tests/test_usi.py
@@ -25,7 +25,7 @@ class USITest(unittest.TestCase):
 class PROXITest(unittest.TestCase):
     def test_request(self):
         usi_str = "mzspec:MSV000085202:210320_SARS_CoV_2_T:scan:131256"
-        response = proxi(usi_str, backend='peptide_atlas')
+        response = proxi(usi_str, backend='massive')

         assert set(usi_proxi_data.keys()) <= set(response.keys())