Closed levitsky closed 2 years ago
The problem seems to come from the fact that currently some of the ionization information is handled in Composition.__init__
(ion_type
and charge
), while some is also handled in Composition.mass
(again, charge
). Accepting ion_type
in Composition.mass()
is documented but apparently doesn't work.
With the implementation of charge_carrier
the charge starts to affect the atomic composition in a more non-trivial way, which, to me, makes the case for doing it in Composition.__init__()
.
Here are the logical alternatives I can think of:
Composition
objects are "neutral", handle ions in mass()
only. That's simple enough but affects the current __init__
signature because we would have to remove both charge
and ion_type
/ion_comp
from accepted arguments.mass()
. This has the downside that we would have to create a nearly identical Composition object in order to calculate m/z for a different charge state.copy()
and __reduce__()
must retain charge information. Also, the fact that mass()
without charge argument will return m/z for charged Composition
might be unexpected and lead to misunderstanding.I think I am more in favor of the third option, but I would appreciate input. @mobiusklein @mgleeming
P.S. This PR is not ready yet, at least one thing I know is that Composition
does not preserve charge
on pickling.
P.P.S. All these variations don't really affect the behavior of calculate_mass()
because it will encapsulate and mask the differences between the options.
I'm inclined to say that everything is kept simpler by keeping charge and the charge carrier independent of the Composition
type, as in your first proposal. As discussed in the issue, you can avoid complicating the infrastructure by instead treating the charge carrier as neutral (add or subtract however many charges' worth of protons to it) and then use the mass of a proton as the charge carrier and get the right answer. Also, less book keeping in all the copy/state saving methods.
This current approach also can't handle when you have a mix of charge carriers, e.g. two protons and a sodium cation, or worse, an iron cation adduct which may induce a +2 or +3 and a commensurate different change in mass and m/z (obligatory adduct calculator link: https://fiehnlab.ucdavis.edu/staff/kind/metabolomics/ms-adduct-calculator/). I've got a bunch in a single spectrum on the same molecule (4+ glycopeptide with four protons, three protons and ammonium, three protons and sodium, three protons and potassium, and an unknown number of protons and iron):
Does std_ion_comp
have anything to do with charge? All the values are neutral losses, except perhaps the fancy c and z ions, but my lack of knowledge of the ExD intermediate radicals and the dissociation mechanism makes it hard for me to say. I think they're called "ion" compositions because we don't regularly refer to "b" and "y" fragments outside of their charged forms.
Currently, the Composition.mass
implementation has to do extra book keeping when you pass it a charge and the composition already has H+
in it, which is extra work you may not want anyway. However, if you specify ion_comp
and ion_type
in the Composition.__init__
, there is no "undo and overwrite" behavior in Composition.mass
. I don't think we need to do away with these, but instead add an explicit charge_carrier
param to Composition.mass
, and have it determine:
H+
count and neutralize it before calculating m/z, current behavior but with abs(charge) in denominator instead of plain chargeIf you're worried about breaking code that sign-fixes after the fact, I'd add the absolute
parameter, and check whenever charge is negative and emit a warning about this behavior changing in a future version.
Thank you for the input, @mobiusklein. You're most probably right that I don't have to tie ion_comp
into this. As for mixing the charge carriers, I feel like neither implementation can easily support it and it's better to let the user deal with this.
The way I imagine this first approach is that we have a neutral Composition
(with its neutral losses) and then we only add or subtract the charge carrier composition in mass()
. Is that what you have in mind?
I'm not sure I understand what you mean by "add or subtract however many charges' worth of protons to it", "it" being the charge carrier, "and then use the mass of a proton as the charge carrier and get the right answer".
Yes, I do think mass
should be the only place we worry about charge explicitly. Until we work with a molecular graph, we don't need to deal with charge anywhere else.
As for the "adding or subtracting protons", suppose we've got a charge carrier like NH3H+, ammonium. It's ammonia (NH3) plus a proton (H+), alternatively written NH4+ but the parser doesn't understand that. If we want to compute the m/z of Alanine + NH4+, we start with:
# mass of alanine, C3H5N1O1
mass = 71.03711378470999
z = 1
# mass of ammonium, NH3H+
charge_carrier = 18.033825567780003
mz = (mass + z * charge_carrier) / abs(z)
print(mz)
mz is 89.07093935248999
Alternatively, we can get the same answer by moving the "uncharged" portion of the charge carrier into the base mass
# mass of alanine plus ammonia, ammonium - H+
mass = 71.03711378470999 + 17.02654910101
z = 1
# mass of a proton, H+
charge_carrier = 1.00727646677
mz = (mass + z * charge_carrier) / abs(z)
print(mz)
Again, mz 89.07093935248999.
The math gets a little more complicated when a single charge carrier carries multiple charges, but that's not the norm.
I'm not sure I understand the benefit of this approach. What does this "neutral charged" composition represent?
It seems to me that by subtracting hydrogens we get a composition that does not correspond to a real molecule.
Since Composition
can be used for things other than mass calculation, I expected that we'd want it to represent a specific substance/peptide in either charged or neutral form. In the neutral case, I thought we'd just build a composition without any knowledge of the charge carrier, then in mass()
we'd get the charge and charge carrier information and apply them. What am I missing?
I think I was trying to do something different from what you intended since I prefer to work entirely with neutral mass. Neutralizing the charge carrier is a work-around to allow you to use the simple formula for m/z, which makes fewer changes when going from neutral mass to composition than going from composition to m/z.
If you were to create a composition that is explicitly charged, whereby you can enumerate the number of charges and guarantee that the charge carriers are already factored into the formula, you could just do m / z
without needing to add the mass of z
protons before the division, and then it would avoid the problem of heterogenous charge carriers, at the cost of requiring greater specificity. This is difficult because it requires knowing if an element is contributing one or more charges while computing mass, and might make querying compositions trickier because "Na" in comp
now needs to check for "Na", "Na+", "Na-"
, and worse for charge carriers like Ca
which induce +2, or Fe
which can induce +2 or +3. I think this overcomplicates the behaviors Composition
is supposed to have beyond mass calculation, and for limited gain.
Your current implementation can handle non-H+ charge carriers correctly, but it assumes that all charges are from the same source. If you instead specified the charge carrier as a formula with a specific number of + or - in it, you would be able to circumvent this issue, but it means the charge isn't specified as an integer. Alternatively, you can assume that the charge is specified as either an integer and a homogenous charge carrier, or a charge carrier formula specifying the total charge, you could then support both use-cases. This would be by just omitting the multiplication step when setting the number of charge carriers if the charge carrier fully specifies the multiplicity of the charge. What neither of these support is changing the charge or the charge carrier after the fact without going through some other route than __setitem__
.
Both the new implementation and the one I just described are more cumbersome, and require you to do extra work during all state transfers (copy, pickle), and lose benefits of inheriting those implementations directly from dict
.
Rewritten in accordance with the discussion above. Charge-related parameters are only accepted in mass()
. Charge carrier can be specified as a formula or composition. The group's own charge can be specified separately or inside the string spec. The implementation only works when all charges are homogeneous (no different charge carriers), so the total charge must be a multiple of the adduct's charge.
P.S. The test failure seems unrelated and comes from usi
.
It looks like www.peptideatlas.org has either dropped that dataset or generally is re-arranging its website. Changing the backend seems to resolve the issue.
diff --git a/tests/test_usi.py b/tests/test_usi.py
index c3a8391..367e4ed 100644
--- a/tests/test_usi.py
+++ b/tests/test_usi.py
@@ -25,7 +25,7 @@ class USITest(unittest.TestCase):
class PROXITest(unittest.TestCase):
def test_request(self):
usi_str = "mzspec:MSV000085202:210320_SARS_CoV_2_T:scan:131256"
- response = proxi(usi_str, backend='peptide_atlas')
+ response = proxi(usi_str, backend='massive')
assert set(usi_proxi_data.keys()) <= set(response.keys())
Closes #60. This is a suggested implementation of changes needed to support the discussed extensions.
absolute
is a new kwarg supported byComposition.mass()
andcalculate_mass
transparently (this one is easy).New
charge_carrier
argument, as well ascharge
, are now handled differently thancharge
used to be handled. They can stick to theComposition
object or be passed toComposition.mass()
(calculate_mass
implements the second route). This creates quite a number of edge cases where checks and exceptions are needed; however, it allows reusing the same composition to calculate m/z of different charge states and/or with different charge carriers more easily.Objections and critique are welcome. Is this approach (keeping charge and charge carrier as attributes and accepting them as kwargs for mass calculation) too complex and error-prone, or is it worth the convenience?
I will be following up with more test coverage for new behavior, but a quick demo for now: