levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

[proforma] Make modification object hashable #62

Closed RalfG closed 2 years ago

RalfG commented 2 years ago

Hi @levitsky and @mobiusklein,

I have run into a few cases where it would be useful if proforma modification objects were hashable, for instance when compiling a set of unique modifications.

Am I correctly assuming that the only thing required to achieve this, is to add __hash__ and __eq__ methods to the ModificationBase class? Would hashing the modifications' id and provider be sufficient to identify a modification? If so, would like to open a PR for this.

Looking forward to hear your thoughts on this!

mobiusklein commented 2 years ago

Hello,

It might need to be a bit more complicated. ModificationBase is a subclass of TagBase, so it can have things like group_id or extra set, which need to be checked. [Oxidation#real], [Oxidation#nodeal], and [Oxidation#loc(0.68)] are currently different to __eq__, but for your use-case, they are the same.

Would a token-like object be acceptable, e.g. an object which hashes and equals the way you suggested, and can be used to produce a brand-new untainted copy of the ModificationBase instance? Then you could do something like {modification.key for modifications in ...} instead?

There is also a design that works as you describe, but it'd require more work scattered throughout all the places where TagBase instances interact with each other.

While working out just how different they are, I found a few bugs in the proforma module. I'll submit a PR, or can share the patch when its ready to work off as a base.

levitsky commented 2 years ago

Thank you so much @mobiusklein! @RalfG, how do you feel about #63, does it work for you?

RalfG commented 2 years ago

Works perfectly! Thanks @mobiusklein!

levitsky commented 2 years ago

Thank you. Feel free to reopen if anything comes up.

RalfG commented 8 months ago

Hi @mobiusklein,

Coming back to this issue, as the changes in #63 do not influence MassModification. Could that class be made hashable as well, perhaps simply by hasing its value (the mass shift)?

Best, Ralf

mobiusklein commented 8 months ago

I can do that, but hashing a floating point value is usually a bad idea unless that floating point value is a constant or derived from operations on constants. Or written another way:

import numpy as np

x = 57.08
print('-', x, hash(x))
for i in range(4, 19):
      y = x + 10.0 ** -i
      print(i, y, hash(y))
- 57.08 184467440737091641
4 57.0801 184698025038020665
5 57.08001 184490499167191097
6 57.080000999999996 184469746580095033
7 57.0800001 184467671321395257
8 57.08000001 184467463795523641
9 57.080000000999995 184467443042926649
10 57.0800000001 184467440967680057
11 57.080000000009996 184467440760143929
12 57.080000000001 184467440739401785
13 57.0800000000001 184467440737321017
14 57.080000000000005 184467440737108025
15 57.08 184467440737091641
16 57.08 184467440737091641
17 57.08 184467440737091641
18 57.08 184467440737091641

I'm afraid there's no "safe" way to do this without allowing two values which are the same for all intents and purposes but bit-level parity.

Even so, I tried to mitigate this this by deriving a "significant figures" to round things to, but you may get different results if you initialize the MassModification from a string or a float. I'll open a PR tomorrow.