mattwthompson commented 7 months ago

Is your feature request related to a problem? Please describe.

Molecule.__hash__ uses (non-mapped) SMILES, which can cause hash collisions when otherwise identical molecules use different atom indices. This is almost never used, but some Python internals and downstream packages leverage objects' hashes to define uniqueness.

In [1]: from openff.toolkit import Molecule, __version__

In [2]: __version__
Out[2]: '0.14.3'

In [3]: {
   ...:     Molecule.from_mapped_smiles("[H:1][Cl:2]"): "forward",
   ...:     Molecule.from_mapped_smiles("[H:2][Cl:1]"): "backwards",
   ...: }
Out[3]: {Molecule with name '' and SMILES '[H]Cl': 'backwards'}

Here's a contrived example that demonstrates how @functools.lru_cache can misfire due to this hashing:

In [4]: import functools
   ...:
   ...:
   ...: @functools.lru_cache(None)
   ...: def first_atom_atomic_number(molecule: Molecule) -> int:
   ...:     return molecule.atom(0).atomic_number

In [5]: first_atom_atomic_number(Molecule.from_mapped_smiles("[H:1][Cl:2]"))
Out[5]: 1

In [6]: first_atom_atomic_number(Molecule.from_mapped_smiles("[H:2][Cl:1]"))
Out[6]: 1

In [7]: Molecule.from_mapped_smiles("[H:2][Cl:1]").atom(0).atomic_number
Out[7]: 17

Describe the solution you'd like

Simply use mapped SMILES.

Describe alternatives you've considered

My band-aid was to also pass a molecule's mapped SMILES to the function that's wrapped by lru_cache; my aim here is pretty much to make that necessary.

Additional context

A more involved case of this pathology causing Interchange charge assignment caching to map charges to incorrect atoms in some cases, causing critical failures in some fitting pipelines.

522 might be related but I don't think they're completely aligned

mattwthompson commented 7 months ago

An update from a discussion yesterday between @j-wags and I:

We agree that the current behavior is not-very-good, and also (with different weighting) that changing the current behavior will unfortunately be costly.

Changing it to mapped SMILES significantly lessens the probability of a hash collision, but not quite to zero. This has the downside of a behavior change - though argue this is a pure improvement and such a behavior change is justified. There are still some corner cases that could cause hash collisions, since i.e. conformers and other properties are not considered in this hash.

There's already some existing machinery for what one could accurately enough describe as a hash. See ordered_connection_table_hash for something that is quick, doesn't rely on an RDKit/OpenEye call, and should do a pretty good job of discerning molecules with relevant differences.

Notably in this case, overriding __eq__ and __hash__ are different behaviors, whereas in some other classes it is not. The == operator (are these two Molecule objects "the same"?) is a good place for ~~magic~~ domain-specific expertise, and this is why Molecule.__eq__ is effectively an isomorphism check with a couple of custom settings. There's a bit of a trade-off between being too specific and too loose with equality criteria, and there's not really a theoretically optimal choice here. Hashing asks a similar question, one that I'm not sure how to put in English, but one that is harmed by collisions (two objects with subtle differences producing the same hash) but nearly unaffected by misses (two objects with subtle but not particularly meaningful differences causing extra entries in a hash table). This is why I'm pushing for hashing behavior to change independent of the __eq__ method - a more aggressive hashing function would have prevented the original issue with negligible downside.

Chagning it to def __hash__(self): raise NotImplementedError has the benefit of not having problematic collisions, but probably comes with some downsides of breaking unknown things. Python internals (like lru_cache) rely heavily on hashing objects, and I suspect there would be numerous unintended consequences.

We didn't reach any conclusions, yet, and will need to draw on some more use cases to better understand the tradeoffs. I doubt too many people are calling hash(my_molecule) directly, but maybe something is using molecules as keys or doing things that have set[Molecule]s as intermediate states.

mattwthompson commented 4 months ago

Bumping this - I had a use case in which I needed to hash molecules (since indexing into a large topology is slow). Directly hashing the molecule was slow because of the conversion step, but Molecule.ordered_connection_table_hash() works pretty well.

openforcefield / openff-toolkit

Current `Molecule` hashing is dangerous #1772

522 might be related but I don't think they're completely aligned