Open mattwthompson opened 7 months ago
An update from a discussion yesterday between @j-wags and I:
We agree that the current behavior is not-very-good, and also (with different weighting) that changing the current behavior will unfortunately be costly.
Changing it to mapped SMILES significantly lessens the probability of a hash collision, but not quite to zero. This has the downside of a behavior change - though argue this is a pure improvement and such a behavior change is justified. There are still some corner cases that could cause hash collisions, since i.e. conformers and other properties are not considered in this hash.
There's already some existing machinery for what one could accurately enough describe as a hash. See ordered_connection_table_hash
for something that is quick, doesn't rely on an RDKit/OpenEye call, and should do a pretty good job of discerning molecules with relevant differences.
Notably in this case, overriding __eq__
and __hash__
are different behaviors, whereas in some other classes it is not. The ==
operator (are these two Molecule
objects "the same"?) is a good place for magic domain-specific expertise, and this is why Molecule.__eq__
is effectively an isomorphism check with a couple of custom settings. There's a bit of a trade-off between being too specific and too loose with equality criteria, and there's not really a theoretically optimal choice here. Hashing asks a similar question, one that I'm not sure how to put in English, but one that is harmed by collisions (two objects with subtle differences producing the same hash) but nearly unaffected by misses (two objects with subtle but not particularly meaningful differences causing extra entries in a hash table). This is why I'm pushing for hashing behavior to change independent of the __eq__
method - a more aggressive hashing function would have prevented the original issue with negligible downside.
Chagning it to def __hash__(self): raise NotImplementedError
has the benefit of not having problematic collisions, but probably comes with some downsides of breaking unknown things. Python internals (like lru_cache
) rely heavily on hashing objects, and I suspect there would be numerous unintended consequences.
We didn't reach any conclusions, yet, and will need to draw on some more use cases to better understand the tradeoffs. I doubt too many people are calling hash(my_molecule)
directly, but maybe something is using molecules as keys or doing things that have set[Molecule]
s as intermediate states.
Bumping this - I had a use case in which I needed to hash molecules (since indexing into a large topology is slow). Directly hashing the molecule was slow because of the conversion step, but Molecule.ordered_connection_table_hash()
works pretty well.
Is your feature request related to a problem? Please describe.
Molecule.__hash__
uses (non-mapped) SMILES, which can cause hash collisions when otherwise identical molecules use different atom indices. This is almost never used, but some Python internals and downstream packages leverage objects' hashes to define uniqueness.Here's a contrived example that demonstrates how
@functools.lru_cache
can misfire due to this hashing:Describe the solution you'd like
Simply use mapped SMILES.
Describe alternatives you've considered
My band-aid was to also pass a molecule's mapped SMILES to the function that's wrapped by
lru_cache
; my aim here is pretty much to make that necessary.Additional context
A more involved case of this pathology causing Interchange charge assignment caching to map charges to incorrect atoms in some cases, causing critical failures in some fitting pipelines.
522 might be related but I don't think they're completely aligned