add LRU cache to structure matcher

materialsproject / pymatgen

Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.

https://pymatgen.org

Other

1.49k stars 857 forks source link

add LRU cache to structure matcher #4036

Closed kbuma closed 3 weeks ago

kbuma commented 3 weeks ago

Summary

This adds an LRU cache to the structure matcher, specifically to the StructureMatcher._get_reduced_structure function.

This suggested improvement is motivated by the performance of the AflowPrototypeMatcher._match_prototype function (and in turn the robocrys builder that uses it). We believe this will speed up that calculation by ~288x since the sometimes expensive calculation in StructureMatcher._get_reduced_structure will only need to be performed once per structure rather than the 288 times it is performed now.

DanielYang59 commented 3 weeks ago

Beautiful! That's an impressive performance gain!

I guess we should also be aware of the potential memory leakage issue for lru_cache when applied on methods (I believe a reference to the cls would also be held in the case of class methods, but I could be wrong here):

Using the functools.lru_cache and functools.cache decorators on methods can lead to memory leaks, as the global cache will retain a reference to the instance, preventing it from being garbage collected.

Also here is a more detailed explanation from Anthony Sottile .

I guess we could just make it a staticmethod to avoid this unless there's some reason to make it classmethod that I'm not aware of?

shyuep commented 3 weeks ago

I would move the method out of the class and just make it a method. This will resolve the memory leak issue (though I don't think it is a significant problem in our case).

shyuep commented 3 weeks ago

I would be very wary of the current hash function. That looks like an extremely expensive hash function. Furthermore, depending on transformations, this hash violates some basic rules on hashes. I would argue it is better to use the formula as a hash.

esoteric-ephemera commented 3 weeks ago

@shyuep, it could be expensive but my concern about using the formula is uniqueness - thinking about polymorphs, displacements, different magnetic configs

For transformations, shouldn't the hash change to reflect that the underlying object has changed? Maybe this is my misunderstanding of how hashes should apply to mutables

My thinking is: if I run a transformation on a structure to displace a site, the original hash should no longer apply to the transformed structure, right?

shyuep commented 3 weeks ago

@esoteric-ephemera The rule for hashes is that they do not have to be unique. Ultimately, the __eq__ function is what must be unique for different objects. A valid but bad hash function would be return 42. This is a common implementation in simple "Hello World" type tutorials for many languages.

See https://www.geeksforgeeks.org/what-are-hash-functions-and-how-to-choose-a-good-hash-function

Efficiency and minimizing collisions is the key. But collisions need not be 0 (that is what __eq__ is for). For a structure, the formula is a pretty good hash function - equal structures definitely have the same formula, and formula is cheap to compute. There will be collisions but they should be relatively minimal (polymorphs basically).

kbuma commented 3 weeks ago

@shyuep @esoteric-ephemera An alternative to adding a hash function to Structure that would work for this use case is to create an IStructure copy in the class method and call the new cached function with that copy. It would be something like this:

changing the class method to

@classmethod
def _get_reduced_structure(cls, struct: Structure, primitive_cell: bool = True, niggli: bool = True) -> Structure:
    immutable_struct = IStructure.from_dict(struct.as_dict())
    return _get_reduced_structure(immutable_struct, primitive_cell, niggli)

and the new function to

@lru_cache(maxsize=LRU_CACHE_SIZE)
def _get_reduced_structure(
  struct: IStructure, primitive_cell: bool, niggli: bool
) -> Structure:
  """Helper method to find a reduced structure."""
  if niggli:
      struct = struct.get_reduced_structure(reduction_algo="niggli")
  if primitive_cell:
      struct = struct.get_primitive_structure()
  return Structure.from_dict(struct.as_dict())

esoteric-ephemera commented 3 weeks ago

I think we have a good solution now that doesn't require hashing Structure, and the failing ruff stuff is unrelated to this PR -should be ready for review

shyuep commented 3 weeks ago

Merged. Thanks.

esoteric-ephemera commented 3 weeks ago

Could we push a release relatively soon (or a release candidate)? Would help with builder development on our side