Closed kbuma closed 3 weeks ago
Beautiful! That's an impressive performance gain!
I guess we should also be aware of the potential memory leakage issue for lru_cache
when applied on methods (I believe a reference to the cls
would also be held in the case of class methods, but I could be wrong here):
Using the functools.lru_cache and functools.cache decorators on methods can lead to memory leaks, as the global cache will retain a reference to the instance, preventing it from being garbage collected.
Also here is a more detailed explanation from Anthony Sottile .
I guess we could just make it a staticmethod
to avoid this unless there's some reason to make it classmethod
that I'm not aware of?
I would move the method out of the class and just make it a method. This will resolve the memory leak issue (though I don't think it is a significant problem in our case).
I would be very wary of the current hash function. That looks like an extremely expensive hash function. Furthermore, depending on transformations, this hash violates some basic rules on hashes. I would argue it is better to use the formula as a hash.
@shyuep, it could be expensive but my concern about using the formula is uniqueness - thinking about polymorphs, displacements, different magnetic configs
For transformations, shouldn't the hash change to reflect that the underlying object has changed? Maybe this is my misunderstanding of how hashes should apply to mutables
My thinking is: if I run a transformation on a structure to displace a site, the original hash should no longer apply to the transformed structure, right?
@esoteric-ephemera The rule for hashes is that they do not have to be unique. Ultimately, the __eq__
function is what must be unique for different objects. A valid but bad hash function would be return 42
. This is a common implementation in simple "Hello World" type tutorials for many languages.
See https://www.geeksforgeeks.org/what-are-hash-functions-and-how-to-choose-a-good-hash-function
Efficiency and minimizing collisions is the key. But collisions need not be 0 (that is what __eq__
is for). For a structure, the formula is a pretty good hash function - equal structures definitely have the same formula, and formula is cheap to compute. There will be collisions but they should be relatively minimal (polymorphs basically).
@shyuep @esoteric-ephemera An alternative to adding a hash function to Structure that would work for this use case is to create an IStructure copy in the class method and call the new cached function with that copy. It would be something like this:
changing the class method to
@classmethod
def _get_reduced_structure(cls, struct: Structure, primitive_cell: bool = True, niggli: bool = True) -> Structure:
immutable_struct = IStructure.from_dict(struct.as_dict())
return _get_reduced_structure(immutable_struct, primitive_cell, niggli)
and the new function to
@lru_cache(maxsize=LRU_CACHE_SIZE)
def _get_reduced_structure(
struct: IStructure, primitive_cell: bool, niggli: bool
) -> Structure:
"""Helper method to find a reduced structure."""
if niggli:
struct = struct.get_reduced_structure(reduction_algo="niggli")
if primitive_cell:
struct = struct.get_primitive_structure()
return Structure.from_dict(struct.as_dict())
I think we have a good solution now that doesn't require hashing Structure
, and the failing ruff stuff is unrelated to this PR -should be ready for review
Merged. Thanks.
Could we push a release relatively soon (or a release candidate)? Would help with builder development on our side
Summary
This adds an LRU cache to the structure matcher, specifically to the StructureMatcher._get_reduced_structure function.
This suggested improvement is motivated by the performance of the AflowPrototypeMatcher._match_prototype function (and in turn the robocrys builder that uses it). We believe this will speed up that calculation by ~288x since the sometimes expensive calculation in StructureMatcher._get_reduced_structure will only need to be performed once per structure rather than the 288 times it is performed now.