Closed — RalfG closed this 2 months ago
Thank you @RalfG! Do you think we can use the old utility function `auxiliary.memoize` instead of a no-op dummy cache? @mobiusklein, could you also take a look at this?

P.S. Another unnecessary optimization idea: should we automatically wrap the `resolve` methods of all `ModificationResolver` subclasses, instead of doing so in each class?
Did not realize there was a built-in version of `lru_cache` in Pyteomics. I updated the code to always use that one. Just to be sure, I also set the `maxsize` to 10000; I don't expect anyone to ever get to that number.

> P.S. Another unnecessary optimization idea: should we automatically wrap the `resolve` methods of all `ModificationResolver` subclasses, instead of doing so in each class?

Not sure how to implement it like that, but feel free to modify the PR.
The `auxiliary.memoize` decorator probably isn't suitable for use on an instance method. The cache would be shared across instances, and you'd have `self` as part of the method signature being hashed, which only works because `ModificationResolver` implements `__eq__` and `__hash__`. This is the same thing that happens with the Python 3 `functools.cache` decorator (see https://docs.python.org/3/faq/programming.html#faq-cache-method-calls).
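To make the hazard concrete, here is a small stand-alone sketch (the class name is made up, and `functools.lru_cache` stands in for `memoize`) showing that a decorator applied in the class body yields one cache shared by all instances, keyed partly on `self`:

```python
import functools

class SharedCacheResolver:
    """Made-up example class; not code from the PR."""

    def __init__(self, name):
        self.name = name

    # Equal names compare and hash equal, which is what lets distinct
    # instances collide in the shared cache below.
    def __eq__(self, other):
        return isinstance(other, SharedCacheResolver) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

    # The cache is attached to the function object in the class body,
    # so every instance shares it and `self` is part of each cache key.
    @functools.lru_cache(maxsize=None)
    def resolve(self, query):
        return (self.name, query.lower())

a = SharedCacheResolver("unimod")
b = SharedCacheResolver("unimod")  # compares equal to `a`
a.resolve("Oxidation")
b.resolve("Oxidation")  # served from `a`'s entry in the shared cache
print(SharedCacheResolver.resolve.cache_info())  # hits=1, currsize=1
```

Whether the collision between `a` and `b` is a feature or a bug depends on whether two equal resolvers are guaranteed to resolve identically.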
Ideally, you'd have the instance manage the cache itself, with as much of the heavy lifting done by the base class.
```python
class ModificationResolver(object):
    def __init__(self, name, **kwargs):
        self.name = name.lower()
        self.symbol = self.name[0]
        self._database = None
        self._cache = {}

    ...

    # Move existing implementations of `resolve` to `_resolve_impl`
    def _resolve_impl(self, name=None, id=None, **kwargs):
        raise NotImplementedError()

    # Manage the cache inside `resolve`
    def resolve(self, name=None, id=None, **kwargs):
        if self._cache is None:
            return self._resolve_impl(name, id, **kwargs)
        cache_key = (name, id, frozenset(kwargs.items()))
        if cache_key in self._cache:
            return self._cache[cache_key]
        value = self._resolve_impl(name, id, **kwargs)
        self._cache[cache_key] = value
        return value
```
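As a quick sanity check of the pattern, a made-up `DictResolver` subclass (together with a trimmed copy of the base class above, so the snippet is self-contained) shows that repeated calls skip `_resolve_impl` entirely:

```python
class ModificationResolver(object):
    def __init__(self, name, **kwargs):
        self.name = name.lower()
        self.symbol = self.name[0]
        self._database = None
        self._cache = {}

    def _resolve_impl(self, name=None, id=None, **kwargs):
        raise NotImplementedError()

    def resolve(self, name=None, id=None, **kwargs):
        if self._cache is None:
            return self._resolve_impl(name, id, **kwargs)
        cache_key = (name, id, frozenset(kwargs.items()))
        if cache_key in self._cache:
            return self._cache[cache_key]
        value = self._resolve_impl(name, id, **kwargs)
        self._cache[cache_key] = value
        return value

class DictResolver(ModificationResolver):
    """Made-up subclass for demonstration: looks names up in a dict."""

    def __init__(self, name, mapping):
        super(DictResolver, self).__init__(name)
        self.mapping = mapping
        self.impl_calls = 0

    def _resolve_impl(self, name=None, id=None, **kwargs):
        self.impl_calls += 1
        return self.mapping[name]

r = DictResolver("demo", {"Oxidation": 15.9949})
r.resolve(name="Oxidation")
r.resolve(name="Oxidation")  # second call is a cache hit
print(r.impl_calls)  # _resolve_impl ran only once
```

A subclass that should not cache could simply set `self._cache = None` in its own `__init__`.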
This involves no metaprogramming and is just a bit more efficient than `memoize` too.

Otherwise, we could add a metaclass to `ModificationResolver` that automatically wraps the `resolve` method object in a `memoize` during class body execution and deals with inheritance of the cache properly. That's a lot of code for not a lot of benefit, but makes it effortless forever after.
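For concreteness, one possible shape for such a metaclass (a sketch with made-up helper names, not code from the PR; here the wrapper keeps a lazily-created per-instance cache, and a full version would need more care around cache inheritance):

```python
import functools

def _cached_resolve(func):
    """Wrap a resolve method with a per-instance dict cache (illustrative)."""
    @functools.wraps(func)
    def wrapper(self, name=None, id=None, **kwargs):
        # Lazily create the cache so subclasses need not cooperate in __init__.
        cache = getattr(self, "_resolve_cache", None)
        if cache is None:
            cache = self._resolve_cache = {}
        key = (name, id, frozenset(kwargs.items()))
        if key not in cache:
            cache[key] = func(self, name=name, id=id, **kwargs)
        return cache[key]
    return wrapper

class CachingResolverMeta(type):
    def __new__(mcls, clsname, bases, namespace):
        # Wrap `resolve` only when this class body defines its own,
        # leaving inherited (already wrapped) versions alone.
        if "resolve" in namespace:
            namespace["resolve"] = _cached_resolve(namespace["resolve"])
        return super().__new__(mcls, clsname, bases, namespace)

class Demo(metaclass=CachingResolverMeta):
    """Made-up resolver: counts how often the real work runs."""

    def __init__(self):
        self.calls = 0

    def resolve(self, name=None, id=None, **kwargs):
        self.calls += 1
        return name

d = Demo()
d.resolve(name="Oxidation")
d.resolve(name="Oxidation")  # cache hit; the body above did not run again
print(d.calls)  # 1
```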
A third way would be to split the difference with Python's name resolution system and just do this:
```python
class ModificationResolver(object):
    def __init__(self, name, **kwargs):
        self.name = name.lower()
        self.symbol = self.name[0]
        self._database = None
        self.resolve = memoize(self.resolve)
```
Take a reference to the resolved instance method `resolve` for `self`, and replace it with an attribute that happens to be a callable object wrapping that function. Arguably less obscure than metaclasses, but hard to read. It also doesn't kick in if the subclass doesn't call the base class `__init__` method. The cache still hashes `self`, but the cache isn't shared across instances.
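A tiny self-contained demonstration of that rebinding trick (with a made-up class name, and `functools.lru_cache` again standing in for `memoize`): each instance shadows the class-level `resolve` with its own cached wrapper around the bound method:

```python
import functools

class PerInstanceResolver:
    """Made-up example class; not code from the PR."""

    def __init__(self, name):
        self.name = name
        self.calls = 0
        # Shadow the class attribute with a cached wrapper around the
        # bound method. Note this creates a reference cycle
        # (instance -> wrapper -> bound method -> instance).
        self.resolve = functools.lru_cache(maxsize=None)(self.resolve)

    def resolve(self, query):
        self.calls += 1
        return (self.name, query)

a = PerInstanceResolver("unimod")
a.resolve("Oxidation")
a.resolve("Oxidation")  # hit in a's own cache
b = PerInstanceResolver("unimod")
b.resolve("Oxidation")  # miss: b's cache is separate
print(a.calls, b.calls)  # 1 1
```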
In terms of compatibility, the options differ too: with the first approach, existing subclasses that override `resolve` instead of `_resolve_impl` may behave strangely.

Which road seems best? I think letting the instance manage the cache is probably best, but it breaks things. The third option is hacky but cheap. The second option is much more work than the first, and if we're doing that, we may as well implement some kind of instance-aware caching anyway.
Thanks for the detailed answer, @mobiusklein! I don't have any hard preferences among the solutions. The first seems the most doable?
Thank you from me as well, @mobiusklein; your analysis always provides a ton of insight.
Although I would enjoy trying to get option 2 to work when I have time (which is next week), I can easily imagine shooting myself (or someone else) in the foot with it in the process. That being said, I don't exactly understand what you mean by the cache inheritance that the metaclass code would have to handle. Wouldn't it essentially do what option 1 does, "renaming" (or "de-naming"?) the original `resolve` method and replacing it with a caching version that operates on `self._cache`?

Overall I think option 1 is the most straightforward, and it allows creating subclasses that benefit from caching as well as subclasses that don't use it, so I would not object to just going with that if that seems to be the overall preference.
I had some other changes already in place for ProForma based upon the recent discussion in https://github.com/HUPO-PSI/ProForma/issues/6. These changes and the first approach I proposed are now in #148.
Thanks everyone, #148 is merged. Closing this one, but feel free to follow up as needed.
This PR adds caching to ProForma modification resolvers.

In one of our use cases - reading PSMs from a search engine and recalculating theoretical masses - the same modifications (but different object instances) are resolved many times. Adding cache decorators to the resolvers massively speeds up this operation.

I assume there are no drawbacks to adding this. If `lru_cache` cannot be imported (Python < 3.2), a dummy decorator is defined. For now, I set `maxsize` to `None`, as there should be a finite list of modifications in most realistic use cases.
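For reference, the import fallback described here presumably looks roughly like this (a sketch, not the exact code from the PR):

```python
try:
    from functools import lru_cache
except ImportError:
    # Python < 3.2: fall back to a no-op decorator with the same shape,
    # so decorated functions still work, just without caching.
    def lru_cache(maxsize=None):
        def decorator(func):
            return func
        return decorator

@lru_cache(maxsize=None)
def resolve(name):
    return name.lower()
```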