levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

[proforma] Use cache on modification resolvers #147

Closed. RalfG closed this 2 months ago.

RalfG commented 2 months ago

This PR adds caching to proforma modification resolvers.

In one of our use cases - reading PSMs from a search engine and recalculating theoretical masses - the same modifications (but different object instances) are resolved many times. Adding cache decorators to the resolvers massively speeds up this operation.

I assume there are no drawbacks to adding this. If lru_cache cannot be imported (Python < 3.2), a dummy decorator is defined. For now, I set maxsize to None, as there should be a finite list of modifications in most realistic use cases.
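The fallback described above can be sketched as follows. This is an illustrative reconstruction, not the exact code from the PR; `expensive` is a toy function standing in for a modification resolver:

```python
try:
    from functools import lru_cache
except ImportError:  # Python < 3.2: fall back to a no-op decorator
    def lru_cache(maxsize=None):
        def decorator(func):
            return func  # no caching, just return the function unchanged
        return decorator

@lru_cache(maxsize=None)
def expensive(x):
    # stands in for an expensive modification lookup
    return x + 1

print(expensive(41))  # 42 (cached on repeat calls where lru_cache is available)
```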

levitsky commented 2 months ago

Thank you @RalfG ! Do you think we can use the old utility function auxiliary.memoize instead of a no-op dummy cache?

@mobiusklein could you also take a look at this?

P.S. Another (possibly unnecessary) optimization idea: should we automatically wrap the resolve methods of all ModificationResolver subclasses, instead of doing so in each class?

RalfG commented 2 months ago

Did not realize there was a built-in version of lru_cache in Pyteomics. I updated the code to always use that one. Just to be sure, I also set the maxsize to 10000. I don't expect anyone to ever get to that number.
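I have not checked the exact signature of auxiliary.memoize, but a size-bounded memoizer in the same spirit (LRU eviction once maxsize is reached) might look like this sketch; the names `memoize` and `f` here are illustrative only:

```python
import functools
from collections import OrderedDict

def memoize(maxsize=10000):
    """Illustrative LRU-style memoizer, not pyteomics' actual implementation."""
    def decorator(func):
        cache = OrderedDict()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, frozenset(kwargs.items()))
            if key in cache:
                cache.move_to_end(key)  # mark as most recently used
                return cache[key]
            result = func(*args, **kwargs)
            cache[key] = result
            if len(cache) > maxsize:
                cache.popitem(last=False)  # evict the least recently used entry
            return result
        return wrapper
    return decorator

@memoize(maxsize=2)
def f(x):
    return x * x

f(1); f(2); f(3)   # the third call evicts the entry for 1
print(f(2))        # 4, served from the cache
```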

RalfG commented 2 months ago

> P.S. Another unnecessary optimization idea: should we auto-wrap all resolve methods of ModificationResolver subclasses automatically, without doing so in each class?

Not sure how to implement it like that, but feel free to modify the PR.

mobiusklein commented 2 months ago

The auxiliary.memoize decorator probably isn't suitable for use on an instance method: the cache would be shared across instances, and self would be part of the hashed method signature, which only works because ModificationResolver implements __eq__ and __hash__. The same thing happens with Python 3's functools.cache decorator (see https://docs.python.org/3/faq/programming.html#faq-cache-method-calls).
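A minimal demonstration of the shared-cache behavior, using functools.lru_cache on a toy `Resolver` class (the class and its `resolve` method are illustrative, not the pyteomics API): both instances write into the single class-level cache, keyed on `(self, args)`, which also keeps the instances alive for as long as the cache holds them.

```python
from functools import lru_cache

class Resolver(object):
    def __init__(self, name):
        self.name = name

    # hashing by name makes `self` usable as part of the cache key
    def __eq__(self, other):
        return isinstance(other, Resolver) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

    @lru_cache(maxsize=None)  # one cache on the function object, shared by all instances
    def resolve(self, key):
        return (self.name, key.upper())

a = Resolver("unimod")
b = Resolver("psi-mod")
a.resolve("oxidation")
b.resolve("oxidation")
# a single class-level cache now holds one entry per (instance, key) pair
print(Resolver.resolve.cache_info().currsize)  # 2
```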

Ideally, you'd have the instance manage the cache itself, with as much of the heavy lifting done by the base class.

```python
class ModificationResolver(object):
    def __init__(self, name, **kwargs):
        self.name = name.lower()
        self.symbol = self.name[0]
        self._database = None
        self._cache = {}

    ...

    # Move existing implementations of `resolve` to `_resolve_impl`
    def _resolve_impl(self, name=None, id=None, **kwargs):
        raise NotImplementedError()

    # Manage the cache inside `resolve`
    def resolve(self, name=None, id=None, **kwargs):
        if self._cache is None:
            return self._resolve_impl(name, id, **kwargs)
        cache_key = (name, id, frozenset(kwargs.items()))
        if cache_key in self._cache:
            return self._cache[cache_key]
        value = self._resolve_impl(name, id, **kwargs)
        self._cache[cache_key] = value
        return value
```

This involves no metaprogramming and is just a bit more efficient than memoize too.
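A quick usage sketch of that pattern, restating the base class compactly and adding a toy subclass (`CountingResolver` is illustrative, not part of pyteomics) to show that `_resolve_impl` runs only once per distinct argument set:

```python
class ModificationResolver(object):
    def __init__(self):
        self._cache = {}

    def _resolve_impl(self, name=None, id=None, **kwargs):
        raise NotImplementedError()

    def resolve(self, name=None, id=None, **kwargs):
        if self._cache is None:  # subclasses may disable caching this way
            return self._resolve_impl(name, id, **kwargs)
        cache_key = (name, id, frozenset(kwargs.items()))
        if cache_key not in self._cache:
            self._cache[cache_key] = self._resolve_impl(name, id, **kwargs)
        return self._cache[cache_key]

class CountingResolver(ModificationResolver):
    calls = 0  # track how often the real lookup runs

    def _resolve_impl(self, name=None, id=None, **kwargs):
        CountingResolver.calls += 1
        return {"name": name, "id": id}

r = CountingResolver()
r.resolve(name="Oxidation")
r.resolve(name="Oxidation")  # served from the per-instance cache
print(CountingResolver.calls)  # 1
```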

Otherwise, we could add a metaclass to ModificationResolver that automatically wraps the resolve method object in a memoize during class body execution and deals with inheritance of the cache properly. That's a lot of code for not a lot of benefit, but makes it effortless forever after.
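For concreteness, the metaclass route might look roughly like the sketch below (Python 3 syntax; all names here are hypothetical, and a real version would also need to handle the cache-inheritance subtleties mentioned above). The metaclass rewrites any `resolve` defined in a class body so that results are stored in a dict on the instance rather than on the class:

```python
import functools

def _instance_cached(func):
    # wrap `resolve` so results live on the instance, not the class
    @functools.wraps(func)
    def wrapper(self, name=None, id=None, **kwargs):
        cache = self.__dict__.setdefault("_resolve_cache", {})
        key = (name, id, frozenset(kwargs.items()))
        if key not in cache:
            cache[key] = func(self, name, id, **kwargs)
        return cache[key]
    return wrapper

class CachingMeta(type):
    def __new__(mcls, clsname, bases, namespace):
        if "resolve" in namespace:  # wrap during class body execution
            namespace["resolve"] = _instance_cached(namespace["resolve"])
        return super().__new__(mcls, clsname, bases, namespace)

class ModResolver(metaclass=CachingMeta):
    def resolve(self, name=None, id=None, **kwargs):
        return ("resolved", name, id)

r = ModResolver()
r.resolve(name="Oxidation")
r.resolve(name="Oxidation")   # cache hit; only one entry stored
print(len(r._resolve_cache))  # 1
```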

A third way would be to split the difference with Python's name resolution system and just do this:

```python
class ModificationResolver(object):
    def __init__(self, name, **kwargs):
        self.name = name.lower()
        self.symbol = self.name[0]
        self._database = None
        self.resolve = memoize(self.resolve)
```

Take a reference to the resolved instance method resolve for self, and replace it with an instance attribute that happens to be a callable object wrapping that function. Arguably less obscure than metaclasses, but still hard to read. It also doesn't kick in if a subclass doesn't call the base class __init__ method. The cache still hashes self, but it isn't shared across instances.
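A self-contained sketch of this third approach, using functools.lru_cache in place of memoize (the `Resolver` class is a toy stand-in): the instance attribute shadows the class-level method, so each instance gets its own cache. Note that this also creates a reference cycle (instance → cached wrapper → bound method → instance), so such instances are reclaimed only by the cycle collector.

```python
from functools import lru_cache

class Resolver(object):
    def __init__(self, name):
        self.name = name
        # shadow the class-level method with a cached bound-method wrapper;
        # the cache belongs to this instance alone
        self.resolve = lru_cache(maxsize=None)(self.resolve)

    def resolve(self, key):
        return (self.name, key)

r = Resolver("unimod")
r.resolve("Oxidation")
r.resolve("Oxidation")               # second call is a cache hit
print(r.resolve.cache_info().hits)   # 1
```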

In terms of compatibility, the three options differ:

  1. Not a breaking change per se, but subclasses which call the base class resolve instead of _resolve_impl may behave strangely.
  2. Not a breaking change. Everything is handled by the type system, but it is probably hard to get right on the first try.
  3. Theoretically a breaking change, but the "breaking" is just that no caching is used by a subclass.

Which road seems best? I think letting the instance manage the cache (option 1) is probably best, though it can subtly change behavior for existing subclasses. The third option is hacky but cheap. The second is much more work than the first, and if we're doing that, we may as well implement some kind of instance-aware caching anyway.

RalfG commented 2 months ago

Thanks for the detailed answer, @mobiusklein! I don't have any hard preferences for either of the solutions. The first seems the most doable?

levitsky commented 2 months ago

Thank you from me as well, @mobiusklein; your analysis always provides a ton of insight.

Although I would enjoy trying to get option 2 to work when I have time (which is next week), I can easily imagine shooting myself (or someone else) in the foot with it in the process. That being said, I don't exactly understand what you mean by cache inheritance that the metaclass code would have to handle. Wouldn't it essentially do what option 1 does, "renaming" (or "de-naming"?) the original resolve method and replacing it with a caching version, operating on self._cache?

Overall I think option 1 is the most straightforward, and it allows creating subclasses that benefit from caching as well as subclasses that don't use it, so I would not object to just going with that if it seems to be the overall preference.

mobiusklein commented 2 months ago

I had some other changes already in place for ProForma based upon the recent discussion in https://github.com/HUPO-PSI/ProForma/issues/6. These changes and the first approach I proposed are now in #148.

levitsky commented 2 months ago

Thanks everyone, #148 is merged. Closing this one, but feel free to follow up as needed.