pcdshub / happi

Heuristic Access to Positions of Photon Instruments
https://pcdshub.github.io/happi/master

REF/PERF: caching mechanism for happi Client and backend databases #265

Closed klauer closed 2 years ago

klauer commented 2 years ago

Description

Motivation and Context

Addresses #235

How Has This Been Tested?

Solely the test suite

Where Has This Been Documented?

Issue and PR

tangkong commented 2 years ago

I think this is a good approach, though I might put this logic in the _load_or_initialize() method and leave load() such that it always loads the database (for the sake of that method being intuitive). Maybe the method becomes something like _load_or_initialize_or_recall(), or something less wordy.

We'd probably want something similar in the mongo backend, though we don't really use that currently right?

ZLLentz commented 2 years ago

We'd probably want something similar in the mongo backend, though we don't really use that currently right?

In the mongo backend, the internal API won't be of the form "read the entire database file" as in the json backend; it will be more directly "hey server, please give me this specific document". So I suspect it won't have the exact same problem as here for bulk reads, but maybe there will be other issues to consider separately.

The speed problem we have here comes from the backend reading the entire file to give us one key from it.

That is to say: I think it's definitely correct to do these optimizations per-backend rather than on the client.

RE: where to put the logic, I like Robert's suggestion to keep load as-is for direct use ("give me the whole raw database right now") and wrap/cache it in the functions that the searching logic uses, to facilitate these "do lots of searches at once" cases. I want to clarify, though, that load isn't even a required function for a backend, so it's OK to change its behavior: no code outside the backend will be calling load.

A 5.0 s threshold will basically never get you in trouble (you'll "always" catch updates) while being more than sufficient to cut the db reads in a bulk chain search from thousands down to exactly 1.

I'd have absolutely agreed to a merge as-is.
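The time-threshold idea described above can be sketched as a small monotonic-clock cache. This is an illustrative standalone sketch, not the actual happi backend code; the names CachedLoader and cache_seconds are assumptions for illustration:

```python
import time


class CachedLoader:
    """Sketch: cache the result of an expensive load() for a few seconds."""

    def __init__(self, load_func, cache_seconds=5.0):
        self._load_func = load_func
        self.cache_seconds = cache_seconds
        self._cache = None
        self._last_loaded = None

    def load(self):
        now = time.monotonic()
        expired = (
            self._last_loaded is None
            or now - self._last_loaded >= self.cache_seconds
        )
        if expired:
            # Only hit the real (slow) load when the window has elapsed.
            self._cache = self._load_func()
            self._last_loaded = now
        return self._cache
```

With a 5.0 s window, a bulk chain search that would otherwise re-read the file thousands of times performs exactly one real read.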

klauer commented 2 years ago

OK, I think this addresses your concerns and fixes the tests (locally at least). Marking as ready for review.

For each load, a deep copy is required to keep the functionality the same as before without further modifying the client and/or tests. This will come with a small penalty in our case on the order of 10ms per cached load:

In [1]: %timeit copy.deepcopy(db)
13.7 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I don't know how that number compares to what @ZLLentz was seeing with the filesystem access for each load. I would assume it's faster.

ZLLentz commented 2 years ago

Docs build failing on missing qs backend

Previous operations were ~6 ms per load, unfortunately. I should run the database load profiler on this branch. I need to get #264 finished and merged so it can be useful for checking performance updates.

klauer commented 2 years ago

OK, based on those performance numbers this will go back to draft then!

ZLLentz commented 2 years ago

Yeah I think the deepcopy breaks the performance gains:

before:

Total time: 3.83411 s
File: C:\Users\zlentz\Documents\VSCode\happi\happi\backends\json_db.py
Function: _load_or_initialize at line 59

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    59                                               def _load_or_initialize(self):
    60                                                   """Load an existing database or initialize a new one."""
    61       608          0.2      0.0      0.0          try:
    62       608       3833.9      6.3    100.0              return self.load()
    63                                                   except FileNotFoundError:
    64                                                       logger.debug('Initializing new database')
    65
    66                                                   self.initialize()
    67                                                   return self.load()

after:

Total time: 70.6864 s
File: C:\Users\zlentz\Documents\VSCode\happi\happi\backends\json_db.py
Function: _load_or_initialize at line 78

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    78                                               def _load_or_initialize(self):
    79                                                   """Load an existing database or initialize a new one."""
    80       608          0.3      0.0      0.0          cache_invalid = (
    81       608          0.5      0.0      0.0              self._last_loaded is None
    82       607          1.7      0.0      0.0              or time.monotonic() - self._last_loaded >= self.cache_seconds
    83                                                   )
    84
    85       608          0.3      0.0      0.0          if cache_invalid:
    86        15          0.0      0.0      0.0              try:
    87        15        146.0      9.7      0.2                  self._load_cache = self.load()
    88                                                       except FileNotFoundError:
    89                                                           logger.debug("Initializing new database")
    90                                                           self.initialize()
    91                                                           self._load_cache = self.load()
    92        15          0.1      0.0      0.0              self._last_loaded = time.monotonic()
    93
    94       608      70537.5    116.0     99.8          return copy.deepcopy(self._load_cache)

ZLLentz commented 2 years ago

I think the deepcopy scales with database size- we're basically copying the entire database in memory for every load
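The scaling can be seen directly: copy.deepcopy walks every nested object, so its cost grows with the number of entries. A rough standalone illustration (not a happi benchmark; the dict shape is an assumption loosely modeled on a happi JSON database):

```python
import copy
import time


def time_deepcopy(n_entries):
    # Build a dict shaped loosely like a happi JSON database:
    # one nested metadata dict per named entry.
    db = {
        f"item_{i}": {"name": f"item_{i}", "z": float(i)}
        for i in range(n_entries)
    }
    start = time.perf_counter()
    copy.deepcopy(db)
    return time.perf_counter() - start


# Every load pays this cost again: the whole database is copied in memory,
# so larger databases make every single load proportionally slower.
small = time_deepcopy(100)
large = time_deepcopy(10_000)
```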

klauer commented 2 years ago

I think instead we could cache the raw JSON and reload that. Let me give that a try.

klauer commented 2 years ago

Locally:

In [9]: %timeit json.loads(client.backend._raw_json_cache)
3.01 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So we at least cut out the I/O operation and just require json to reprocess things - which is generally pretty quick.

When you get a chance, this look any better @ZLLentz ?
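The raw-text caching approach can be sketched like this (illustrative names and structure; the real backend differs):

```python
import json


class RawJSONCache:
    """Sketch: cache the file's raw text, re-deserialize on every load.

    This skips the disk read but still pays json.loads() each time,
    so every caller gets a fresh, independently mutable dict.
    """

    def __init__(self, path):
        self.path = path
        self._raw_json_cache = None

    def load(self):
        if self._raw_json_cache is None:
            # Only the first load touches the filesystem.
            with open(self.path) as f:
                self._raw_json_cache = f.read()
        return json.loads(self._raw_json_cache)
```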

ZLLentz commented 2 years ago

25% improvement over the original

Total time: 3.01955 s
File: C:\Users\zlentz\Documents\VSCode\happi\happi\backends\json_db.py
Function: _load_or_initialize at line 86

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    86                                               def _load_or_initialize(self):
    87                                                   """Load an existing database or initialize a new one."""
    88       608          2.4      0.0      0.1          if self.cache_is_valid and self._raw_json_cache is not None:
    89       607       3009.5      5.0     99.7              return self._load_from_json(self._raw_json_cache)
    90
    91         1          0.0      0.0      0.0          try:
    92         1          7.7      7.7      0.3              loaded = self.load()
    93                                                   except FileNotFoundError:
    94                                                       logger.debug("Initializing new database")
    95                                                       self.initialize()
    96                                                       loaded = self.load()
    97         1          0.0      0.0      0.0          return loaded

klauer commented 2 years ago

That's a shame. Is it even worth the additional complexity for a 25% speedup? I'm leaning toward "not really".

ZLLentz commented 2 years ago

For each load, a deep copy is required to keep the functionality the same as before without further modifying the client and/or tests.

What internals are getting in the way here? I think if there was a way to only make the dict once we'd save even more time, making the complexity worthwhile.

klauer commented 2 years ago

What internals are getting in the way here? I think if there was a way to only make the dict once we'd save even more time, making the complexity worthwhile.

The one thing I do know is that the test suite breaks due to data being modified/removed from backend dictionaries: https://app.travis-ci.com/github/pcdshub/happi/jobs/573522303 That said, it is only one test, and it could be reworked.

I'm just not 100% sure of the implications of changing the query result data from being unshared+mutable to shared+mutable by downstream clients. Do we ever get client metadata in typhos and mutate it for our own purposes? hutch-python? Do others?

If we decide to go that route, it'd be worthwhile to do some integration (and interactive) testing, I think.

ZLLentz commented 2 years ago

I see, totally bizarre that a test is mutating the dictionary; we can probably work around that. The further implications of caching a mutable dict are more interesting and need some thought:

  1. Is there any case where a consumer is using a backend directly? Note that load isn't part of the backend API that Client uses, so the dictionary should really only be passed around within the json backend.
  2. Is making the dictionary immutable a sensible thing to do? Well, maybe. More thought...
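One stdlib way to hand out a cached dict without allowing mutation is types.MappingProxyType, which wraps a dict in a read-only view. Note it is shallow: nested dicts remain mutable and would need wrapping as well. A minimal sketch, not part of happi:

```python
from types import MappingProxyType

db = {"item_1": {"name": "item_1", "z": 3.0}}
view = MappingProxyType(db)

# Reads work as usual through the proxy:
assert view["item_1"]["name"] == "item_1"

# Top-level writes raise TypeError:
try:
    view["item_2"] = {}
except TypeError:
    pass
else:
    raise AssertionError("expected MappingProxyType to be read-only")

# Caveat: the proxy is shallow; nested dicts are still mutable.
view["item_1"]["z"] = 4.0
assert db["item_1"]["z"] == 4.0
```

This avoids the deepcopy cost entirely, at the price of being a potentially breaking API change for any consumer that mutates query results today.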
klauer commented 2 years ago

(I reverted the last 2 commits and fixed the test suite keeping the cached result as mutable/shared. I think the main target of this refactor should be a large speedup, and it won't come without addressing the things we're discussing now.)

  • Is there any case where a consumer is using a backend directly? Note that load isn't part of the backend API that Client uses, so the dictionary should really only be passed around within the json backend.

They shouldn't be, that's for sure. But we don't go to great lengths to hide the backend either (client.backend takes you right to it).

If we are to keep the cache and want to keep queries/invocations isolated from one another, I think the following (non-exhaustive list) would be good:

  1. a copy.deepcopy for the client.find_document return value
  2. copy.deepcopy for any backend.find() calls that filter back to the user (there's a few spots)
  • Is making the dictionary immutable a sensible thing to do? Well, maybe. More thought...

I think that could be a good thing - and faster than the deepcopy stuff above. It'd also be a potentially breaking API change...

tangkong commented 2 years ago

Edit: on second thought functools.lru_cache doesn't really address the re-load issues we settled on at the end. but maybe cachetools.func.ttl_cache?

A really naive thought: did we try functools.lru_cache? A really basic test:

In [7]: @lru_cache
   ...: def get_json(path):
   ...:     with open(path, 'r') as f:
   ...:         return json.load(f)
   ...:

In [8]: %%timeit -n 1
   ...: db = get_json(Path('/cds/home/r/roberttk/test/db.json'))
   ...:
The slowest run took 643.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1.13 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

running happi profile -d

Total time: 0.00982 s
File: /cds/home/r/roberttk/devrepos/happi/happi/backends/json_db.py
Function: _load_or_initialize at line 60

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    60                                               def _load_or_initialize(self):
    61                                                   """Load an existing database or initialize a new one."""
    62       598          0.3      0.0      3.2          try:
    63       598          9.5      0.0     96.8              return self.load()
    64                                                   except FileNotFoundError:
    65                                                       logger.debug('Initializing new database')
    66                                           
    67                                                   self.initialize()
    68                                                   return self.load()

I'm not sure if this is what Zach was doing to benchmark things.
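cachetools.func.ttl_cache is a third-party option; with only the stdlib, a common trick approximates a TTL by feeding lru_cache a coarse time bucket as an extra cache key. A hedged sketch (illustrative names, not happi code):

```python
import functools
import json
import time


@functools.lru_cache(maxsize=1)
def _load_json_bucketed(path, time_bucket):
    # time_bucket only participates as part of the cache key; when the
    # clock rolls into a new bucket, the cache misses and we re-read.
    with open(path) as f:
        return json.load(f)


def get_json(path, ttl=5.0):
    """Re-read the file at most roughly once per `ttl` seconds.

    Caveat: repeat calls within a bucket return the very same
    mutable dict, so the shared-mutability question remains.
    """
    return _load_json_bucketed(path, int(time.monotonic() // ttl))
```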

klauer commented 2 years ago

but maybe cachetools.func.ttl_cache?

I think the caching mechanism (whether hand-written as I had it a few commits back or imported from the standard library/some 3rd party library) isn't that important - or at the core of what we need to target here.

Recapping what we know:

  1. We can cache the raw JSON contents and call json.loads() on them for each backend.load() call for a 25% speed increase
  2. We can cache the deserialized JSON dictionary of the database and do a deepcopy() for a performance penalty (in the case of our large database)
  3. We can cache the deserialized JSON dictionary of the database for a massive performance boost:
     a. Loads in effectively constant time regardless of the number of calls made within a 5 second period
     b. Caveat: this passes the same mutable dictionary back to the client/user

I think we've largely come to the conclusion that we want the backend to be able to do (3), but we need to resolve how to deal with 3b in a way that is agreeable to existing happi consumers and the test suite.

ZLLentz commented 2 years ago

I'd like to propose the following extension of the backend API:

Backends that don't need caching can ignore this new method, backends that want to use it are free to do so. The purpose of the method is to allow the client class to clear the cache when it might send part of it to the end user, ensuring that the user can't accidentally mutate a backend internal.

ZLLentz commented 2 years ago

We may also consider some combined search function on Client where we can chain/intersect search, search_regex, and search_range on the same cache, rather than implementing this in downstream applications or in the CLI layer as is done now.
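Such a combined search could run each sub-filter once against the same cached snapshot and intersect the results. A standalone sketch of the idea (this is not the happi Client API; chain_search and its parameters are hypothetical):

```python
import re


def chain_search(items, equals=None, regex=None, range_=None):
    """Intersect several optional filters over one in-memory db snapshot.

    items:  iterable of metadata dicts (one snapshot read, many filters).
    equals: dict of field -> exact value
    regex:  dict of field -> pattern string
    range_: dict of field -> (low, high), inclusive low, exclusive high
    """
    results = []
    for item in items:
        if equals and any(item.get(f) != v for f, v in equals.items()):
            continue
        if regex and any(
            not re.search(p, str(item.get(f, ""))) for f, p in regex.items()
        ):
            continue
        if range_ and any(
            not (lo <= item.get(f, float("nan")) < hi)
            for f, (lo, hi) in range_.items()
        ):
            continue
        results.append(item)
    return results
```

Because all three filter kinds walk the same snapshot, the database is read once per combined query instead of once per sub-search.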

klauer commented 2 years ago

Updated the PR text based on the discussion + proposed changes. Marking as ready for review, though there may be more to touch on here.

klauer commented 2 years ago

Anecdotally, with the last commit (b822321), happi loads nearly instantaneously on atef. I'm officially 100% more excited for this.

ZLLentz commented 2 years ago

Here's the same profile as above:

Total time: 0.0112806 s
File: C:\Users\zlentz\Documents\VSCode\happi\happi\backends\json_db.py
Function: _load_or_initialize at line 65

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    65                                               def _load_or_initialize(self):
    66                                                   """Load an existing database or initialize a new one."""
    67         1          0.0      0.0      0.0          if self._load_cache is None:
    68         1          0.0      0.0      0.0              try:
    69         1         11.3     11.3    100.0                  self._load_cache = self.load()
    70                                                       except FileNotFoundError:
    71                                                           logger.debug("Initializing new database")
    72                                                           self.initialize()
    73                                                           self._load_cache = self.load()
    74
    75         1          0.0      0.0      0.0          return self._load_cache

A staggering speedup that isn't helpfully expressed by a percentage!

ZLLentz commented 2 years ago

It looks to me like the remaining db load time for the json db is largely spent assembling the various HappiItem and associated EntryInfo objects; there's not much low-hanging fruit left for a json db with <1000 entries!

ZLLentz commented 2 years ago

CI fails on flake8 of docs/release_notes.py. I guess flake8 must previously have been ignoring this whole folder.

tangkong commented 2 years ago

Oh also pre-release notes?