Closed: klauer closed this 2 years ago
I think this is a good approach, though I might put this logic in the _load_or_initialize() method and leave load() such that it always loads the database (for the sake of that method being intuitive). Maybe the method becomes something like _load_or_initialize_or_recall(), or something less wordy.
We'd probably want something similar in the mongo backend, though we don't really use that currently right?
In the mongo backend, the internal API won't be of the form "read the entire database file" as in the json backend; it will be more directly "hey server, please give me this specific document", so I suspect that it won't have the exact same problem as here for bulk reads - but maybe there will be other issues to consider separately.
The speed problem we have here comes from the backend reading the entire file to give us one key from it.
That is to say: I think it's definitely correct to do these optimizations per-backend rather than on the client.
RE: where to put the logic, I like Robert's suggestion to keep load as-is for direct use (give me the whole raw database right now) and wrap/cache it in the functions that the searching logic uses to facilitate these "do lots of searches at once" cases, but I want to clarify that load isn't even a required function for a backend, so it's OK to change the behavior. No code outside the backend will be calling load.
5.0s as the threshold will basically never get you in trouble (you'll "always" catch updates) while being more than sufficient to cut down the db reads in a bulk chain search from thousands to exactly 1.
I'd have absolutely agreed to a merge as-is.
OK, I think this addresses your concerns and fixes the tests (locally at least). Marking as ready for review.
For each load, a deep copy is required to keep the functionality the same as before without further modifying the client and/or tests. This will come with a small penalty in our case on the order of 10ms per cached load:
In [1]: %timeit copy.deepcopy(db)
13.7 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I don't know how that number compares to what @ZLLentz was seeing with the filesystem access for each load. I would assume it's faster.
Docs build failing on missing qs backend
Previous operations were ~6 ms per load, unfortunately - I should run the database load profiler on this branch. I need to get #264 finished and merged so it can be useful for checking performance updates.
OK, based on those performance numbers this will go back to draft then!
Yeah I think the deepcopy breaks the performance gains:
before:
Total time: 3.83411 s
File: C:\Users\zlentz\Documents\VSCode\happi\happi\backends\json_db.py
Function: _load_or_initialize at line 59
Line # Hits Time Per Hit % Time Line Contents
==============================================================
59 def _load_or_initialize(self):
60 """Load an existing database or initialize a new one."""
61 608 0.2 0.0 0.0 try:
62 608 3833.9 6.3 100.0 return self.load()
63 except FileNotFoundError:
64 logger.debug('Initializing new database')
65
66 self.initialize()
67 return self.load()
after:
Total time: 70.6864 s
File: C:\Users\zlentz\Documents\VSCode\happi\happi\backends\json_db.py
Function: _load_or_initialize at line 78
Line # Hits Time Per Hit % Time Line Contents
==============================================================
78 def _load_or_initialize(self):
79 """Load an existing database or initialize a new one."""
80 608 0.3 0.0 0.0 cache_invalid = (
81 608 0.5 0.0 0.0 self._last_loaded is None
82 607 1.7 0.0 0.0 or time.monotonic() - self._last_loaded >= self.cache_seconds
83 )
84
85 608 0.3 0.0 0.0 if cache_invalid:
86 15 0.0 0.0 0.0 try:
87 15 146.0 9.7 0.2 self._load_cache = self.load()
88 except FileNotFoundError:
89 logger.debug("Initializing new database")
90 self.initialize()
91 self._load_cache = self.load()
92 15 0.1 0.0 0.0 self._last_loaded = time.monotonic()
93
94 608 70537.5 116.0 99.8 return copy.deepcopy(self._load_cache)
I think the deepcopy scales with database size - we're basically copying the entire database in memory for every load.
I think instead we could cache the raw JSON and reload that. Let me give that a try.
Locally:
In [9]: %timeit json.loads(client.backend._raw_json_cache)
3.01 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So we at least cut out the I/O operation and just require json to reprocess things - which is generally pretty quick.
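For reference, a minimal sketch of the idea (the names _raw_json_cache and _load_from_json follow the profile below; the __init__ and the exact structure are simplified assumptions, not the branch's code):

import json

class JSONBackend:
    """Sketch: cache the raw JSON text instead of the parsed dict."""

    def __init__(self, path):
        self.path = path
        self._raw_json_cache = None

    def load(self):
        # One filesystem read; keep the raw text around for cheap re-parsing
        with open(self.path) as f:
            self._raw_json_cache = f.read()
        return self._load_from_json(self._raw_json_cache)

    def _load_from_json(self, raw_json):
        # Re-parse the cached text - no disk I/O involved
        return json.loads(raw_json)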
When you get a chance, does this look any better @ZLLentz?
25% improvement over the original
Total time: 3.01955 s
File: C:\Users\zlentz\Documents\VSCode\happi\happi\backends\json_db.py
Function: _load_or_initialize at line 86
Line # Hits Time Per Hit % Time Line Contents
==============================================================
86 def _load_or_initialize(self):
87 """Load an existing database or initialize a new one."""
88 608 2.4 0.0 0.1 if self.cache_is_valid and self._raw_json_cache is not None:
89 607 3009.5 5.0 99.7 return self._load_from_json(self._raw_json_cache)
90
91 1 0.0 0.0 0.0 try:
92 1 7.7 7.7 0.3 loaded = self.load()
93 except FileNotFoundError:
94 logger.debug("Initializing new database")
95 self.initialize()
96 loaded = self.load()
97 1 0.0 0.0 0.0 return loaded
That's a shame - is it even worth the additional complexity for 25% speed? I'm leaning toward "not really"
For each load, a deep copy is required to keep the functionality the same as before without further modifying the client and/or tests.
What internals are getting in the way here? I think if there was a way to only make the dict once we'd save even more time, making the complexity worthwhile.
The one thing I do know is that the test suite breaks due to data being modified/removed from backend dictionaries: https://app.travis-ci.com/github/pcdshub/happi/jobs/573522303 That said, it is only one test that could be reworked.
I'm just not 100% sure of the implications of changing the query result data from being unshared+mutable to shared+mutable by downstream clients. Do we ever get client metadata in typhos and mutate it for our own purposes? hutch-python? Do others?
If we decide to go that route, it'd be worthwhile to do some integration (and interactive) testing, I think.
I see, totally bizarre that a test is mutating the dictionary - we can work around that probably. The further implications of caching a mutable dict are more interesting and need some thought:
- Is there any case where a consumer is using a backend directly? Note that load isn't part of the backend API that Client uses, so the dictionary should really only be passed around within the json backend.

(I reverted the last 2 commits and fixed the test suite, keeping the cached result as mutable/shared. I think the main target of this refactor should be a large speedup, and it won't come without addressing the things we're discussing now.)

They shouldn't be, that's for sure. But we don't go to great lengths to hide the backend either (client.backend takes you right to it).
If we are to keep the cache and want to keep queries/invocations isolated from one another, I think the following (non-exhaustive list) would be good:
- copy.deepcopy for the client.find_document return value
- copy.deepcopy for any backend.find() calls that filter back to the user (there's a few spots)
- Is making the dictionary immutable a sensible thing to do? Well, maybe. More thought...
I think that could be a good thing - and faster than the deepcopy stuff above. It'd also be a potentially breaking API change...
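For the first bullet above, the change at the client boundary could be as small as this (a sketch only; _search_single is a made-up stand-in for however the client actually locates the document):

import copy

def find_document(self, **kwargs):
    # Hypothetical lookup - the real method delegates to the backend's search
    document = self._search_single(kwargs)
    # Hand back a copy so callers can't mutate the backend's cached dict
    return copy.deepcopy(document)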
Keep load as-is and add a _cached_load method that is used internally in the json backend.

Edit: on second thought, functools.lru_cache doesn't really address the re-load issues we settled on at the end. But maybe cachetools.func.ttl_cache?
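For reference, a quick sketch of what the cachetools route might look like (cachetools would be a new third-party dependency; _read_db is a made-up name, and the 5-second TTL mirrors the threshold discussed earlier):

import json
from cachetools.func import ttl_cache

@ttl_cache(maxsize=1, ttl=5.0)
def _read_db(path):
    # Re-reads the file at most once per 5 seconds for a given path.
    # Note the returned dict is shared between callers, so this has the same
    # shared/mutable concern discussed above.
    with open(path) as f:
        return json.load(f)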
A really naive thought: did we try functools.lru_cache? A really basic test:
In [7]: @lru_cache
   ...: def get_json(path):
   ...:     with open(path, 'r') as f:
   ...:         return json.load(f)
   ...:

In [8]: %%timeit -n 1
   ...: db = get_json(Path('/cds/home/r/roberttk/test/db.json'))
   ...:
The slowest run took 643.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1.13 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Running happi profile -d:
Total time: 0.00982 s
File: /cds/home/r/roberttk/devrepos/happi/happi/backends/json_db.py
Function: _load_or_initialize at line 60
Line # Hits Time Per Hit % Time Line Contents
==============================================================
60 def _load_or_initialize(self):
61 """Load an existing database or initialize a new one."""
62 598 0.3 0.0 3.2 try:
63 598 9.5 0.0 96.8 return self.load()
64 except FileNotFoundError:
65 logger.debug('Initializing new database')
66
67 self.initialize()
68 return self.load()
I'm not sure if this is what Zach was doing to benchmark things.
but maybe cachetools.func.ttl_cache?
I think the caching mechanism (whether hand-written as I had it a few commits back or imported from the standard library/some 3rd party library) isn't that important - or at the core of what we need to target here.
Recapping what we know:
1. We can cache the raw JSON string and run json.loads() on it for each backend.load() call, for a 25% speed increase
2. We can cache the loaded dictionary and return a deepcopy() of it for each load, for a performance penalty (in the case of our large database)
3. We can cache the loaded dictionary and hand it out directly, which is by far the fastest, but (3b) downstream consumers could then mutate the backend's internal state

I think we've largely come to the conclusion that we want the backend to be able to do (3), but we need to resolve how to deal with 3b in a way that is agreeable to existing happi consumers and the test suite.
I'd like to propose the following extension of the backend API:
- clear_cache on the base _Backend class, whose default implementation is a pass rather than a NotImplementedError
- clear_cache called before returning in the following functions:
  - find_document
  - _get_search_results

Backends that don't need caching can ignore this new method; backends that want to use it are free to do so. The purpose of the method is to allow the client class to clear the cache when it might send part of it to the end user, ensuring that the user can't accidentally mutate a backend internal.
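A rough sketch of that proposal, using the existing class names from the thread (the Client method body here is purely illustrative):

class _Backend:
    def clear_cache(self):
        # Default implementation: nothing cached, nothing to clear
        pass

class JSONBackend(_Backend):
    def clear_cache(self):
        self._load_cache = None

class Client:
    def find_document(self, **kwargs):
        document = ...  # search against the (possibly cached) database
        # Drop the backend's cache before the dict escapes to the user, so the
        # next load rebuilds fresh objects and user mutations can't leak back
        self.backend.clear_cache()
        return document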
We may also consider some combined search function on Client where we can chain/intersect search, search_regex, and search_range on the same cache, rather than implementing this in downstream applications or in the cli layer as is done now.
Updated the PR text based on the discussion + proposed changes. Marking as ready for review, though there may be more to touch on here.
Anecdotally, with the last commit (b822321) happi loads nearly instantaneously on atef. I'm officially 100% more excited for this.
Here's the same profile as above:
Total time: 0.0112806 s
File: C:\Users\zlentz\Documents\VSCode\happi\happi\backends\json_db.py
Function: _load_or_initialize at line 65
Line # Hits Time Per Hit % Time Line Contents
==============================================================
65 def _load_or_initialize(self):
66 """Load an existing database or initialize a new one."""
67 1 0.0 0.0 0.0 if self._load_cache is None:
68 1 0.0 0.0 0.0 try:
69 1 11.3 11.3 100.0 self._load_cache = self.load()
70 except FileNotFoundError:
71 logger.debug("Initializing new database")
72 self.initialize()
73 self._load_cache = self.load()
74
75 1 0.0 0.0 0.0 return self._load_cache
A staggering speedup that isn't helpfully expressed by a percentage!
It looks to me like the remaining db load time for the json db is largely spent in assembling the various HappiItem and associated EntryInfo objects - not much low-hanging fruit left for a json db with <1000 entries!
CI fails on flake8 of docs/release_notes.py - I guess flake8 must previously have been ignoring this whole folder.
Oh also pre-release notes?
Description
- client.search() (and such) will hit the database once and cache internally (see retain_cache_context)
- When creating each HappiItem, it hits the database again with just the identifier, retrieving the same document prior to instantiation; this is where the _get_item_from_document refactor comes in
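A usage sketch (this assumes retain_cache_context ends up as a context manager on the client, and the search keywords below are only illustrative):

from happi import Client

client = Client(path='db.json')

# Inside the context, repeated searches share a single database read
with client.retain_cache_context():
    results = client.search(beamline='HXR')
    motors = client.search_regex(name='.*motor.*')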
Motivation and Context
Addresses #235
How Has This Been Tested?
Solely the test suite
Where Has This Been Documented?
Issue and PR