Open · snarfed opened this issue 1 year ago
Would you be using cachetools for this? Does it just keep the cache in memory? Which is fine for this I guess.
We were using a Redis cache with an App Engine thing, but that was more for caching our own app data, and it made sense to have it persist in the cache even while the instance was shut down in between.
Probably yes, cachetools in memory, since it would be separate for each incoming HTTP request; no need for a separate unified memory cache.
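For example, a minimal sketch of that kind of per-request in-memory cache with cachetools could look like this (names are illustrative, not existing code):

```python
from cachetools import TTLCache, cached
import requests

# In practice this would be created fresh for each incoming HTTP request,
# so cached entries never outlive the request that made them.
_request_cache = TTLCache(maxsize=128, ttl=300)

@cached(_request_cache)
def fetch(url):
    """Repeated GETs for the same URL during this request hit the cache."""
    return requests.get(url)
```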
Actually https://requests-cache.readthedocs.io/ looks perfect for this!
This would also be a good opportunity to start using a request-local `requests.Session` to pipeline/reuse connections per server.
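A rough sketch of what a request-local Session could look like, assuming one worker thread per inbound request (an assumption, not how the code is structured today):

```python
import threading
import requests

_local = threading.local()

def get_session():
    """Return a requests.Session scoped to the current thread.

    Reusing one Session per inbound request lets urllib3 keep connections
    alive and reuse them across multiple calls to the same server.
    """
    if not hasattr(_local, 'session'):
        _local.session = requests.Session()
    return _local.session
```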
> Actually https://requests-cache.readthedocs.io/ looks perfect for this!
It has a bug where it can deadlock with threaded use, though? Or perhaps that only happens with the SQLite backend. https://github.com/requests-cache/requests-cache/issues/820
I also suspect it doesn't handle parallel use well, since it doesn't seem to do the kind of request coalescing we've discussed in that other issue. I was going to check whether it has any kind of dogpile handling in the code, but haven't yet.
Update: I couldn't figure that out from a quick glance at the code, so I asked in a discussion there: https://github.com/orgs/requests-cache/discussions/903
Oh, there's an answer; somehow I missed the notification, but luckily I checked now anyway.
I still need to digest it more. Any thoughts there, @snarfed? https://github.com/orgs/requests-cache/discussions/903#discussioncomment-7558118
Hmm! Not yet. Honestly I'm not spending a lot of time on this; it's not yet a high priority. And as is generally true with internal architectural changes, this wouldn't be the best project for a new contributor, since it would be spread across a number of repos and packages and be somewhat delicate to get right. #685 might be better, and there are lots of other much smaller and more contained Bridgy Fed issues that would definitely be better to start with!
Also this seems pretty orthogonal to request coalescing/dogpiling, right? Yes, we'd need these sessions to be thread safe, but that's true for all of our code in general, and we usually get it for free by sharing (almost) nothing across inbound requests.
> Also this seems pretty orthogonal to request coalescing/dogpiling, right? Yes, we'd need these sessions to be thread safe, […]
No, this is actually directly about coalescing, i.e. fixing dogpiling.
The mechanism is that there is a lock per request, with the lock's id being the full query including parameters. So if a duplicate request would be made while the previous call for it is already underway, it can't get the lock, and so it doesn't make a new request but blocks waiting for the lock to be released. Then when the first query finishes and has put its result into the cache, the later requests for that same resource get past the lock and read the result from the cache. So the requests are coalesced.
That's what the test there tests, https://github.com/aio-libs/aiocache/blob/7f6bbb438592e7a02b285c33a201b832ff65deca/tests/acceptance/test_lock.py#L55
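Roughly this idea, as a minimal sketch (illustrative only, not the aiocache implementation; names like `coalesced_fetch` are made up):

```python
import threading

_locks = {}                      # one lock per key (full query + parameters)
_locks_guard = threading.Lock()
_cache = {}

def coalesced_fetch(key, do_request):
    """Only the first caller for `key` runs do_request(); concurrent
    duplicates block on the same lock and then read the cached result."""
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        if key not in _cache:    # first caller actually makes the request
            _cache[key] = do_request()
    return _cache[key]           # later callers are served from the cache
```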
Ah, I think we're talking about different things. For coalescing outbound HTTP requests, yes. I'm more interested in coalescing inbound HTTP requests though, eg a request that's expensive due to either db queries or generating keys for a new user or external requests or anything, and after handling it the first time, its response can be cached and reused or is otherwise less expensive.
> I'm more interested in coalescing inbound HTTP requests though, eg a request that's expensive due to either db queries or generating keys for a new user or external requests or anything, and after handling it the first time, its response can be cached and reused or is otherwise less expensive.
Ah, I somehow got that wrong earlier.
In fact, what I did for that work project back then was for inbound requests. The same approach works: just keep locks keyed on the request path plus arguments. When the first request comes in, it acquires the lock and starts processing. When new requests matching the same key come in in the meantime, they block on the lock and wait. When processing for the first request is done, it writes the response data to the cache and releases the lock. Then each later request proceeds to read the cache, since the lock is gone.
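Something like this, assuming a Flask-style handler (an assumption for illustration; `do_expensive_work` is a made-up placeholder):

```python
import threading
from flask import Flask, request

app = Flask(__name__)
_locks = {}
_locks_guard = threading.Lock()
_responses = {}                     # cached responses, keyed like the locks

def do_expensive_work():
    """Placeholder for the expensive part (db queries, key generation, etc.)."""
    return 'expensive result'

@app.route('/expensive')
def expensive():
    key = request.full_path         # request path + query string
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        if key not in _responses:   # first request does the real work
            _responses[key] = do_expensive_work()
    return _responses[key]          # duplicates wait, then reuse the result
```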
No worries! I probably conflated this issue and #685 myself and contributed to the confusion. This issue is indeed about caching outbound requests; #685 is for coalescing inbound.
The main problem here (in this issue) is making the same outbound GET request multiple times, in serial, within the same inbound request's processing. So the simplest first step here would probably be just a serialized request-local cache of outbound requests, without coalescing.
After that, we could definitely consider coalescing outbound requests across inbound requests, but I expect that happens less often right now, and coalescing inbound requests should hopefully subsume it.
Started looking into requests-cache just now, should work fine. I'm installing it in webutil/util.py with:

```python
from requests_cache import install_cache

install_cache(backend='memory', match_headers=['Accept'])
```
One catch is that I use `unittest.mock` to mock out requests calls right now, which bypasses it. Ideally I'd need to change that so it's exercised in tests. https://requests-cache.readthedocs.io/en/stable/user_guide/compatibility.html says it works with both https://github.com/getsentry/responses and https://github.com/jamielennox/requests-mock, so maybe I'd try porting to one of those, probably responses.
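If we go with responses, a test could exercise both the mock and the cache together, roughly like this (a sketch; the URL and test name are illustrative, and it relies on the compatibility behavior described in the requests-cache docs):

```python
import responses
from requests_cache import CachedSession

@responses.activate
def test_caching_with_mocked_http():
    responses.add(responses.GET, 'https://user.com/post',
                  body='<html></html>', status=200)

    # responses patches at the transport adapter level, so CachedSession's
    # caching logic still runs on top of the mock.
    session = CachedSession(backend='memory')
    assert not session.get('https://user.com/post').from_cache
    assert session.get('https://user.com/post').from_cache
    assert len(responses.calls) == 1   # only one (mocked) request went out
```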
I tried a first pass at this (the requests-cache setup above), and it seemed promising, but it broke in production. Saw a lot of `IncompleteRead`s on both GETs and POSTs. Rolled it back for now.
```
('Connection broken: IncompleteRead(9172 bytes read, -4586 more expected)', IncompleteRead(9172 bytes read, -4586 more expected))
...
  File "oauth_dropins/webutil/util.py", line 1673, in call
    resp = getattr((session or requests), fn)(url, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "requests_cache/session.py", line 102, in get
    return self.request('GET', url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "requests_cache/session.py", line 158, in request
    return super().request(method, url, *args, headers=headers, **kwargs)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "requests_cache/session.py", line 205, in send
    response = self._send_and_cache(request, actions, cached_response, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "requests_cache/session.py", line 233, in _send_and_cache
    self.cache.save_response(response, actions.cache_key, actions.expires)
  File "requests_cache/backends/base.py", line 89, in save_response
    cached_response = CachedResponse.from_response(response, expires=expires)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "requests_cache/models/response.py", line 102, in from_response
    obj.raw = CachedHTTPResponse.from_response(response)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "requests_cache/models/raw_response.py", line 69, in from_response
    _ = response.content  # This property reads, decodes, and stores response content
        ^^^^^^^^^^^^^^^^
  File "requests/models.py", line 899, in content
    self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "requests/models.py", line 818, in generate
    raise ChunkedEncodingError(e)
```
We duplicate outbound HTTP requests pretty badly right now. For example, `ActivityPubTest.test_inbox_reply_object`, a basic inbound AP reply to a web post, ends up fetching https://user.com/post three times: ActivityPub protocol discovery, Web protocol discovery, and webmention endpoint discovery.

We should add a per-request cache of outbound HTTP requests, at the `util.requests_*` level. The key would be URL + headers etc. That would be useful for Bridgy classic, granary, etc. too.
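A rough sketch of the shape this could take (illustrative only; `requests_get` here is a stand-in for the real `util.requests_*` wrappers, not their actual signature):

```python
import requests

_memo = {}   # would be reset at the start of each inbound request

def requests_get(url, headers=None, **kwargs):
    """Memoize outbound GETs for the duration of one inbound request."""
    key = (url, tuple(sorted((headers or {}).items())))
    if key not in _memo:
        _memo[key] = requests.get(url, headers=headers, **kwargs)
    return _memo[key]
```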