snarfed / bridgy-fed

🌉 A bridge between decentralized social network protocols
https://fed.brid.gy

request-local cache of outbound HTTP requests #588

Open snarfed opened 1 year ago

snarfed commented 1 year ago

We duplicate outbound HTTP requests pretty badly right now. For example, ActivityPubTest.test_inbox_reply_object, a basic inbound AP reply to a web post, ends up fetching https://user.com/post three times: once each for ActivityPub protocol discovery, Web protocol discovery, and webmention endpoint discovery.

We should add a per-request cache of outbound HTTP requests, at the util.requests_* level. The key would be the URL plus headers, etc. That would be useful for Bridgy classic, granary, etc. too.

antont commented 1 year ago

Would you be using cachetools for this? Does it just keep the cache in memory? Which is fine for this, I guess.

We were using a Redis cache with an App Engine app, but that was more to cache our own app data, and it made sense for it to persist in the cache even when the instance shut down in between.

snarfed commented 1 year ago

Probably yes: cachetools, in memory, since it would be separate for each incoming HTTP request; no need for a separate unified memory cache.
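
A minimal sketch of what that could look like, assuming cachetools; the wrapper class and names here are illustrative, not the actual util.requests_* implementation:

# Hypothetical per-inbound-request cache of outbound GETs, keyed on URL +
# Accept header. One instance would be created per incoming HTTP request,
# so no cross-request locking is needed.
import requests
from cachetools import LRUCache

class RequestLocalCache:
    def __init__(self, maxsize=100):
        self.cache = LRUCache(maxsize=maxsize)

    def get(self, url, headers=None, **kwargs):
        key = (url, (headers or {}).get('Accept'))
        if key not in self.cache:
            self.cache[key] = requests.get(url, headers=headers, **kwargs)
        return self.cache[key]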

snarfed commented 1 year ago

Actually https://requests-cache.readthedocs.io/ looks perfect for this!

snarfed commented 1 year ago

This would also be a good opportunity to start using a request-local requests.Session to reuse connections per server.
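
A possible shape for that, assuming a Flask app; get_session is a hypothetical helper, not existing code:

# Sketch of a request-local Session, assuming Flask. flask.g is scoped to
# the current app/request context, so each inbound request gets its own
# Session, and repeated calls within that request reuse connections.
import requests
from flask import g

def get_session():
    if 'session' not in g:
        g.session = requests.Session()
    return g.session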

antont commented 1 year ago

> Actually https://requests-cache.readthedocs.io/ looks perfect for this!

It has a bug where it can deadlock under threaded use, though? Or perhaps that's only with the SQLite backend. https://github.com/requests-cache/requests-cache/issues/820

I also suspect it doesn't support parallel use well, since it doesn't seem to do the kind of request coalescing that we've discussed in that other issue. I was going to check whether it has any kind of dogpile handling in the code, but haven't yet.

Update: I couldn't figure that out from a quick glance at the code, so I asked in a discussion there: https://github.com/orgs/requests-cache/discussions/903

antont commented 12 months ago

Oh, there's an answer; somehow I missed the notification, but luckily I checked now anyway.

I still need to digest it more. Any thoughts, @snarfed? https://github.com/orgs/requests-cache/discussions/903#discussioncomment-7558118

snarfed commented 12 months ago

Hmm! Not yet. Honestly I'm not spending a lot of time on this; it's not a high priority yet. And as is generally true of internal architectural changes, this wouldn't be the best project for a new contributor, since it would be spread across a number of repos and packages and be somewhat delicate to get right. #685 might be better, and there are lots of other, much smaller and more contained Bridgy Fed issues that would definitely be better to start with!

snarfed commented 12 months ago

Also this seems pretty orthogonal to request coalescing/dogpiling, right? Yes, we'd need these sessions to be thread safe, but that's true for all of our code in general, and we usually get it for free by sharing (almost) nothing across inbound requests.

antont commented 11 months ago

> Also this seems pretty orthogonal to request coalescing/dogpiling, right? Yes, we'd need these sessions to be thread safe,

No, this is actually directly about coalescing, i.e. fixing dogpiling.

The mechanism is that there's a lock per request, keyed on the full query with parameters. If a duplicate request would be made while the previous call for it is already underway, it can't acquire the lock, so it doesn't make a new request; it blocks waiting for the lock to be released. When the first query finishes and has put its result in the cache, the later requests for that same resource get past the lock and read the result from the cache. So the requests are coalesced.

That's what the test there verifies: https://github.com/aio-libs/aiocache/blob/7f6bbb438592e7a02b285c33a201b832ff65deca/tests/acceptance/test_lock.py#L55
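
Roughly, that mechanism looks like the following thread-based sketch; the names are hypothetical, not aiocache's actual implementation:

# Sketch of outbound request coalescing via per-key locks. The key is the
# full request: URL plus parameters.
import threading
import requests

_guard = threading.Lock()   # protects the lock table itself
_locks = {}                 # request key -> per-key Lock
_cache = {}                 # request key -> cached response

def coalesced_get(url, params=None):
    key = (url, tuple(sorted((params or {}).items())))
    with _guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:  # duplicate callers block here until the first one finishes
        if key not in _cache:
            _cache[key] = requests.get(url, params=params)
        return _cache[key]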

snarfed commented 11 months ago

Ah, I think we're talking about different things. For coalescing outbound HTTP requests, yes. I'm more interested in coalescing inbound HTTP requests though, eg a request that's expensive due to db queries, generating keys for a new user, external requests, or anything else, where after handling it the first time, its response can be cached and reused or is otherwise less expensive.

antont commented 10 months ago

> I'm more interested in coalescing inbound HTTP requests though, eg a request that's expensive due to db queries, generating keys for a new user, external requests, or anything else, where after handling it the first time, its response can be cached and reused or is otherwise less expensive.

Ah, I somehow got that wrong earlier.

In fact, what I did for the work thing back then was for inbound. The same approach works: just have locks keyed on the request path plus arguments. When the first request comes in, it acquires the lock and starts processing. When new requests matching the same key come in in the meantime, they block on the lock and wait. When processing for the first request is done, it writes the response data to the cache and releases the lock. Then each later request proceeds to read the cache, since the lock is gone.
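
Applied to inbound requests, that could look something like this sketch, assuming a Flask view; the decorator and cache names are made up, and a real version would expire cache entries:

# Illustrative inbound coalescing, keyed on request path + query args.
import threading
from functools import wraps
from flask import request

_guard = threading.Lock()
_locks = {}       # (path, args) -> Lock
_responses = {}   # (path, args) -> cached view response

def coalesce_inbound(view):
    @wraps(view)
    def wrapper(*args, **kwargs):
        key = (request.path, tuple(sorted(request.args.items())))
        with _guard:
            lock = _locks.setdefault(key, threading.Lock())
        with lock:  # later duplicate requests wait here, then read the cache
            if key not in _responses:
                _responses[key] = view(*args, **kwargs)
            return _responses[key]
    return wrapper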

snarfed commented 10 months ago

No worries! I probably confused this issue with #685 and contributed to the confusion. This issue is indeed about caching outbound requests; #685 is for coalescing inbound.

The main problem here (in this issue) is making the same outbound GET request multiple times, in serial, within the same inbound request's processing. So the simplest first step here would probably be just a serialized request-local cache of outbound requests, without coalescing.

After that, we could definitely consider coalescing outbound requests across inbound requests, but I expect that happens less often right now, and coalescing inbound requests should hopefully subsume it.

snarfed commented 9 months ago

Started looking into requests-cache just now; it should work fine. I'm installing it in webutil/util.py with:

from requests_cache import install_cache
install_cache(backend='memory', match_headers=['Accept'])

One catch is that I currently use unittest.mock to mock out requests calls in tests, which bypasses it. Ideally I'd change that so the cache is exercised in tests. https://requests-cache.readthedocs.io/en/stable/user_guide/compatibility.html says it works with both https://github.com/getsentry/responses and https://github.com/jamielennox/requests-mock, so maybe I'll try porting to one of those, probably responses.
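
For illustration, a test ported to responses might look like this; the URL and body are made up, not an actual test:

# Hypothetical test using the responses library. Per requests-cache's
# compatibility docs, the cache layer still sees the mocked transport.
import requests
import responses

@responses.activate
def test_fetch_post():
    responses.add(responses.GET, 'https://user.com/post',
                  body='<html>a post</html>', status=200)
    resp = requests.get('https://user.com/post')
    assert resp.status_code == 200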

snarfed commented 8 months ago

^ Tried a first pass at this; it seemed promising, but broke in production. Saw a lot of IncompleteReads on both GETs and POSTs. Rolled it back for now.

('Connection broken: IncompleteRead(9172 bytes read, -4586 more expected)', IncompleteRead(9172 bytes read, -4586 more expected))
...
File "oauth_dropins/webutil/util.py", line 1673, in call
  resp = getattr((session or requests), fn)(url, *args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "requests_cache/session.py", line 102, in get
  return self.request('GET', url, params=params, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "requests_cache/session.py", line 158, in request
  return super().request(method, url, *args, headers=headers, **kwargs)  # type: ignore
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "requests/sessions.py", line 589, in request
  resp = self.send(prep, **send_kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "requests_cache/session.py", line 205, in send
  response = self._send_and_cache(request, actions, cached_response, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "requests_cache/session.py", line 233, in _send_and_cache
  self.cache.save_response(response, actions.cache_key, actions.expires)
File "requests_cache/backends/base.py", line 89, in save_response
  cached_response = CachedResponse.from_response(response, expires=expires)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "requests_cache/models/response.py", line 102, in from_response
  obj.raw = CachedHTTPResponse.from_response(response)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "requests_cache/models/raw_response.py", line 69, in from_response
  _ = response.content  # This property reads, decodes, and stores response content
      ^^^^^^^^^^^^^^^^
File "requests/models.py", line 899, in content
  self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "requests/models.py", line 818, in generate
  raise ChunkedEncodingError(e)