karpetrosyan / hishel

An elegant HTTP Cache implementation for HTTPX and HTTP Core.
https://hishel.com
BSD 3-Clause "New" or "Revised" License
175 stars 22 forks source link

Potential bug : Cache is not updated when a revalidation request is issued (AsyncTransport, but synchronous CacheTransport looks the same) #239

Closed hartym closed 4 months ago

hartym commented 4 months ago

Hello,

First, thanks for the excellent work on hishel, the codebase is a pleasure to work with.

I'm running into what I believe is an issue (but may very well be a misunderstanding from me).

My setup uses an asynchronous httpx client, and thus I'm also using the async hishel interface (although it looks that the synchronous interface use the same algorithm). hishel.Controller is configured with allow_heuristics=True and allow_stale=True (although I'm not sure that matters, as what I see seems to come from the transport implementation).

I'm making simple requests to an httpbin's "cache/10" endpoint, which is a simple endpoint that returns a response containing a cache-control: public, max-age=10 header.

The first request is made and the response is returned as-is by hishel (as expected), and the cache entry is created (as expected). The second request (before max-age expires) is returned from cache, all good too. But now if I wait for the max-age to expire, the next request will get revalidated (which is expected, I guess, or at least that looks logical to me). But this revalidation request, that gets once again the same response (max-age 10), is not stored into cache. All future requests will be revalidated too, making the cache for this cache key completely ignored forever from this point on.

One graphical representation of the sequence is the following (the little icon next to the request time shows whether or not the response has been sent from hishel cache, time goes from bottom to top:

image

I was able to "fix" the behaviour that I think is not what we expect by adding the following fast and dirty patch to the async transport:

diff --git a/hishel/_async/_transports.py b/hishel/_async/_transports.py
index 5d01274..82b9183 100644
--- a/hishel/_async/_transports.py
+++ b/hishel/_async/_transports.py
@@ -186,6 +186,15 @@ class AsyncCacheTransport(httpx.AsyncBaseTransport):
                 await revalidation_response.aclose()

                 assert isinstance(final_httpcore_response.stream, tp.AsyncIterable)
+
+                if self._controller.is_cachable(
+                    request=httpcore_request, response=final_httpcore_response
+                ):
+                    await self._storage.store(
+                        key,
+                        response=final_httpcore_response,
+                        request=httpcore_request,
+                    )
                 return await self._create_hishel_response(
                     key=key,
                     response=final_httpcore_response,

With the patch applied, I get the behaviour I'm expecting which is that one request gets out everytime the max-age expired, then it's served from cache for max-age seconds:

image

If you confirm that my understanding of hishel's behaviour is correct and most importantly that the obeserved behaviour is not correct, I can work on a a real patch (sync+async+matching tests).

Thanks a lot.

karpetrosyan commented 4 months ago

Hi! That's a good catch. I think it's not the expected behavior.

The controller has a method called handle_validation_response that looks like this:

    def handle_validation_response(self, old_response: Response, new_response: Response) -> Response:
        """
        Handles incoming validation response.

        This method takes care of what to do with the incoming
        validation response; if it is a 304 response, it updates
        the headers with the new response and returns it.

        This method mirrors the relevant section from RFC 9111,
        see (https://www.rfc-editor.org/rfc/rfc9111.html#name-handling-a-validation-respo).
        """
        if new_response.status == 304:
            headers = get_updated_headers(
                stored_response_headers=old_response.headers,
                new_response_headers=new_response.headers,
            )
            old_response.headers = headers
            return old_response
        else:
            return new_response

It is just preparing the response after getting a new one, and as the RFC suggests, it is updating the headers so you can hit the cache again without revalidation. So, I think I just missed the process of saving the prepared response into storage.

I think the solution is what you provided here, but I guess there is no need to check if the response is cacheable at that point in time.

Here are some references:

Also, I have tried to isolate this case, I think we can use something like this for the testing, though it looks ugly

from httpx import Request, Response

import hishel
from hishel import Controller
from hishel._utils import BaseClock, extract_header_values_decoded

class MockedClock(BaseClock):
    def __init__(self):
        self.mocked_timestamps = [
            1440504000,  # Mon, 25 Aug 2015 12:00:00 GMT
            1440504002,  # Mon, 25 Aug 2015 12:00:02 GMT
        ]

    def now(self) -> int:
        return self.mocked_timestamps.pop(0)

controller = Controller(clock=MockedClock())
storage = hishel.InMemoryStorage()

with hishel.MockTransport() as transport:
    transport.add_responses(
        [
            Response(
                200,
                headers=[(b"Cache-Control", b"max-age=1"), (b"Date", b"Mon, 25 Aug 2015 12:00:00 GMT")],
            ),
            Response(
                304,
                headers=[(b"Cache-Control", b"max-age=1"), (b"Date", b"Mon, 25 Aug 2015 12:00:02 GMT")],
            ),
        ]
    )
    with hishel.CacheTransport(transport=transport, controller=controller, storage=storage) as cache_transport:
        cache_transport.handle_request(Request("GET", "https://example.com/"))
        response = cache_transport.handle_request(Request("GET", "https://example.com/"))

        stored_response = storage.retrieve(response.extensions["cache_metadata"]["cache_key"])

        assert stored_response
        response, request, metadata = stored_response

        assert extract_header_values_decoded(response.headers, b"Date") == ["Mon, 25 Aug 2015 12:00:02 GMT"]
hartym commented 4 months ago

Nice, I'll use your suggestions to prepare a pull request if that's ok for you.

I think the solution is what you provided here, but I guess there is no need to check if the response is cacheable at that point in time.

I'm not sure about that one. What if the newly returned response from the server says that the new response is not cacheable ? I guess in most cases the server should stay consistent with its old opinion, but who knows ? Maybe it now sends a Cache-Control: no-cache (or a more subtle reason) ?

But maybe it's necessary to think of what to do if the new response is not cacheable ? Should the old cache be kept anyway ? When will it "expire" ?

karpetrosyan commented 4 months ago

You're right. If the 304 response contains something like Cache-Control: no-store, the cache entry should be removed. I have tested this behavior in Chromium and Firefox, and it seems they remove the cache entry if the 304 response requires it. I think we should check if the response is cacheable; if it is, we should update it. If not, we should entirely remove the cache entry.

I really appreciate your efforts. Please ping me if you need any help with the PR :pray:

But maybe it's necessary to think of what to do if the new response is not cacheable ? Should the old cache be kept anyway ? When will it "expire" ?

The is_cachable method simply checks if the response is allowed to be copied into the cache. Therefore, if it returns false, it means the copy of the response should not be stored. If it has already been stored, we should remove it.

From the RFC:

The no-store request directive indicates that a cache MUST NOT store any part of either this request or any response to it. This directive applies to both private and shared caches. "MUST NOT store" in this context means that the cache MUST NOT intentionally store the information in non-volatile storage and MUST make a best-effort attempt to remove the information from volatile storage as promptly as possible after forwarding it.

hartym commented 4 months ago

No problem with the effort, the choice of hishel was made for our code because you did package 99% of the effort, doing 0.1% here and there is really not much.

One detail before I get into the code: You say «the 304 response», do we agree that this is only one of the many cases that can happen, not even probably the most frequent ? Or I am missing something ?

To me, it will be the (304) case almost everytime for a (correctly implemented) static assets server (name your poison) but a lot of api servers or other dynamic servers will more often than not send a regular response, let's say a 200 (but can be anything). For example, my test cases with httpbin (golang implementation but not very relevant) will send a 200 with a max-age=x everytime the server is hit, and the revalidation request does not make an exception. Yes it's not modified, but no the server is not able to tell the client about it in a response and yet, the response (if cachable) looks to me like the new legitimate cacheable and to-be-cached version.

And correct me if I'm wrong, all the logic of building the new legitimate response, including the 304 special case that reuse the old cached version with new updated headers is already handled by hishel._controller.Controller.handle_validation_response. So the missing part is the storage.store(...) operation only.

Anyway, I'll start by working on some test cases and open a PR for eventual discussion, the implementation should be pretty obvious once this is ok.

karpetrosyan commented 4 months ago

304 responses are simply one of the two variations of conditional requests. If you look at the request headers, when the client tries to revalidate the response, it sends the request and adds the if-none-match or if-modified-since header, which roughly means...

Give me a normal response if the ETags are not the same, so we use the If-None-Match header; otherwise, give me a 304 response.

if if_none_match_header != actual_hash:
    return normal_response()
else:
    return not_modified_304_response()

As you said, Hishel fully supports the second case, when the server sends the not modified response, and it updates the original response headers as the RFC suggests. So, the missing part is only storage.store(...), yes.

my test cases with httpbin (golang implementation but not very relevant) will send a 200 with a max-age=x everytime the server is hit, and the revalidation request does not make an exception. Yes it's not modified, but no the server is not able to tell the client about it in a response and yet, the response (if cachable) looks to me like the new legitimate cacheable and to-be-cached version.

If the revalidation response returns 200, even when the resource was not modified, that makes the cache useless because you end up downloading the entire content unnecessarily. However, even if the server sends a 200 response, the cache must handle it correctly by assuming that the resource was changed and caching the response according to its headers. So yes, Hishel handles both cases; it just missed the part of saving the updated response into the storage.

hartym commented 4 months ago

Ok, that makes perfect sense.

One detail :

If the revalidation response returns 200, even when the resource was not modified, that makes the cache useless because you end up downloading the entire content unnecessarily. However, even if the server sends a 200 response, the cache must handle it correctly by assuming that the resource was changed and caching the response according to its headers. So yes, Hishel handles both cases; it just missed the part of saving the updated response into the storage.

I don't think it makes the cache useless (or at least, not for all cases) : although it does indeed force the revalidation request to download the entire content unnecessarily, that's only true if a revalidation request is necessary. In case of an expiration based cache (max-age, Expires header ...), I think that no revalidation is necessary until expiration, and thus the cache will be used effectively until then. And my (young) understanding of the hishel codebase (and also practical cases observation) makes me think this is perfectly handled already.

hartym commented 4 months ago

@karpetrosyan I opened early work about that in #240, few questions in the description. I'll update it based on your answers and once we're ok about the way to go, will do it in sync version too.

karpetrosyan commented 4 months ago

Seems like the test that I wrote is not what we actually need; now I am thinking about whether we really have such an issue or not.

The script that I provided for testing actually fails because of incorrect mocked timestamps. In that script, we are not encountering a stale response and we aren't even revalidating it... Therefore, we can simply delete the mocked 304 response and observe the same outcome.

Now, when fixing the script, I can't see any issues on our side. Here is the script that I am using.

from httpx import Request, Response

import hishel
from hishel import Controller
from hishel._utils import BaseClock, extract_header_values_decoded

class MockedClock(BaseClock):
    def __init__(self):
        self.mocked_timestamps = [
            1440504001,  # Mon, 25 Aug 2015 12:00:01 GMT
            1440504011,  # Mon, 25 Aug 2015 12:00:11 GMT
            1440504011,  # Mon, 25 Aug 2015 12:00:11 GMT
            1440504011,  # Mon, 25 Aug 2015 12:00:11 GMT
        ]

    def now(self) -> int:
        return self.mocked_timestamps.pop(0)

controller = Controller(clock=MockedClock())
storage = hishel.InMemoryStorage()

with hishel.MockTransport() as transport:
    transport.add_responses(
        [
            Response(
                200,
                headers=[(b"Cache-Control", b"max-age=1"), (b"Date", b"Mon, 25 Aug 2015 12:00:00 GMT")],
            ),
            Response(
                304,
                headers=[(b"Cache-Control", b"max-age=10"), (b"Date", b"Mon, 25 Aug 2015 12:00:02 GMT")],
            ),
        ]
    )
    with hishel.CacheTransport(transport=transport, controller=controller, storage=storage) as cache_transport:
        cache_transport.handle_request(Request("GET", "https://example.com/"))
        response = cache_transport.handle_request(Request("GET", "https://example.com/"))

        stored_response = storage.retrieve(response.extensions["cache_metadata"]["cache_key"])

        assert stored_response
        response, request, metadata = stored_response

        assert extract_header_values_decoded(response.headers, b"Date") == ["Mon, 25 Aug 2015 12:00:02 GMT"]
        assert extract_header_values_decoded(response.headers, b"Cache-Control") == ["max-age=10"]

        # w/o 304 cache hit
        response = cache_transport.handle_request(Request("GET", "https://example.com/"))
        print(response.extensions["from_cache"])

        # w/o 304 cache hit
        response = cache_transport.handle_request(Request("GET", "https://example.com/"))
        print(response.extensions["from_cache"])

        # w/o 304 cache hit
        response = cache_transport.handle_request(Request("GET", "https://example.com/"))
        print(response.extensions["from_cache"])

All future requests will be revalidated too, making the cache for this cache key completely ignored forever from this point on.

As you can see, I have mocked just two responses, but after the first revalidation, I am using cached responses without revalidating them.

karpetrosyan commented 4 months ago

I think we need an extension like revalidated. It would really help in such cases.

hartym commented 4 months ago

I agree, and this is why the test I provided is not the same as the one you provided (although I used your as a model on how to do the things, and the difference is a bit subtle).

First the MockedClock that use a list of timestamps to pop is quite unuseable for me, because you need to implement it with full awareness of the underlying implementation details (like how many times the controller will call the clock). It's why I used a MockedClock that starts at some time and requires manual incrementation. It allows to ignore whether or not the underlying implementation will "consume" a clock entry or not, making it a reliable testing tool even if the implementation changes later.

The test I wrote does 3 things in order :

I agree that the informations available in extensions has some limitations : in case of a max-age expired response for which the revalidation response is 200, we just get from_cache = False and for the same case with 304, we get from_cache = True (which is only partially true, because the cached response was updated). Having more metadata to understand what happened would be great, but I have no idea of what would be 100% correct.

(small note: I also fixed the timestamps in the test which did not match the iso string in headers and the test I wrote in the PR fails before patch and passes after patch. In fact, I wrote the failing test with the original code before applying the suggested patch, even if the fix was already ready).

hartym commented 4 months ago

Here is a simple failing test case:

from httpx import Request, Response

import hishel
from hishel import Controller
from hishel._utils import BaseClock, extract_header_values_decoded

class MockedClock(BaseClock):
    def __init__(self, mocked_timestamps):
        self.mocked_timestamps = mocked_timestamps

    def now(self) -> int:
        return self.mocked_timestamps.pop(0)

def test_with_200():
    controller = Controller(clock=MockedClock([
        1440504001,  # Mon, 25 Aug 2015 12:00:01 GMT
        1440504001,  # Mon, 25 Aug 2015 12:00:01 GMT
    ]))
    storage = hishel.InMemoryStorage()

    with hishel.MockTransport() as transport:
        transport.add_responses(
            [
                Response(
                    200,
                    headers=[(b"Cache-Control", b"max-age=1"), (b"Date", b"Mon, 25 Aug 2015 12:00:00 GMT")],
                ),
                Response(
                    200,
                    headers=[(b"Cache-Control", b"max-age=10"), (b"Date", b"Mon, 25 Aug 2015 12:00:01 GMT")],
                ),
            ]
        )
        with hishel.CacheTransport(transport=transport, controller=controller, storage=storage) as cache_transport:
            # Cache: miss, http 200, store
            response = cache_transport.handle_request(Request("GET", "https://example.com/"))
            assert not response.extensions["from_cache"]

            # (at 12:00:01) Cache: miss (expired), send revalidation, http 200, store (FAIL: did not store)
            response = cache_transport.handle_request(Request("GET", "https://example.com/"))
            assert not response.extensions["from_cache"]  # sends the new 200 after revalidation

            # (at 12:00:01) Cache: FAILS: should hit but misses, sends one request again (will repeat until end of time)
            response = cache_transport.handle_request(Request("GET", "https://example.com/"))
            assert response.extensions["from_cache"]  # because old cache + not modified validation sends the previous 200

(I think the behaviour is correct for 304s, the problem comes in when the remote server sends new content)

karpetrosyan commented 4 months ago

Oh, that's very useful, thanks. Now we truly know where the problem is and what we want to fix. It seems like 304 responses work because of this line

hartym commented 4 months ago

I don't understand why and how the cache storage entry gets updated (in case of 304), though, but I guess my understanding of this is not very relevant.