seanmonstar / reqwest

An easy and powerful Rust HTTP Client
https://docs.rs/reqwest
Apache License 2.0

Response caching #368

Open Turbo87 opened 6 years ago

Turbo87 commented 6 years ago

For Node.js there is an HTTP client library (https://github.com/zkat/make-fetch-happen) that supports HTTP caching semantics and writes cached responses out to disk by default. It would be wonderful if the same could be supported in reqwest as well. 🙏

DCjanus commented 4 years ago

I really like this idea, but I'm not sure what kind of API should be provided. As far as I know, HTTP cache control is a concern for applications rather than for HTTP clients.

For example: we send a GET request with an If-None-Match header, and the remote responds with 304. What kind of response should reqwest.get("http://example.com/foo").await return?

If it is 200, we lose the information in the 304 response; if it is 304, it would be hard to find the cached resource through reqwest.

DCjanus commented 4 years ago

Checking a cache on disk might be slow in some cases, which means we should turn this feature off by default.

Maybe code should be like this:

let client = reqwest::Client::builder()
     .cache_control(true)
     .build()?;

let response = client.get("https://example.com/foo").send().await?; // not found in cache, so sent without any cache validation headers
println!("response {}", response.status().as_u16()); // output 'response 200'
let response = response.cached().await?; // does nothing but return the original response
println!("response {}", response.status().as_u16()); // output 'response 200'

let response = client.get("https://example.com/foo").send().await?; // a cached entry with a validator exists, so sent with `If-None-Match`
println!("response {}", response.status().as_u16()); // output 'response 304'
let response = response.cached().await?; // load response from cache
println!("response {}", response.status().as_u16()); // output 'response 200'

kevincox commented 4 years ago

I think you are right about the caching. In some cases it is useful to know the original response. It might even make sense to have an API to discover that. However, I think an opt-in cache would be very useful. It could be done in another crate, but you would really want to mirror the whole reqwest API, so it might make more sense to handle it here.

Caching is hard as well. For example, there are use cases for RAM caches, disk caches, and possibly even remote storage caches. Cache expiry policy alone could lead to endless crates just for RAM caches. It probably makes sense to have a caching API built into reqwest. Maybe all backends could be separate crates (although it would be nice to have some "official" crates for common things like RAM and disk).

How about this as a starting point for the API? It is very prescriptive for the cache: the idea is to keep the HTTP logic inside reqwest and push only the storage and cleanup onto the cache implementation.

pub struct CacheKey;
pub struct CacheResponse;
pub struct CacheHeaders;

impl CacheResponse {
  pub fn new(headers: CacheHeaders, response: impl Future<Item = Response>) -> Self;
}

pub trait Cache {
  fn fetch(&mut self, key: CacheKey) -> impl Future<Item = Option<CacheResponse>>;
  fn store(&mut self, key: CacheKey, headers: CacheHeaders, response: Response);
  fn expire(&mut self, key: CacheKey);
}

This API is very flexible for reqwest. It can manage things such as request deduplication, full handling of Cache-Control and other headers, fixing bad server dates, racing the network against the cache, and more. However, the cache knows very little about what it is storing. For example, there should probably be a bunch of helper methods on CacheHeaders for things like the stale date, the expiry date, and possibly others to help the cache make decisions. (For example, a response with no Last-Modified or Etag header and Cache-Control: max-age=60 is useless after that minute is up.)
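
Something like the following is a minimal sketch of the kind of helpers CacheHeaders could expose, assuming it has already parsed Date, Age, Cache-Control, ETag and Last-Modified out of the response; all names and fields here are hypothetical:

use std::time::{Duration, SystemTime};

pub struct CacheHeaders {
    date: SystemTime,          // server Date, possibly corrected for clock skew
    age: Duration,             // Age header, zero if absent
    max_age: Option<Duration>, // from Cache-Control: max-age
    etag: Option<String>,
    last_modified: Option<SystemTime>,
}

impl CacheHeaders {
    /// Instant after which the entry must be revalidated before use.
    pub fn expires_at(&self) -> Option<SystemTime> {
        self.max_age.map(|max_age| self.date + max_age - self.age)
    }

    /// True if the entry can still be served without revalidation.
    pub fn is_fresh(&self, now: SystemTime) -> bool {
        self.expires_at().map_or(false, |t| now < t)
    }

    /// True if the entry can be revalidated at all; with no ETag or
    /// Last-Modified it may as well be evicted once it goes stale.
    pub fn revalidatable(&self) -> bool {
        self.etag.is_some() || self.last_modified.is_some()
    }
}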

Another decision is the CacheResponse type. It may be better as a trait with methods like etag() -> Option<Etag>, so that an implementation could share the backing data between the headers and the response; the current object forces them to be duplicated. However, defining such a trait without forcing the implementation to understand which headers are important and how to parse them seemed very hard. (Example: the expiry time depends on the Date, Age, and Cache-Control or Expires headers.) Putting the important values into a CacheHeaders object seemed like a nice way to let the implementation keep these hotter values separate without having to figure that out itself. (Example: you could store the full response on disk and keep a table of CacheKey -> (CacheHeaders, Path) in RAM.) However, something that will need to be added is a way to serialize and deserialize it. RAM caches may not need this, but it could still be helpful if the byte layout is more compact than the Rust layout, and it will be critical for anything but in-process caches.
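
For reference, the trait alternative might look roughly like this: the cache implementation exposes the validators it kept instead of receiving a pre-parsed CacheHeaders. Everything here is hypothetical, not a concrete proposal:

use std::time::SystemTime;

trait CachedEntry {
    /// The ETag stored with this entry, if the origin sent one.
    fn etag(&self) -> Option<&str>;
    /// The Last-Modified value stored with this entry, if any.
    fn last_modified(&self) -> Option<SystemTime>;
    /// Instant after which the entry must be revalidated.
    fn expires_at(&self) -> Option<SystemTime>;
}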

One other possible concern is that this gives no way for the cache to signal errors. The assumption is that cache failures will be treated as misses or will be dropped instead of stored.

And of course I haven't said anything about the user-facing API in this post.

kevincox commented 4 years ago

Checking a cache on disk might be slow in some cases, which means we should turn this feature off by default.

I think having it off by default (at least until a future major version or similarly large update) makes sense. However, I think it would make sense to have a mode where it is transparent to the caller. This sounds similar to the redirect problem as well. Maybe we would want to provide a helper that exposes the request history, including cache hits and redirects (which can be intermingled, of course).

Or we can skip all of this and say that if you want that info, you had better manage the caches and redirects yourself (with help from reqwest). In that case I can imagine three cache modes:

  1. No caching, like today.
  2. Full automatic caching. You get the "final" response with a synthetic 200.
  3. Cache assistance. reqwest will add caching headers for revalidation and generate synthetic 304s if the item is found in the local cache. It is the user's job to take the 304 and fetch the response from the cache (with a helper like you described).

Mode 3 is very complicated. For example, you would want to support eagerly fetching the item from the cache so that you have the first buffer in RAM before the validation request comes back from the network. In that case you need to know whether the user is going to call .cached() before making the request. So getting full performance out of mode 3 will be very, very difficult. Mode 2 is much easier to add these optimizations to.
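
As a sketch only, the user-facing knob for these modes could be a builder option along these lines; cache_mode() and CacheMode are hypothetical, not existing reqwest API:

pub enum CacheMode {
    /// No caching at all (current behaviour).
    Off,
    /// Fully transparent: reqwest revalidates and returns a synthetic 200
    /// built from the cached body when the origin answers 304.
    Automatic,
    /// Assistance only: reqwest adds the validators and hands the raw 304
    /// back; the caller resolves it against the cache themselves.
    Assist,
}

// let client = reqwest::Client::builder()
//     .cache_mode(CacheMode::Automatic) // hypothetical builder method
//     .build()?;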

dfoxfranke commented 4 years ago

Here's how it seems to me the interface should look. First define a trait for the interface to the backing store:

#[async_trait]
trait HttpCache {
    async fn store(&mut self, res: Response) -> Result<Response>;
    async fn load(&mut self, url: &Url) -> Result<Option<Response>>;
}

Note the slightly funny signature of store(): it takes ownership of the Response, consumes it when it writes the body to its backing store, and then passes back a newly-constructed Response that reads its body from the store rather than from the network.

Provide in-memory-only and filesystem-backed HttpCache implementations. Users for whom this is insufficient can provide their own implementations that use a database or whatnot.
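
To illustrate that ownership flow, here is a rough, fully buffering in-memory implementation of the proposed trait. It assumes the async-trait, bytes, and http crates, uses reqwest::Result for the trait's Result, and relies on reqwest's From<http::Response<_>> conversion to rebuild a Response from buffered parts. Freshness checks, Vary handling, and eviction are all omitted; the trait itself is still a proposal, not existing reqwest API.

use std::collections::HashMap;

use async_trait::async_trait;
use bytes::Bytes;
use reqwest::{header::HeaderMap, Response, StatusCode, Url};

#[async_trait]
trait HttpCache {
    async fn store(&mut self, res: Response) -> reqwest::Result<Response>;
    async fn load(&mut self, url: &Url) -> reqwest::Result<Option<Response>>;
}

#[derive(Default)]
struct MemoryCache {
    entries: HashMap<Url, (StatusCode, HeaderMap, Bytes)>,
}

// Rebuild a reqwest::Response from buffered status, headers and body.
fn rebuild(status: StatusCode, headers: &HeaderMap, body: Bytes) -> Response {
    let mut res = http::Response::new(body);
    *res.status_mut() = status;
    *res.headers_mut() = headers.clone();
    Response::from(res)
}

#[async_trait]
impl HttpCache for MemoryCache {
    async fn store(&mut self, res: Response) -> reqwest::Result<Response> {
        let url = res.url().clone();
        let status = res.status();
        let headers = res.headers().clone();
        let body = res.bytes().await?; // consumes the network response
        self.entries.insert(url, (status, headers.clone(), body.clone()));
        Ok(rebuild(status, &headers, body))
    }

    async fn load(&mut self, url: &Url) -> reqwest::Result<Option<Response>> {
        Ok(self
            .entries
            .get(url)
            .map(|(status, headers, body)| rebuild(*status, headers, body.clone())))
    }
}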

Extend ClientBuilder to allow specifying a cache:

impl ClientBuilder {
    pub fn cache(self, cache: Box<dyn HttpCache>) -> ClientBuilder;
}
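
Hypothetical usage, reusing the MemoryCache sketch from above; cache() is the proposed method, not existing reqwest API:

let client = reqwest::Client::builder()
    .cache(Box::new(MemoryCache::default())) // proposed, not yet real
    .build()?;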

Change the behavior of Client::execute() so that it will first check to see if a fresh, cached response exists, and if so, immediately return it without touching the network. Add a force_execute() method to Client, which has the same signature as execute() but will always go to the network even if the cache is already fresh. Both variants should insert If-None-Match and If-Modified-Since headers into the request as appropriate, and if the response comes back as 304 Not Modified, then transparently merge the new headers returned with the 304 into the original cached response and return the merged response.

impl Client {
    pub fn force_execute(&self, request: Request) 
        -> impl Future<Output = Result<Response, Error>>;
}
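
The 304 merge step described above might look roughly like this at the header level; merge_not_modified is a hypothetical helper, and a real implementation would also need to handle multi-valued headers and pair the merged headers with the cached body:

use reqwest::header::HeaderMap;

// Headers carried by the 304 (a refreshed Date, Cache-Control, ETag, ...)
// replace the stored copies; everything else is kept from the cache.
fn merge_not_modified(cached: &HeaderMap, not_modified: &HeaderMap) -> HeaderMap {
    let mut merged = cached.clone();
    for (name, value) in not_modified {
        // insert() keeps a single value per name, which is good enough
        // for a sketch but loses extra values of multi-valued headers.
        merged.insert(name.clone(), value.clone());
    }
    merged
}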

Note the existence of https://crates.io/crates/http-cache-semantics, which should make life easier when implementing this.

kevincox commented 4 years ago
async fn load(&mut self, url: &Url) -> Result<Option<Response>>;

This isn't good enough. The cache key for HTTP is more than just the URL: it also includes the request headers that are mentioned in Vary. We would want to abstract the key so that the cache stores don't need to know the details. (Although having those details might be helpful for expiry policies.)

Change the behavior of Client::execute() so ...

I think the exact behaviour you have described here won't be suitable for all use cases, but it makes sense as a start, and knobs for the other behaviours can be added later.

I agree with the rest.

dfoxfranke commented 4 years ago

@kevincox I suppose you're right that request headers mentioned in Vary need to be stored somewhere, but I don't think that place is in the cache key. How then would lookups work? If we don't know what was in the Vary header in the cached response, then we don't know what request headers to include in the key. If we don't know what request headers to include in the key, then we can't look up the cached response to get the Vary header. I think the trait actually has to look like this:

#[async_trait]
trait HttpCache {
    async fn store(&mut self, req: &Request, res: Response) -> Result<Response>;
    async fn load(&mut self, url: &Url) -> Result<Option<(Request,Response)>>;
}

I.e., continue using the Url as the entire cache key, but store the request along with the response. Then, a necessary part of the cache validation logic is to verify that, for every request header mentioned in the Vary field of the cached response, the cached request matches the new request.
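
That validation step might look roughly like this, working purely on header maps; vary_matches is a hypothetical helper, not part of any existing API:

use reqwest::header::{HeaderMap, VARY};

// Returns true if the stored entry may be reused for the new request:
// every header named in the stored response's Vary must have the same
// value in the new request as it had in the cached request.
fn vary_matches(cached_res: &HeaderMap, cached_req: &HeaderMap, new_req: &HeaderMap) -> bool {
    for vary in cached_res.get_all(VARY) {
        let Ok(vary) = vary.to_str() else { return false };
        for name in vary.split(',').map(str::trim) {
            if name == "*" {
                // Vary: * means the response can never be satisfied from cache.
                return false;
            }
            if cached_req.get(name) != new_req.get(name) {
                return false;
            }
        }
    }
    true
}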

kevincox commented 4 years ago

Right, I forgot about that dependency. I guess there are two main options:

  1. Provide a method to get the vary header(s) for a URL.
  2. Make load fetch a list of responses.

I think something down the middle is probably best. I have modified my original trait below:

/// Just an opaque wrapper around a URL.
pub struct CacheKey;

/// An opaque struct with info relevant to caching. Ex: Cache-Control, Vary, and Headers mentioned by Vary.
pub struct CacheHeaders;

pub struct CacheResponse;

impl CacheHeaders {
  /// A key that can be used to identify a cache variant. Writes should replace entries with the same variant key.
  pub fn variant_key(&self) -> u64;
}

impl CacheResponse {
  pub fn new(headers: CacheHeaders, response: impl Future<Item = Response>) -> Self;
}

pub trait Cache {
  fn fetch(&mut self, key: CacheKey) -> impl Stream<Item = CacheResponse>;
  async fn store(&mut self, key: CacheKey, headers: CacheHeaders, response: Response);
  async fn expire(&mut self, key: CacheKey);
}

The only real difference (other than fixing the async) is that fetch returns a Stream of responses instead of an Option. This way reqwest can iterate over the hits and select the one with matching Vary headers, if any.
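
For what it's worth, variant_key() could be as simple as hashing the request values of the headers named by Vary, assuming reqwest has already collected and normalized them; this is purely illustrative:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// `vary_pairs` holds (header name from Vary, value from the request),
// already lower-cased and sorted by the caller.
fn variant_key(vary_pairs: &[(String, Option<String>)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    vary_pairs.hash(&mut hasher);
    hasher.finish()
}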

apelisse commented 4 years ago

FYI, I've used httpcache in Golang, for Kubernetes, and the equivalent is definitely missing in the Rust ecosystem.

06chaynes commented 2 years ago

Came across this and thought I'd mention I recently published a caching middleware library for surf and I'm currently looking into getting a version working for reqwest. I use both clients in different projects so I'd love to have a solution for both that can be dropped in as needed. Hope to have some time soon to really get into it.

06chaynes commented 2 years ago

This was really quick and dirty (no one should use this yet) and needs work before publishing, but I did manage to (seemingly) get things working. I don't really know what I'm doing though, just kind of winging it.

https://github.com/06chaynes/reqwest-middleware-cache

Edit: Cleaned things up and published v0.1.0