HTTP cache not working in some cases

scrapy-plugins / scrapy-zyte-api

Zyte API integration for Scrapy

BSD 3-Clause "New" or "Revised" License

35 stars 19 forks source link

HTTP cache not working in some cases #125

Open Gallaecio opened 1 year ago

Gallaecio commented 1 year ago

I have seen 2 people now having trouble with HTTP cache in combination with scrapy-zyte-api.

They set HTTPCACHE_ENABLED to True, and they get NotSupported("Response content isn't text").

I could not reproduce the issue myself with the tutorial spider, so I am not sure why it happens.

If someone affected by this issue can provide a minimal, reproducible example, that would be great!

Gallaecio commented 11 months ago

I got someone else last week reporting the same issue. I could not reproduce it either.

I wonder if it is a simple matter of getting a random empty body response due to an undetected ban. The default cache policy will not ignore an HTTP 200 response even if its body is empty, so it will be cached and fail as reported when trying to use response.text (which is a separate Scrapy issue, empty responses should be considered text responses).

In those cases, creating a custom cache policy that ignores empty responses should be relatively easy (subclass the DummyPolicy, override should_cache_response).

Gallaecio commented 9 months ago

One thing that does happen is that responses from the cache do not use the original response class, so you cannot access raw_api_response from cached responses.

Gallaecio commented 8 months ago

@kmike @wRAR I see 2 possible approaches here:

Modify upstream Scrapy to support custom response classes declaring how they are serialized.
Implement a custom file system storage class for scrapy-zyte-api, to be configured as HTTPCACHE_STORAGE.

The first approach seems the best to me, but I am not 100% sure how to best implement such an API in a way that plays well with the existing storage classes (DBM and file system) in a backward-compatible way.

The second approach would be easier to implement, and cover most practical scenarios, but make corner case scenarios rather difficult (e.g. supporting additional Response classes beyond those from scrapy-zyte-api, or storing the cache somewhere other than the file system).

Any thoughts?

jpabbuehl commented 1 month ago

@Gallaecio I'm hitting this exact issue right now, and would love to know if going with option 2 is the preferred option for now. If yes, where to start? i see the raw_api_response is not cached here, so it should be straightforward? https://github.com/scrapy/scrapy/blob/67ab8d4650c1e9212c9508803c7b5265e166cbaa/scrapy/extensions/httpcache.py#L371-L382

Gallaecio commented 1 month ago

Yes, you could implement option 2 on your own, subclass FilesystemCacheStorage:

Extend store_response to include the serialization of raw_api_response.
Override retrieve_response to build a scrapy_zyte_api.responses.ZyteAPITextResponse that includes it, instead of a Response that does not.
Configure your custom class as the HTTPCACHE_STORAGE backend.