Open Gallaecio opened 1 year ago
I got someone else last week reporting the same issue. I could not reproduce it either.
I wonder if it is a simple matter of getting a random empty body response due to an undetected ban. The default cache policy will not ignore an HTTP 200 response even if its body is empty, so it will be cached and fail as reported when trying to use response.text
(which is a separate Scrapy issue, empty responses should be considered text responses).
In those cases, creating a custom cache policy that ignores empty responses should be relatively easy (subclass the DummyPolicy, override should_cache_response
).
One thing that does happen is that responses from the cache do not use the original response class, so you cannot access raw_api_response
from cached responses.
@kmike @wRAR I see 2 possible approaches here:
HTTPCACHE_STORAGE
.The first approach seems the best to me, but I am not 100% sure how to best implement such an API in a way that plays well with the existing storage classes (DBM and file system) in a backward-compatible way.
The second approach would be easier to implement, and cover most practical scenarios, but make corner case scenarios rather difficult (e.g. supporting additional Response classes beyond those from scrapy-zyte-api, or storing the cache somewhere other than the file system).
Any thoughts?
@Gallaecio I'm hitting this exact issue right now, and would love to know if going with option 2 is the preferred option for now. If yes, where to start? i see the raw_api_response is not cached here, so it should be straightforward? https://github.com/scrapy/scrapy/blob/67ab8d4650c1e9212c9508803c7b5265e166cbaa/scrapy/extensions/httpcache.py#L371-L382
Yes, you could implement option 2 on your own, subclass FilesystemCacheStorage
:
store_response
to include the serialization of raw_api_response
.retrieve_response
to build a scrapy_zyte_api.responses.ZyteAPITextResponse
that includes it, instead of a Response
that does not.HTTPCACHE_STORAGE
backend.
I have seen 2 people now having trouble with HTTP cache in combination with scrapy-zyte-api.
They set
HTTPCACHE_ENABLED
toTrue
, and they getNotSupported("Response content isn't text")
.I could not reproduce the issue myself with the tutorial spider, so I am not sure why it happens.
If someone affected by this issue can provide a minimal, reproducible example, that would be great!