Closed duncandewhurst closed 3 years ago
If you're in a hurry, you can add -d download_delay=1 at the end of the scheduling command, for a 1-second download delay between requests (or whatever number their API will allow).
Sorry, that should be -d setting=DOWNLOAD_DELAY=1
https://scrapyd.readthedocs.io/en/stable/api.html#schedule-json https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
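For anyone scheduling from a script rather than with curl, here is a minimal sketch of the form body that the -d flags above produce. Scrapyd's schedule.json takes project, spider and repeated setting fields. The project name "kingfisher" and the helper name build_schedule_payload are placeholders for illustration, not taken from this thread.

```python
from urllib.parse import urlencode


def build_schedule_payload(project, spider, settings):
    """Return the form-encoded body for a POST to Scrapyd's schedule.json."""
    fields = [("project", project), ("spider", spider)]
    # Each Scrapy setting is sent as its own "setting" field, in NAME=value form.
    fields += [("setting", f"{name}={value}") for name, value in settings.items()]
    return urlencode(fields)


# "kingfisher" is a hypothetical project name; substitute the deployed one.
payload = build_schedule_payload(
    "kingfisher", "uk_contracts_finder", {"DOWNLOAD_DELAY": 1}
)
print(payload)
```

The resulting string can be sent as the body of a POST to the schedule.json endpoint of whichever Scrapyd instance runs the spider.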
-d setting=DOWNLOAD_DELAY=1 reduced the number of errors from ~1,500 to ~500.
I've set another scrape running with -d setting=DOWNLOAD_DELAY=3 to see if that helps.
The errors seem to come in batches, e.g. pages 106-120, pages 278-283, pages 448-452 etc.
Thanks for the update. Maybe the problem is similar to Portugal, where we need to wait a few seconds after an error before retrying. I can work on that as a priority if this last try doesn't work either.
Since the error is periodic, I assume they rate limit at a fairly large window, e.g. maximum ### requests per # minutes. Can we ask the publisher what their rate limit is (if it's not already published)?
well, they actually publish the rate limit: https://www.contractsfinder.service.gov.uk/apidocumentation/Notices/1/GET-Published-Notice-OCDS-Search?_ga=2.259157775.486319066.1613599297-1084433742.1613410239
When the user has submitted too many requests, no further requests should be made until after the number of seconds specified in the Retry-After header value
@jpmckinney should we use the DelayedRequestMiddleware, using the Retry-After header value as wait_time?
Sounds good to me! It’s too bad they only tell you to wait after the error occurred.
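A rough sketch of how the wait time could be derived from a 429 response, in case it helps. This is not the spider's actual code: the function name retry_after_seconds and the 60-second fallback are assumptions. The publisher's documentation describes Retry-After as a number of seconds; the HTTP specification also allows an HTTP-date, so the sketch handles both. Scrapy exposes header values as bytes, hence the decode step.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=60.0, now=None):
    """Convert a Retry-After header value into a number of seconds to wait.

    `default` (an assumed fallback) is returned when the header is missing
    or cannot be parsed.
    """
    if value is None:
        return default
    if isinstance(value, bytes):
        value = value.decode("latin-1")
    value = value.strip()
    # Most common form: an integer number of seconds.
    if value.isdigit():
        return float(value)
    # Alternative form: an HTTP-date after which requests may resume.
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past.
    return max(0.0, (when - now).total_seconds())
```

The returned value would then be used as the wait_time for the retried request.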
For posterity, the scrape with -d setting=DOWNLOAD_DELAY=3 completed with only 25 errors. But respecting 'Retry-After' seems like a better solution.
Recent runs of the uk_contracts_finder spider (collections 1883 and 1889) returned lots of HTTP 429 errors, presumably due to rate limiting. Can the spider be updated to account for this?
I suggest this is a high priority issue as there is interest in looking at the UK data in the context of the green paper response.
cc @odscrachel @mrshll1001