open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

uk_contracts_finder: http 429 errors #612

Closed · duncandewhurst closed this issue 3 years ago

duncandewhurst commented 3 years ago

Recent runs of the uk_contracts_finder spider (collections 1883 and 1889) returned lots of HTTP 429 errors, presumably due to rate limiting.

Can the spider be updated to account for this?

I suggest treating this as a high-priority issue, as there is interest in looking at the UK data in the context of the green paper response.

cc @odscrachel @mrshll1001

jpmckinney commented 3 years ago

If you're in a hurry, you can add -d download_delay=1 at the end of the scheduling command, for a 1-second download delay between requests (or whatever number their API will allow).

jpmckinney commented 3 years ago

Sorry, that should be -d setting=DOWNLOAD_DELAY=1

https://scrapyd.readthedocs.io/en/stable/api.html#schedule-json
https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
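
For anyone calling Scrapyd's schedule.json endpoint directly rather than through the CLI, here is a minimal Python sketch of the same call. The localhost:6800 address and the kingfisher project name are assumptions about the deployment, not values from this thread:

```python
import requests

# Schedule the uk_contracts_finder spider via Scrapyd's schedule.json endpoint,
# passing DOWNLOAD_DELAY as a Scrapy setting (equivalent to -d setting=DOWNLOAD_DELAY=1).
response = requests.post(
    "http://localhost:6800/schedule.json",  # assumed Scrapyd host and port
    data={
        "project": "kingfisher",            # assumed Scrapyd project name
        "spider": "uk_contracts_finder",
        "setting": "DOWNLOAD_DELAY=1",      # one second between requests
    },
)
response.raise_for_status()
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}
```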

duncandewhurst commented 3 years ago

-d setting=DOWNLOAD_DELAY=1 reduced the number of errors from ~1,500 to ~500.

I've set another scrape running with -d setting=DOWNLOAD_DELAY=3 to see if that helps.

The errors seem to come in batches, e.g. pages 106-120, pages 278-283, pages 448-452 etc.

yolile commented 3 years ago

Thanks for the update. Maybe the problem is similar to Portugal's, where we need to wait a few seconds after an error before retrying. I can work on that as a priority if this last try doesn't work either.

jpmckinney commented 3 years ago

Since the error is periodic, I assume they rate-limit over a fairly large window, e.g. a maximum of ### requests per # minutes. Can we ask the publisher what their rate limit is (if it's not already published)?

yolile commented 3 years ago

Well, they actually publish the rate limit: https://www.contractsfinder.service.gov.uk/apidocumentation/Notices/1/GET-Published-Notice-OCDS-Search?_ga=2.259157775.486319066.1613599297-1084433742.1613410239

> When the user has submitted too many requests, no further requests should be made until after the number of seconds specified in the Retry-After header value

yolile commented 3 years ago

@jpmckinney should we use the DelayedRequestMiddleware, with the Retry-After header value as wait_time?

jpmckinney commented 3 years ago

Sounds good to me! It’s too bad they only tell you to wait after the error occurred.
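
A minimal sketch of how the spider side of that could look, assuming DelayedRequestMiddleware defers any request whose request.meta carries a wait_time in seconds (as suggested above for the Portugal case). The spider name, start URL and 60-second fallback are illustrative assumptions, not the repository's actual implementation:

```python
import scrapy


class UKContractsFinderRetrySketch(scrapy.Spider):
    """Re-queue 429 responses after the delay requested in Retry-After."""

    name = 'uk_contracts_finder_retry_sketch'
    # Let 429 responses reach the callback instead of being dropped by HttpErrorMiddleware.
    handle_httpstatus_list = [429]
    # Illustrative start URL for the Contracts Finder OCDS search API.
    start_urls = ['https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?order=asc&page=1']

    def parse(self, response):
        if response.status == 429:
            # Header values are bytes in Scrapy; fall back to 60 seconds if the header
            # is missing. (Retry-After can also be an HTTP date, which this sketch ignores.)
            wait_time = int(response.headers.get('Retry-After', b'60').decode())
            request = response.request.replace(dont_filter=True)  # bypass the duplicate filter
            request.meta['wait_time'] = wait_time  # honoured by DelayedRequestMiddleware (assumed)
            yield request
            return
        # ... normal handling of the OCDS release package would go here ...
```

A real implementation would presumably also cap the number of retries and live in the spiders' shared error handling rather than in each parse method.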

duncandewhurst commented 3 years ago

For posterity, the scrape with -d setting=DOWNLOAD_DELAY=3 completed with only 25 errors. But respecting 'Retry-After' seems like a better solution.