Closed duncandewhurst closed 2 years ago
cc @mrshll1001 @odscrachel for awareness
If I run the following against the data written by Collect on Nov 25, I get 29,216 releases:
cd /home/ocdskfs/scrapyd/data/united_kingdom_fts/20211125_153825
jq '.releases | length' * | paste -sd+ | bc
Collection 2502 also has 29,216 releases. The number of compiled releases from that collection (2503) is 11,446.
On Dec 5, I get 13,783:
cd /home/ocdskfs/scrapyd/data/united_kingdom_fts/20211205_222345
jq '.releases | length' * | paste -sd+ | bc
Collection 2517 has 13,783 releases, and 2518 has 6,644 compiled releases.
Nov 25 made 293 requests: https://collect.kingfisher.open-contracting.org/logs/kingfisher/united_kingdom_fts/b88edfec4e0511ec90ee0c9d92c523cb.log
Dec 5 made 138 requests: https://collect.kingfisher.open-contracting.org/logs/kingfisher/united_kingdom_fts/00e7b824561a11ec90ee0c9d92c523cb.log
The spider just follows a "next" link, one at a time in sequence, until none is available. So, I think there is an issue with the API.
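To illustrate why a crawl like this can end early without logging any error, here is a minimal sketch of "follow the next link until there is none". It is not the actual spider code: the `follow_next_links` helper, the `fetch` callable and the `links.next` page layout are assumptions made for illustration, and the two fake pages stand in for the API.

```python
from typing import Callable, Iterator, Optional


def follow_next_links(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield each page in sequence, stopping when a page has no next link."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield page
        # Assumed layout: {"releases": [...], "links": {"next": "<url>"}}
        url = page.get("links", {}).get("next")


# Hypothetical in-memory "API" with two pages, for illustration only.
fake_pages = {
    "page1": {"releases": [{"id": "a"}, {"id": "b"}], "links": {"next": "page2"}},
    "page2": {"releases": [{"id": "c"}]},  # no "links": the crawl stops here
}

pages = list(follow_next_links("page1", fake_pages.__getitem__))
total_releases = sum(len(page["releases"]) for page in pages)
```

From the client's side, a page without a next link looks the same whether it is the true last page or the API stopped paginating early, so the loop ends normally in both cases.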
> So, I think there is an issue with the API.
Yep, I was hoping there might be a way to flag it in Kingfisher Collect, but I guess not. I've reported the issue to the publisher in CRM-7756.
Okay, closing as we can't do anything on our end.
The publisher has implemented rate-limiting/throttling:
When using the API you may be affected by the rate limiting/throttling that we now have in place. When using the ocdsReleasePackages your IP address will be limited to 150 calls per minute and 300 calls per 5 minutes.
I'm not convinced this is the cause of the issues, but please can the spider be updated to respect the limits? I will also ask them to add the limits to the API documentation.
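For reference, one way a client could respect both limits at once (150 calls per minute and 300 calls per 5 minutes) is a sliding-window limiter. This is only a sketch under assumptions: `SlidingWindowLimiter` is a hypothetical helper, not part of Kingfisher Collect or Scrapy, and it assumes the publisher counts calls over rolling windows, which the quoted message does not specify.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Block before a call whenever any (max_calls, window_seconds) limit is reached."""

    def __init__(self, limits, clock=time.monotonic, sleep=time.sleep):
        self.limits = limits  # e.g. [(150, 60), (300, 300)]
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of past calls, oldest first

    def wait_time(self, now):
        """Return how many seconds to wait before another call is allowed."""
        # Forget calls that are older than the longest window.
        longest = max(window for _, window in self.limits)
        while self.calls and now - self.calls[0] >= longest:
            self.calls.popleft()
        delay = 0.0
        for max_calls, window in self.limits:
            recent = [t for t in self.calls if now - t < window]
            if len(recent) >= max_calls:
                # Wait until the max_calls-th most recent call leaves the window.
                delay = max(delay, recent[-max_calls] + window - now)
        return delay

    def acquire(self):
        """Sleep as long as needed, then record one call."""
        while True:
            delay = self.wait_time(self.clock())
            if delay <= 0:
                break
            self.sleep(delay)
        self.calls.append(self.clock())


# Limits quoted by the publisher: 150 calls/minute and 300 calls/5 minutes.
limiter = SlidingWindowLimiter([(150, 60), (300, 300)])
```

Calling `limiter.acquire()` before each request would be the intended usage. Note that the 5-minute limit is the tighter one over a long crawl: 300 calls per 5 minutes averages out to one call per second.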
@duncandewhurst Indeed, this isn't the issue. @yolile can correct me, but what's happening is that it reaches the last page of the results (no more "next" link to follow), and yet there are more results in the web version.
In CRM-7756#note-44, the publisher confirmed a bug in their API, and Rachel later confirmed that the data is now complete, so I'm closing this issue.
There seems to be an issue with the UK FTS API. The front-end search reports ~30k notices but the scrape I ran today only returned ~14k notices. Similarly, collection 2321 from August 2021 has significantly fewer releases than collection 221 from June 2021:
I checked the log for the most recent scrape and I don't see any issues:
I'll share this feedback with the publisher, but is there anything that Kingfisher Collect can do to identify that there has been a problem and log an error so that analysts know the data is incomplete and can advise the publisher of the nature of the problem?
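As a rough idea of what such a check could look like, here is a sketch that compares the release count of a crawl against a previous crawl and flags a large drop. It is not an existing Kingfisher Collect feature: the function names and the 20% threshold are assumptions for illustration, and the counts are the Nov 25 and Dec 5 totals quoted elsewhere in this thread.

```python
def relative_drop(previous_count: int, current_count: int) -> float:
    """Return the fraction by which the count fell (0.0 if it did not fall)."""
    if previous_count <= 0:
        return 0.0
    return max(0.0, (previous_count - current_count) / previous_count)


def looks_incomplete(previous_count: int, current_count: int, threshold: float = 0.2) -> bool:
    """Flag a crawl whose count fell by more than `threshold` (an assumed cutoff)."""
    return relative_drop(previous_count, current_count) > threshold


# Release counts from the Nov 25 and Dec 5 crawls.
nov_25_releases = 29216
dec_5_releases = 13783

drop = relative_drop(nov_25_releases, dec_5_releases)
flagged = looks_incomplete(nov_25_releases, dec_5_releases)
```

A check like this can only raise a warning for an analyst to review, since a drop in releases could also be legitimate (for example, if the publisher removed data).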