open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

united_kingdom_fts: incomplete data but no errors in scrapy log #877

Closed duncandewhurst closed 2 years ago

duncandewhurst commented 2 years ago

There seems to be an issue with the UK FTS API. The front-end search reports ~30k notices, but the scrape I ran today returned only ~14k notices. Similarly, collection 2321 from August 2021 has significantly fewer releases than collection 2211 from June 2021:

id    source_id           data_version         cached_releases_count
2517  united_kingdom_fts  2021-12-05 22:23:45  13783
2502  united_kingdom_fts  2021-11-25 15:38:25  29216
2321  uk_fts              2021-08-27 08:46:07  6005
2211  uk_fts              2021-06-28 01:14:45  14659
2099  uk_fts              2021-04-21 05:17:35  8448

I checked the log for the most recent scrape and I don't see any issues:

{'downloader/request_bytes': 168308,
 'downloader/request_count': 293,
 'downloader/request_method_count/GET': 293,
 'downloader/response_bytes': 439419935,
 'downloader/response_count': 293,
 'downloader/response_status_count/200': 293,
 'elapsed_time_seconds': 177.531851,
 'file_count': 293,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 11, 25, 15, 41, 22, 821595),
 'item_scraped_count': 293,
 'log_count/DEBUG': 588,
 'log_count/INFO': 13,
 'log_count/WARNING': 2,
 'memusage/max': 127000576,
 'memusage/startup': 87830528,
 'request_depth_max': 292,
 'response_received_count': 293,
 'scheduler/dequeued': 293,
 'scheduler/dequeued/memory': 293,
 'scheduler/enqueued': 293,
 'scheduler/enqueued/memory': 293,
 'start_time': datetime.datetime(2021, 11, 25, 15, 38, 25, 289744)}

I'll share this feedback with the publisher, but is there anything that Kingfisher Collect can do to identify that there has been a problem and log an error so that analysts know the data is incomplete and can advise the publisher of the nature of the problem?
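One way to surface this kind of problem, sketched here outside Kingfisher Collect itself, is to compare the release count of a new crawl against earlier crawls of the same source and warn on a large drop. The function name `flag_count_drop` and the 0.9 threshold are hypothetical, and the check assumes the publication normally only grows, which may not hold for every publisher.

```python
def flag_count_drop(history, min_ratio=0.9):
    """Compare the newest count with the largest earlier count.

    history: release counts ordered oldest first.
    Returns (is_suspicious, ratio). min_ratio is a hypothetical threshold.
    """
    *earlier, latest = history
    if not earlier:
        return False, 1.0
    baseline = max(earlier)
    ratio = latest / baseline if baseline else 1.0
    return ratio < min_ratio, ratio


# cached_releases_count values from the table above, oldest first.
counts = [8448, 14659, 6005, 29216, 13783]
suspicious, ratio = flag_count_drop(counts)
if suspicious:
    print(f"WARNING: latest crawl has only {ratio:.0%} of the largest earlier count")
```

A check like this cannot say what went wrong, only that the count looks out of line with history, so it would prompt a manual comparison against the front-end search rather than replace it.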

duncandewhurst commented 2 years ago

cc @mrshll1001 @odscrachel for awareness

jpmckinney commented 2 years ago

If I run the following against the data written by Collect on Nov 25, I get 29,216 releases:

cd /home/ocdskfs/scrapyd/data/united_kingdom_fts/20211125_153825
jq '.releases | length' * | paste -sd+ | bc

Collection 2502 also has 29,216 releases. The number of compiled releases from that collection (2503) is 11,446.

On Dec 5, I get 13,783:

cd /home/ocdskfs/scrapyd/data/united_kingdom_fts/20211205_222345
jq '.releases | length' * | paste -sd+ | bc

Collection 2517 has 13,783 releases, and 2518 has 6,644 compiled releases.
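For anyone without jq at hand, the same count can be sketched in Python. This assumes each file in the crawl directory holds a single JSON release package with a top-level "releases" array; the function name `count_releases` and the demo file names are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def count_releases(directory):
    """Sum the lengths of the top-level "releases" arrays of all files in a directory."""
    total = 0
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as f:
            package = json.load(f)
        # Like jq's `length` on null, a missing "releases" key counts as 0.
        total += len(package.get("releases") or [])
    return total


# Demonstration on two small, made-up package files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "page-1.json").write_text(json.dumps({"releases": [{"id": "a"}, {"id": "b"}]}))
    Path(tmp, "page-2.json").write_text(json.dumps({"releases": [{"id": "c"}]}))
    demo_total = count_releases(tmp)

print(demo_total)
```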

Nov 25 made 293 requests: https://collect.kingfisher.open-contracting.org/logs/kingfisher/united_kingdom_fts/b88edfec4e0511ec90ee0c9d92c523cb.log

Dec 5 made 138 requests: https://collect.kingfisher.open-contracting.org/logs/kingfisher/united_kingdom_fts/00e7b824561a11ec90ee0c9d92c523cb.log

The spider just follows a "next" link, one at a time in sequence, until none is available. So, I think there is an issue with the API.
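The pagination pattern described here can be sketched as follows. This is not the spider's actual code: the `links.next` field name, the `crawl` function and the in-memory `PAGES` stand-in for the API are assumptions for illustration. It shows why the crawl looks healthy in the stats: if the API omits the "next" link early, the loop simply ends as if the last page had been reached.

```python
def crawl(fetch, start_url):
    """Yield each page, following links.next one at a time until it is absent."""
    url = start_url
    seen = set()
    while url and url not in seen:  # guard against a "next" link that loops back
        seen.add(url)
        page = fetch(url)
        yield page
        # Assumed field name; a missing link ends the crawl without any error.
        url = page.get("links", {}).get("next")


# A made-up two-page API. Dropping the "next" link from page1 would end the
# crawl after one page, with nothing to distinguish it from a complete run.
PAGES = {
    "page1": {"releases": [{"id": "a"}, {"id": "b"}], "links": {"next": "page2"}},
    "page2": {"releases": [{"id": "c"}]},
}

pages = list(crawl(PAGES.__getitem__, "page1"))
total = sum(len(page["releases"]) for page in pages)
print(len(pages), total)
```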

duncandewhurst commented 2 years ago

> So, I think there is an issue with the API.

Yep, I was hoping there might be a way to flag it in Kingfisher Collect, but I guess not. I've reported the issue to the publisher in CRM-7756.

jpmckinney commented 2 years ago

Okay, closing as we can't do anything on our end.

duncandewhurst commented 2 years ago

The publisher has implemented rate-limiting/throttling:

> When using the API you may be affected by the rate limiting/throttling that we now have in place. When using the ocdsReleasePackages your IP address will be limited to 150 calls per minute and 300 calls per 5 minutes.

I'm not convinced this is the cause of the issues, but please can the spider be updated to respect the limits? I will also ask them to add the limits to the API documentation.
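As a rough sketch of what respecting both limits would mean for a spider that makes one request at a time: the stricter of the two limits sets the minimum spacing between requests. The `min_delay` helper and the 10% safety margin are hypothetical. `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` are standard Scrapy settings, but whether the spider should set them this way is an assumption, not the project's decision.

```python
def min_delay(limits):
    """Smallest even spacing (seconds) between requests that stays within every limit.

    limits: iterable of (max_calls, window_seconds) pairs.
    """
    return max(window / calls for calls, window in limits)


# The publisher's stated limits: 150 calls per minute, 300 calls per 5 minutes.
LIMITS = [(150, 60), (300, 300)]
delay = min_delay(LIMITS)  # 60/150 = 0.4 s and 300/300 = 1.0 s, so 1.0 s governs

SAFETY_MARGIN = 1.1  # hypothetical margin so window boundaries are not hit exactly

# Illustrative Scrapy settings. Randomization is turned off because it would
# otherwise let individual delays fall below the configured value.
custom_settings = {
    "DOWNLOAD_DELAY": delay * SAFETY_MARGIN,
    "RANDOMIZE_DOWNLOAD_DELAY": False,
}
print(delay, custom_settings["DOWNLOAD_DELAY"])
```

The 5-minute limit is the binding one: it averages out to one call per second, which is slower than the 2.5 calls per second that the per-minute limit would allow.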

jpmckinney commented 2 years ago

@duncandewhurst Indeed, this isn't the issue. @yolile can correct me, but what's happening is that it reaches the last page of the results (no more "next" link to follow), and yet there are more results in the web version.

yolile commented 2 years ago

In CRM-7756#note-44 the publisher confirmed a bug in their API, and Rachel later confirmed that the data is now complete, so closing this issue.