open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

uk_contracts_finder: Add records spider #917

Closed duncandewhurst closed 2 years ago

duncandewhurst commented 2 years ago

Please can we add a spider for the Contracts Finder records endpoint? In CRM-7756, the publisher clarified that historic releases, which are not available via the search endpoint, are available via this endpoint.

The endpoint requires an OCID as a parameter. As far as I can see, the only way to get the OCID is via the search endpoint.

Links to bulk download and/or API documentation https://www.contractsfinder.service.gov.uk/apidocumentation/Notices/1/GET-Published-OCDS-Record https://www.contractsfinder.service.gov.uk/apidocumentation/Notices/1/GET-Published-Notice-OCDS-Search

Links to bulk download and/or API endpoints https://www.contractsfinder.service.gov.uk/Published/Notice/records/ocds-b5fd17-f6c03d3b-ba66-4b78-876a-8f3627330d36.json

Link to CRM issue CRM-7756

Priority As soon as possible - the publisher is actively engaging with the helpdesk about issues to do with linking releases.

Data structure The data is structurally valid: https://standard.open-contracting.org/review/data/dbad0031-8fbe-4253-b7a6-ba60e7b944e7

Publication format Record package
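
For context, the collection flow described above is: page through the search endpoint, collect the OCIDs, then request each record individually. A minimal sketch of that flow using requests - the order/size/page query parameters and the URL patterns are taken from the links above, but the shape of the search response is an assumption and should be checked against the API documentation:

# Sketch of the search -> records flow. The "results"/"releases"/"ocid" keys in the
# search response are assumptions; the record response is a record package.
import requests

SEARCH_URL = "https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search"
RECORD_URL = "https://www.contractsfinder.service.gov.uk/Published/Notice/records/{ocid}.json"

def fetch_records(max_pages=1):
    for page in range(1, max_pages + 1):
        search = requests.get(SEARCH_URL, params={"order": "desc", "size": 100, "page": page})
        search.raise_for_status()
        # Assumed response shape: a list of results, each containing releases with an "ocid".
        for result in search.json().get("results", []):
            for release in result.get("releases", []):
                record = requests.get(RECORD_URL.format(ocid=release["ocid"]))
                if record.ok:
                    yield record.json()

for package in fetch_records():
    print(package["records"][0]["ocid"])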

jpmckinney commented 2 years ago

Should we remove the other endpoint from Kingfisher Collect?

Having a spider that is limited in its temporal coverage seems undesirable.

duncandewhurst commented 2 years ago

Yep, I think it's fine to just use the records endpoint since it includes full embedded releases.

yolile commented 2 years ago

Yep, I think it's fine to just use the records endpoint since it includes full embedded releases.

Hmm, they are actually linked releases, as far as I can see in your example https://www.contractsfinder.service.gov.uk/Published/Notice/records/ocds-b5fd17-f6c03d3b-ba66-4b78-876a-8f3627330d36.json. Should we write another spider to get the releases from the links listed in each record?

duncandewhurst commented 2 years ago

Whoops, my mistake. Yes, we'll need to do that.

Edit: For my own sanity, I double-checked that only the latest release for each ocid is available via https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search despite the individual releases being available via https://www.contractsfinder.service.gov.uk/Published/Notice/releases/{{NoticeId}}.json

yolile commented 2 years ago

@duncandewhurst the new spiders (united_kingdom_contracts_finder_releases and united_kingdom_contracts_finder_records) are deployed now

duncandewhurst commented 2 years ago

Thanks! I've set them both running.

duncandewhurst commented 2 years ago

@yolile @Ravf95 - united_kingdom_contracts_finder_releases completed successfully (Job 2ef35b40b93b11ec88920c9d92c523cb, collection 2620), but it only returned 133k releases, whilst the Contracts Finder front-end reports 330k notices.

There are no errors or warnings in collection_file:

SELECT
    *
FROM
    collection_file
WHERE
    collection_id = 2620
    AND (errors IS NOT NULL
        OR warnings IS NOT NULL);

However, looking at the log, I see downloader/response_status_count/403: 232,519.

From the last few messages in the log, it looks like it's the record endpoint that is returning 403 errors, so the scraper never gets to the stage of requesting the actual releases - I guess that's why there is nothing in collection_file.

I spot-checked the URL of the last request that resulted in a 403 error, and I'm able to open it successfully in my browser: https://www.contractsfinder.service.gov.uk/Published/OCDS/Record/ocds-b5fd17-2e8509b7-adc8-11e6-9901-0019b9f3037b

According to the docs, the scraper should wait 5 minutes before retrying when a 403 error is returned.
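
For reference, a quick way to tally status codes without re-opening the stats is to count the "Crawled (NNN)" lines in the log - a rough sketch (the log file path is a placeholder):

# Count "Crawled (NNN)" lines in a Scrapy log to tally response status codes,
# as a cross-check against the downloader/response_status_count stats.
import re
from collections import Counter

STATUS = re.compile(r"Crawled \((\d{3})\)")

def count_statuses(log_path):
    counts = Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            match = STATUS.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

print(count_statuses("2ef35b40b93b11ec88920c9d92c523cb.log"))  # placeholder path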

yolile commented 2 years ago

@duncandewhurst we used a response header to set the number of seconds to wait. Apparently, they are not sending that header, so I've changed the code to use a fixed 300 seconds (5 minutes). I've deployed the change and started a new crawl.
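
Roughly, the fallback behaves like this - not the actual spider code, and the header name (Retry-After) is an assumption, since all we know is that whatever header we read is not being sent:

# Sketch of the fallback: use the delay from the response headers if the server
# sends one (header name assumed to be Retry-After), else wait a fixed 300 seconds.
def get_wait_time(response, default=300):
    """Seconds to wait before retrying, given a scrapy.http.Response."""
    value = response.headers.get("Retry-After")  # header values are bytes, or None
    if value is not None:
        try:
            return int(value.decode())
        except ValueError:
            pass
    return default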

duncandewhurst commented 2 years ago

Thanks, @yolile. I will ask them to set the header.

Is it possible for Kingfisher Collect to report errors for 'intermediary' URLs to the Kingfisher Process database? As it is, I could only find the errors by looking at the log file, so we would need to add a step for that to the analyst notebooks in order to make sure collections are complete.

jpmckinney commented 2 years ago

I think it is fine to have to review the log file. I had started work on a simple command to report any issues in the log file (https://github.com/open-contracting/scrapy-log-analyzer), but it's a work-in-progress. It is needed for the data registry (to know that a collection was successful). This would just mean running one command and providing the file path or URL of the log file - this doesn't seem a great burden on analysts, and it can be done in a notebook.

Edit: This maintains a good separation of concerns – with Collect being all about collecting data, and Process being all about processing data – rather than having metadata about errors in collection being passed along through later stages that don't really care about or have dedicated logic about collection.
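
For the notebook step, the check could look roughly like this - assuming ScrapyLogFile is constructed from a path to a downloaded copy of the log file (see the linked API docs for the exact interface):

# Sketch of a notebook cell using scrapy-log-analyzer to check a crawl log.
# The constructor argument is assumed to be a local file path.
from scrapyloganalyzer import ScrapyLogFile

log = ScrapyLogFile("united_kingdom_contracts_finder_releases.log")  # hypothetical filename
print(log.error_rate)  # per the API docs linked above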

duncandewhurst commented 2 years ago

Okeydokey, that sounds like a solution. Is the Kingfisher Collect job ID for a collection recorded anywhere in the Kingfisher Process database? That way we can construct the URL for the log file so that the analyst doesn't need to open and log in to https://collect.kingfisher.open-contracting.org/ in a browser.
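
For illustration, with the job ID in hand the log URL could be built following the usual Scrapyd layout of /logs/<project>/<spider>/<job_id>.log - the "kingfisher" project name here is a guess and would need to match the deployment:

# Build a log file URL from a spider name and job ID, assuming the standard
# Scrapyd /logs/<project>/<spider>/<job_id>.log layout. The project name is a guess.
def log_url(spider, job_id, project="kingfisher"):
    return (
        "https://collect.kingfisher.open-contracting.org/"
        f"logs/{project}/{spider}/{job_id}.log"
    )

print(log_url("united_kingdom_contracts_finder_releases", "2ef35b40b93b11ec88920c9d92c523cb"))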

jpmckinney commented 2 years ago

It's recorded in the new version of Process, which hasn't been deployed yet.

jpmckinney commented 2 years ago

@duncandewhurst Can you clarify which requests you mean by "intermediary"? The uk_contracts_finder spiders will continuously retry URLs that fail (there's actually no logic for accepting a permanent failure).

Edit: Actually, I'm a bit confused how the spider ever terminates. It seems to retry the same URL only once according to the log, but reading the code it seems like it should retry again.

yolile commented 2 years ago

Hmm, I'm looking into this more closely to understand what is happening. I see that the records crawl is still running (for 8 days now). However, the new releases crawl that I started stopped after around 4 hours, with 403 errors from the beginning. I started the releases crawl locally, and I still have no 403 errors after about 3k requests and 30 minutes. So I'm not sure how their request limit works. Ah, maybe it's because we ran the records crawl simultaneously with the releases crawl, and that causes some issues?

duncandewhurst commented 2 years ago

@duncandewhurst Can you clarify which requests you mean by "intermediary"? The uk_contracts_finder spiders will continuously retry URLs that fail (there's actually no logic for accepting a permanent failure).

I mean any URLs that the spider crawls before the eventual URLs that return OCDS data. Like the examples described in https://scrapy-log-analyzer.readthedocs.io/en/latest/api/index.html#scrapyloganalyzer.ScrapyLogFile.error_rate

Ah, maybe it's because we ran the records crawl simultaneously with the releases crawl, and that causes some issues?

Perhaps. Can we stop the long-running records spider? Getting a complete releases dataset is the immediate need.

yolile commented 2 years ago

Hmm, looking at the ongoing records crawl, I see that we have some 404 errors, for example:

DEBUG: Crawled (404) <GET https://www.contractsfinder.service.gov.uk/Published/OCDS/Record/ocds-b5fd17-a6ae33ad-85d8-41da-89f0-3c7fc73d9b56> 
(referer: https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?order=desc&size=100&page=1509)

And when I go to https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?order=desc&size=100&page=1509, indeed I cannot find the ocds-b5fd17-a6ae33ad-85d8-41da-89f0-3c7fc73d9b56 ocid.

Same with others, for example:

DEBUG: Crawled (404) <GET https://www.contractsfinder.service.gov.uk/Published/OCDS/Record/ocds-b5fd17-b3b443f3-97ab-4604-bfbc-052aae617720> 
(referer: https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?order=desc&size=100&page=1393)

Do we know how they are generating their ocids? In any case, yes, I will stop the records spider, fix the code to retry only 403 errors rather than all errors, and see if we have more luck. I can also change the spider to make one request at a time instead of multiple requests in parallel; that will make the spider slower, but maybe we will have fewer 403 errors.
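
Roughly what those two changes would look like in a Scrapy spider - this is a sketch, not the actual kingfisher-collect code, and the class name and settings are illustrative:

# Sketch: one request at a time, retry only 403 responses, treat 404 as final.
import scrapy

class UnitedKingdomRecordsSketch(scrapy.Spider):
    name = "united_kingdom_contracts_finder_records_sketch"
    start_urls = ["https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?order=desc&size=100&page=1"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 1,  # one request at a time
        "HTTPERROR_ALLOW_ALL": True,  # let 403/404 responses reach the callback
    }

    def parse(self, response):
        if response.status == 403:
            # Re-queue the same URL; the 5-minute wait is omitted from this sketch.
            yield response.request.replace(dont_filter=True)
        elif response.status == 404:
            # Do not retry 404s forever.
            self.logger.warning("Giving up on %s", response.url)
        else:
            yield {"url": response.url, "data": response.json()}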

duncandewhurst commented 2 years ago

Thank you!

And when I go to https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?order=desc&size=100&page=1509, indeed I cannot find the ocds-b5fd17-a6ae33ad-85d8-41da-89f0-3c7fc73d9b56 ocid.

Could that be because it's no longer on that page of the results?

Do we know how they are generating their ocids?

I don't know, but once the updated release spider completes, I'll share the issues with the publisher and ask them how the OCIDs are generated.

yolile commented 2 years ago

Could that be because it's no longer on that page of the results?

Yes, it could be, but it is weird to get a 404 error when requesting the record for an ocid that was listed in the search endpoint results at some point.

yolile commented 2 years ago

I've deployed the changes and started a new crawl.

jpmckinney commented 2 years ago

I mean any URLs that the spider crawls before the eventual URLs that return OCDS data. Like the examples described in https://scrapy-log-analyzer.readthedocs.io/en/latest/api/index.html#scrapyloganalyzer.ScrapyLogFile.error_rate

Kingfisher Collect (unless there's a bug in a spider) always yields a FileError item if any URL ultimately fails (retries don't yield, but intermediary URLs do). Kingfisher Process stores FileErrors in the DB. So, they should have appeared in the DB.

That said, having collection errors in the DB is somewhat duplicative of the log file, so perhaps we should just remove that logic, as it simply adds one more potential point of failure.

jpmckinney commented 2 years ago

Edit: Actually, I'm a bit confused how the spider ever terminates. It seems to retry the same URL only once according to the log, but reading the code it seems like it should retry again.

The spider that finished might have been running an older version of the code. The latest records job 346b1b58b93b11ec88920c9d92c523cb was indeed never going to end, requesting the same 404 URLs hundreds of thousands of times.

I think all spiders that retry URLs should have max_attempts or max_retries logic. This is already the case for portugal_base, paraguay_dncp_base, paraguay_hacienda (I checked all spiders that set dont_filter). We should add this for the UK.
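
A sketch of that max_attempts logic, carrying an attempt counter in request.meta - the names are illustrative, not the ones used by the existing spiders:

# Return a retry of `request`, or None once it has been attempted max_attempts times.
def build_retry(request, max_attempts=5):
    attempts = request.meta.get("attempts", 1)
    if attempts >= max_attempts:
        return None
    return request.replace(
        dont_filter=True,
        meta={**request.meta, "attempts": attempts + 1},
    )

A spider callback could then yield the result when it is not None, and record a failure otherwise.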

jpmckinney commented 2 years ago

We can maybe try using https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.retry.get_retry_request, a new feature in Scrapy that handles max_retry_times logic, so that we don't need to write it ourselves.

Edit: Nevermind, that's only if we use a spider (downloader) middleware, and it doesn't have the concept of wait_time.

yolile commented 2 years ago

We should add this for the UK.

True.

I also set the concurrency limit to 1 for the UK spiders in my last commit, and the currently running crawl has no 403 errors so far, but I'm not sure how long it will take to complete now. On the other hand, if we make concurrent requests, we get 403 errors more frequently and then have to wait 5 minutes for each failed request, so I'm not sure whether we want concurrent requests for the UK or not.

jpmckinney commented 2 years ago

Let's leave it at 1, since they seem to have aggressive throttling, but no documentation of what's allowed 🤷‍♂️

yolile commented 2 years ago

I think all spiders that retry URLs should have max_attempts or max_retries logic

Included as part of https://github.com/open-contracting/kingfisher-collect/pull/935/commits/835b788dc2b596e6402ce76b04928bf13cd6f54a, so I think we can close this issue now.