open-contracting / kingfisher-process

Stores and pre-processes OCDS data in a SQL database
https://kingfisher-process.readthedocs.io
BSD 3-Clause "New" or "Revised" License
2 stars 8 forks source link

Unable to compile with extensions, trying without them (File is not a zip file) #339

Closed sentry-io[bot] closed 2 years ago

sentry-io[bot] commented 3 years ago

This single error consumed our entire Sentry quota. Perhaps it just needs to be "warning" level, not "error" level? I haven't looked at the code (this is @jpmckinney).

Sentry Issue: DATA-REGISTRY-KINGFISHER-PROCESS-1

Unable to compile with extensions, trying without them File is not a zip file
jakubkrafka commented 3 years ago

Sorry for the quota:( This is a quite tricky part, when this

builder = ProfileBuilder(settings.COMPILER_OCDS_VERSION, extensions)
schema = builder.patched_release_schema()
merger = ocdsmerge.Merger(schema)
out = merger.create_compiled_release(releases)

throws out an exception. Frankly - I'm not sure with this code myself:( Definitely changed to warning level,

Unfortunately, there are many other potential quota consumers, namely in pelican-backend (processing of malformed data can crash zillion times for a single dataset easily)

jpmckinney commented 3 years ago

@jakubkrafka Where can I see logs of warnings/errors/crashes? If I can get more detail on an error, I can potentially make the library more robust.

The warning should at minimum repeat the original error and backtrace. It should also include the extensions and ocid, so that it is possible to reproduce the problem.

jakubkrafka commented 3 years ago

Complete logs can be seen in the Sentry directly (breadcrumbs section). There is always logged the ocid and collection when the processing started (as INFO level).

Another way to inspect logs is to check docker container directly, such a "docker logs kingfisher-process_record_compiler_2 --tail 200"

Original error is there "File is not a zip file", however a bit cryptic now. I've added a bit more descriptive text, it will be "Root exception: File is not a zip file" from now on.

Regarding extensions - i'm not sure that we want so much info in logs. It can be huge and repeating quite often. And it's easy to get those with the ocid.

jakubkrafka commented 3 years ago

One additional side note, which popped up in my mind. The log is now changed to warning -> it will not pop up in sentry anymore.

jpmckinney commented 3 years ago

@yolile I didn't have time to investigate this. It's not especially urgent, because (1) the behavior is to merge without retrieving extensions if the extension retrieval raises an error, which is expected to produce the correct result, and (2) the log level is now warning, meaning we can re-enable Sentry without this error/warning consuming our entire quota.

Update: The errors in Sentry have now expired, so you would have to either find the Docker logs (I think they might be gone now too), or we will have to wait until we process more collections, and check the logs for warnings.

My suspicion is that some extensions URLs in packages might not end in / or /extension.json, in which case the ProfileBuilder assumes the file is a "Download URL" (a ZIP file) – which is true when building a profile from GitHub, which has URLs like https://api.github.com/repos/open-contracting-extensions/ocds_bid_extension/zipball/v1.1.5

But, there is probably some publisher whose URLs aren't correct (i.e. ending in /extension.json per OCDS), and it's causing the error. Having said all that, maybe you can find it by querying the helpdesk Kingfisher Process database.

Once we know what these bad URLs look like, we can maybe filter them out in Kingfisher Process (and log a warning that an extension was skipped).

jpmckinney commented 2 years ago

Noting that I restored exception-level logging in 4815913a. There is a TODO in the code once we do reproduce it to use more specific exception classes.

"File is not a zip file" is due to ocdsextensionregistry trying to retrieve a ZIP file of an extension repository on GitHub, based on the URLs in extension_registry, but not receiving a ZIP file. The extension registry has not been changed in any relevant way, so either:

  1. The publisher had a non-registered extension with a bad URL
  2. GitHub was failing intermittently

Note that in 565c7e85 I cache the "merger", so if this error re-occurs, it will not create zillions of Sentry messages (just one per worker).

Anyhow, there is no further action to take, so closing.