Checker performance options

jpmckinney commented 1 year ago

One idea is to check the original packages, instead of individual releases and records. This would mean adding a new check table that links to collection_file (instead of release_check and record_check, which link to release and record).

However:

1. It might be more difficult to analyze errors by OCID (see Error summary section of this notebook).
2. It might consume too much memory. Some packages are extremely large.
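To illustrate the first concern: if checks ran per package, each error would refer to a position inside the package rather than to a release row, so grouping by OCID would need an extra lookup. A minimal sketch, assuming (hypothetically) that each error carries a `path` like `releases/<index>/...`; the function name and the error shape are illustrative, not the actual check output:

```python
from collections import Counter


def errors_by_ocid(package, errors):
    """Count errors per OCID, given package-level error paths.

    Assumption: each error is a dict with a "path" such as
    "releases/1/tender/id". Errors outside the releases array
    are counted under the key None.
    """
    counts = Counter()
    for error in errors:
        parts = error["path"].split("/")
        if len(parts) >= 2 and parts[0] == "releases" and parts[1].isdigit():
            # Map the array index back to the release's OCID.
            release = package["releases"][int(parts[1])]
            ocid = release.get("ocid")
        else:
            # Package-level error: no OCID applies.
            ocid = None
        counts[ocid] += 1
    return counts
```

With per-release checks this lookup isn't needed, since each check row already links to a release.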

jpmckinney commented 1 year ago

Blocked by https://github.com/open-contracting/lib-cove-ocds/issues/56 re: item 2 above.

jpmckinney commented 1 year ago

> It might consume too much memory. Some packages are extremely large.

Indeed: Colombia's files, for example, are a few GB each.

> Blocked by https://github.com/open-contracting/lib-cove-ocds/issues/56 re: item 2 above.

In lib-cove-ocds, there's the option to read the file from disk. In that case, ijson can parse iteratively. (It would need to parse the file twice: once for the package data and once to iterate over the releases, like in file_worker.py.)

~~In kingfisher-process, we'd also have to read the file from disk – not from the DB, as I don't think it's possible to stream jsonb (or bytea) out of PostgreSQL.~~ Edit: Kingfisher Process can read one release at a time from the DB.

open-contracting / kingfisher-process

Checker performance options #392