open-contracting / kingfisher-process

Stores and pre-processes OCDS data in a SQL database
https://kingfisher-process.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Checker performance options #392

Open jpmckinney opened 1 year ago

jpmckinney commented 1 year ago

One idea is to check the original packages. This would mean using a new check table that links to collection_file (instead of release_check and record_check linking to release and record).

However:

  1. It might be more difficult to analyze errors by OCID (see Error summary section of this notebook).
  2. It might consume too much memory. Some packages are extremely large.

jpmckinney commented 1 year ago

Blocked by https://github.com/open-contracting/lib-cove-ocds/issues/56 re: item 2 above.

jpmckinney commented 1 year ago

> It might consume too much memory. Some packages are extremely large.

Indeed: Colombia files, for example, are a few GBs.

> Blocked by https://github.com/open-contracting/lib-cove-ocds/issues/56 re: item 2 above.

In lib-cove-ocds, there's an option to read the file from disk, in which case ijson can parse it iteratively. (We would need to parse the file twice – once for the package metadata and once to iterate over the releases, like in file_worker.py.)

In kingfisher-process, we'd also have to read the file from disk – not from the DB, as I don't think it's possible to stream jsonb (or bytea) out of PostgreSQL. Edit: Kingfisher Process can read one release at a time from the DB.