open-contracting / ocdskit

A suite of command-line tools for working with OCDS data
https://ocdskit.readthedocs.io
BSD 3-Clause "New" or "Revised" License
17 stars 6 forks source link

tabulate: Documentation #145

Closed duncandewhurst closed 3 years ago

duncandewhurst commented 3 years ago

In the course of preparing the OCDS Kit Learning Lab I authored the following documentation for the tabulate command.

If we can add this to the OCDS Kit docs then I can reference it from the Learning Lab

OCDS Kit's tabulate command can flatten OCDS data and store it in a relational database.

Documentation

Database structure

The tabulate command generates a database structure based on the OCDS schema. The main table is the releases table and separate tables are generated for each array in the schema, for example parties and awards.

By default the latest OCDS schema is used. An alternative, or extended, schema can be specified using the --schema option.

Naming conventions

Table and column names in the database are generated from the object and property names in the JSON data, separated by underscores. For example, data from tender/title is stored in releases.tender_title and data from awards/items/description is stored in award_items.description.

Identifiers

The ocid and release id columns are present in all tables, so that data from the same contracting process and release can be joined. Where a table represents a property of an array in the JSON data, a column with the id of the parent object is also provided. For example, the award_id column is present in the awards_suppliers table so that the data can be joined to the awards table.

Additional fields

Properties in the JSON data which are not specified in the schema are treated as follows:

  • If the property is a scalar value or object it stored as a property of a JSON object in the extras column of the table for the parent object. For example, awards/additionalField would be stored in awards/extras.
  • If the property is an array, it is dropped and OCDS Kit reports a warning. For example, table tender_participationFees does not exist.

To flatten additional fields, use the --schema option to specify an extended schema which includes the fields.

Alternative approaches

OCDS Kingfisher Process provides an alternative approach to storing OCDS data in a database, and includes a pre-processing pipeline with support for validating and compiling data.

Flatten-tool provides an alternative approach to flattening OCDS data and includes support for additional fields and arrays.

jpmckinney commented 3 years ago

Would you like to work up a PR?