os-climate / os_c_data_commons

Repository for Data Commons platform architecture overview, as well as developer and user documentation

OSC-Platform-024: Metadata about databases and tables #48

Open caldeirav opened 2 years ago

caldeirav commented 2 years ago

As a data scientist, when creating or changing a schema for ingesting or processing data, I want to be able to capture metadata about the database / table / columns. The metadata could include name, description, and units, as well as associated documentation for the structure, and should support tying the metadata to front-end visualisation.

Reference from @MichaelTiemannOSC: https://docs.datasette.io/en/stable/metadata.html
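
For illustration, a minimal sketch of what Datasette-style metadata (per the link above) could look like for this use case; the database, table, and column names here are invented, not taken from the actual platform:

```python
# Hedged example: Datasette-style metadata capturing name, description, and
# units at the database / table / column level, written out as metadata.json.
import json

metadata = {
    "title": "OS-Climate Data Commons",
    "license": "Apache-2.0",
    "databases": {
        "osc_datacommons_dev": {
            "tables": {
                "plant": {
                    "description": "One row per power plant",
                    "columns": {
                        "capacity_mw": "Nameplate capacity in megawatts (MW)",
                        "primary_fuel": "Primary fuel used for generation",
                    },
                },
            },
        },
    },
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```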

MichaelTiemannOSC commented 2 years ago

Two other (meta)data points:

PUDL metadata for ingest: https://github.com/catalyst-cooperative/pudl/issues/1180 and https://github.com/catalyst-cooperative/pudl/pull/806

PUDL metadata for output: https://github.com/catalyst-cooperative/pudl/issues/1198

MichaelTiemannOSC commented 2 years ago

Also this background material shared by LinuxFoundation: https://egeria.odpi.org/

MichaelTiemannOSC commented 2 years ago

For the purposes of discussion, let's consider the metadata problem for two cases: the WRI PP data and the Microsoft Corp ESG statistics report (hand-curated by Tiemann from https://aka.ms/MSFTsustainabilityreport2020#page=73).

In the case of WRI data (which starts life as a single table), it falls into three principal downstream tables: Plants (which has lots of statistics such as plant name, lat/lon, owner name, fuel type, capacity, data source, etc.), Generation GWh (which is just a unique ID, year, and GWh generated), and Estimated GWh (which is also just a unique ID, year, and estimated GWh generated). There are OCF factors for power generation (likely denominated in MWh or GJ). But how, really, do we want to expose and discover names that users can browse? How should we insert sector/industry information for a dataset that is purely within ISIC code 351 (which covers 3501 (understandably) and 4911 (paradoxically))? In any case, we need a good answer concerning the metadata required to land the WRI dataset.
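
As a rough illustration of that split, assuming the column naming of the public GPPD CSV (gppd_idnr, generation_gwh_YYYY, estimated_generation_gwh_YYYY); this is a sketch, not the actual ingestion code:

```python
# Hedged sketch: split the single GPPD table into the three downstream tables.
import pandas as pd

gppd = pd.read_csv("global_power_plant_database.csv")

# Plants: one row of descriptive attributes per plant.
plants = gppd[["gppd_idnr", "name", "country", "latitude", "longitude",
               "owner", "primary_fuel", "capacity_mw", "source"]]

def melt_years(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Unpivot year-suffixed columns into (gppd_idnr, year, gwh) rows."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    long = df.melt(id_vars=["gppd_idnr"], value_vars=cols,
                   var_name="year", value_name="gwh")
    long["year"] = long["year"].str[len(prefix):].astype(int)
    return long.dropna(subset=["gwh"])

generation_gwh = melt_years(gppd, "generation_gwh_")           # reported
estimated_gwh = melt_years(gppd, "estimated_generation_gwh_")  # estimated
```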

In the case of Microsoft data ...will resume editing...

caldeirav commented 2 years ago

Based on our email thread discussing WRI GPPD metadata ingestion:

In the case of WRI GPPD, I believe most of this information is contained in the README.txt file that comes with the data, though not in a form that can be automatically ingested (the field-definition portion, however, should not be too difficult to extract and ingest).

At this stage I can see broadly two potential approaches to handle this:

@MichaelTiemannOSC please confirm my understanding of the metadata requirements above, as well as the source.

toki8 commented 2 years ago

@MichaelTiemannOSC referred to this example: https://github.com/catalyst-cooperative/pudl/pull/806

MichaelTiemannOSC commented 2 years ago

Link to LF metadata project: https://lfaidata.foundation/projects/egeria/

MichaelTiemannOSC commented 2 years ago

@caldeirav With respect to the WRI GPPD data, you are correct about the primary metadata source (README.txt). And of course, when certain metadata terms are not completely self-contained, such as the term TWh, additional dictionaries are required (to make the data comparable to GWh and MWh).
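
A tiny example of the kind of unit dictionary meant here: conversion factors to a common base unit so that TWh, GWh, and MWh figures become comparable (names and the choice of MWh as base are illustrative):

```python
# Hedged sketch: normalize energy quantities to MWh for comparability.
TO_MWH = {
    "MWh": 1.0,
    "GWh": 1_000.0,
    "TWh": 1_000_000.0,
    "GJ": 1.0 / 3.6,  # 1 MWh == 3.6 GJ
}

def as_mwh(value: float, unit: str) -> float:
    """Normalize an energy quantity to MWh."""
    return value * TO_MWH[unit]

assert as_mwh(1.5, "TWh") == 1_500_000.0
```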

Most of the publicly sourced datasets, such as RMI, PUDL, etc., have well-defined data dictionaries that similarly provide table-level and field-level metadata, as well as catalog information (tables of abbreviation keys, for example). In the case of RMI, there is a data dictionary in both PDF and XLSX format.

Most of the corporate reports are unstructured, and hence data must be discovered and metadata derived.

MichaelTiemannOSC commented 2 years ago

Metadata Store consists of 3 tables (sketched in code after the lists below):

  1. Schema-level metadata
    1. Schema name
    2. Title, Abstract, etc.
  2. Table-level metadata
    1. Table name (plant, annual_gwh, estimated_gwh)
    2. Parent schema
    3. Table source and processing pipeline
  3. Field-level metadata
    1. Field name
    2. Parent table
    3. Field type (physical -- integer, float, varchar, etc)
    4. Field type (dimensional -- MWh, MtCO2(e), etc)

Need to validate against YAML files in ASDI and other open data sources (should be easier than ingesting READMEs)

Ontology Store consists of various taxonomies:

  1. ESG dimensions
  2. ISO-3166 Countries
  3. ISIC Sectors
  4. etc.
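
A minimal sketch of what those three metadata tables might look like as Trino-style DDL issued from Python; the `trino` client connection details and the catalog/schema/table/column names are all assumptions here, anticipating the naming discussed later in this thread:

```python
# Hedged sketch: create the three-level metadata store as Trino tables.
# Connection details and all names are placeholders, not the actual pipeline.
import trino

conn = trino.dbapi.connect(
    host="trino", port=8080, user="admin", catalog="osc_datacommons_dev"
)
cur = conn.cursor()

ddl_statements = [
    # 1. Schema-level metadata
    """CREATE TABLE IF NOT EXISTS metastore.meta_schema (
           schema_name varchar, title varchar, abstract varchar)""",
    # 2. Table-level metadata
    """CREATE TABLE IF NOT EXISTS metastore.meta_table (
           table_name varchar, schema_name varchar,
           source varchar, pipeline varchar)""",
    # 3. Field-level metadata (physical and dimensional types)
    """CREATE TABLE IF NOT EXISTS metastore.meta_field (
           field_name varchar, table_name varchar,
           physical_type varchar, dimensional_type varchar)""",
]
for ddl in ddl_statements:
    cur.execute(ddl)
```
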
MichaelTiemannOSC commented 2 years ago

I just pushed changes that create the metadata contemplated above: https://github.com/os-climate/wri-gppd-ingestion-pipeline

As documented in the commit and in the code, the next step is to sort out the storage location of the metadata store, and whether writing metadata is really part of the ingest pipeline (allowing anybody who can ingest to write metadata) or part of some other role that needs to be executed separately. Hopefully we can unblock those items so that further metadata prototyping is possible.

@caldeirav @erikerlandson

caldeirav commented 2 years ago

Based on what we discussed during the call my thoughts would be:

  • The metastore should be on the same bucket as the data ingested / served, so for development data it will be on ocp-odh-os-demo-s3.

  • As for the model, the metastore will be under the same catalogue osc_datacommons_dev, the schema is metastore, the table names could be meta_schema, meta_table, meta_field.

  • As for creation process for the structure of the metastore, it should definitely not be in any given processing pipeline. I guess we can create a repository for the metastore component, starting with a simple pipeline to re-create the store structure, and over time we can build more advanced capabilities as required.

  • The ingestion pipeline should have to deal with only inserting the information (ideally only with INSERT and versioning but right now this is not useful until we have pipeline versioning implemented).

caldeirav commented 2 years ago

@MichaelTiemannOSC I am starting to have a look today at visualization integration (with Superset), as I think it is worth looking at how the metadata is used as we finalize the structure and management process.

MichaelTiemannOSC commented 2 years ago

Based on what we discussed during the call my thoughts would be:

  • The metastore should be on the same bucket as the data ingested / served, so for development data it will be on ocp-odh-os-demo-s3.

OK

  • As for the model, the metastore will be under the same catalogue osc_datacommons_dev, the schema is metastore, the table names could be meta_schema, meta_table, meta_field.

OK

  • As for creation process for the structure of the metastore, it should definitely not be in any given processing pipeline. I guess we can create a repository for the metastore component, starting with a simple pipeline to re-create the store structure, and over time we can build more advanced capabilities as required.

I said as much in the comments in the code.

  • The ingestion pipeline should have to deal with only inserting the information (ideally only with INSERT and versioning but right now this is not useful until we have pipeline versioning implemented).

Taking the last two bullets together, we need some sort of mechanism (triggers, events, some kind of finalization routine) to get the metadata action to happen automatically after the ingestion takes place. And we need some way to know whether the ingestion action is adding or replacing existing table information.

I'll start making changes now, but will need help working out how ingestion causes the metadata to be loaded. I'm going to assume that creating metadata dataframes (as this example does) provides all the necessary custom information, which can then be loaded via a completely generic process that every ingestion pipeline can feed into.
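
A sketch of that generic process under those assumptions: each pipeline builds small metadata DataFrames, and one shared helper appends them to the metastore tables. The Trino SQLAlchemy URL, the function name, and the columns are all illustrative, not the pipeline's actual API:

```python
# Hedged sketch: one generic metadata loader that every pipeline can feed.
# Assumes the Trino SQLAlchemy dialect is installed; all names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

def publish_metadata(table_meta: pd.DataFrame, field_meta: pd.DataFrame) -> None:
    """Append pipeline-supplied metadata rows to the shared metastore."""
    engine = create_engine("trino://admin@trino:8080/osc_datacommons_dev")
    table_meta.to_sql("meta_table", engine, schema="metastore",
                      if_exists="append", index=False)
    field_meta.to_sql("meta_field", engine, schema="metastore",
                      if_exists="append", index=False)

table_meta = pd.DataFrame([{
    "table_name": "plant", "schema_name": "wri_gppd",
    "source": "WRI GPPD README.txt",
    "pipeline": "wri-gppd-ingestion-pipeline",
}])
field_meta = pd.DataFrame([{
    "field_name": "capacity_mw", "table_name": "plant",
    "physical_type": "double", "dimensional_type": "MW",
}])
publish_metadata(table_meta, field_meta)
```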

MichaelTiemannOSC commented 2 years ago

I've now checked in changes that should demonstrate all of the above.

MichaelTiemannOSC commented 2 years ago

I have pushed a new branch adapting the pyarrow implementation prototyped by Vincent: https://github.com/os-climate/wri-gppd-ingestion-pipeline/tree/metadata-v1

Please comment.

caldeirav commented 2 years ago

Looks fine to me, although I'd probably consider having the fields in the same metadata content as the table metadata (since it is JSON, we can just put them under a fields element).
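
For example, the combined table-plus-fields document could look like this (all names and values illustrative):

```python
# Hedged sketch: field metadata nested under a "fields" element of the
# table metadata, rather than kept as a separate document.
table_metadata = {
    "name": "plant",
    "schema": "wri_gppd",
    "description": "One row per power plant",
    "fields": [
        {"name": "gppd_idnr", "type": "varchar", "description": "Unique plant ID"},
        {"name": "capacity_mw", "type": "double", "unit": "MW"},
    ],
}
```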

MichaelTiemannOSC commented 2 years ago

I have uploaded the latest version of PUDL to the physical landing bucket: s3://redhat-osc-physical-landing-647521352890/PUDL/pudl-v0.5.0-2021-11-14.tgz

This new version has a substantially rewritten metadata system that represents about a year's worth of development work.

caldeirav commented 2 years ago

Should we have a session with the PUDL team to understand their work on metadata? It should be good input for us.

HeatherAck commented 2 years ago

+1 on PUDL session - @MichaelTiemannOSC if you have a contact name, I can work with Ava to get something on the calendar.

zaneselvans commented 2 years ago

The DB structure has changed slightly: there are many more natural primary key / foreign key relationships now, and more constraints enforced on the codes in various columns. But beyond that, I think most of the metadata changes affect the data production side of things. Hopefully in the next release we will have used the new system to provide richer descriptions at the table and column level, with those annotations exportable in machine-readable formats, and will also (finally) get the eia861 and ferc714 data into the database for real.