os-climate / os_c_data_commons

Repository for Data Commons platform architecture overview, as well as developer and user documentation
Apache License 2.0

OSC-Platform-016: Validation of pipeline architecture with PUDL data #34

Closed MichaelTiemannOSC closed 2 years ago

MichaelTiemannOSC commented 2 years ago

The PUDL project transforms regulatory utilities data (capacity, generation, fuel, emissions, location, ownership, etc) into frictionless and tidy data. PUDL is the principal part (but not fully stand-alone) of this GitHub site: https://github.com/catalyst-cooperative. See new release: https://github.com/catalyst-cooperative/pudl/releases/tag/v0.4.0

PUDL has a layered data architecture: Python data scrapers, Zenodo storage, SQLite databases, Datasette feeding Read the Docs and data dictionaries, Dask/Parquet feeding Jupyter Notebooks, etc.

A snapshot of the PUDL data has been loaded into an AWS Athena database for use by our V1 Data Browser. With proper coordination with their project, we could make the Data Commons Platform a first-class target for their system, which would give us exemplary data for US-based power plants and set a technical standard for European and Asian utilities data projects to follow.

Suggested first step would be for Marcel Hild and @erikerlandson to discuss with PUDL lead Zane Selvans (zane.selvans@catalyst.coop) how we might make this work, with @MichaelTiemannOSC and @toki8 supporting.

caldeirav commented 2 years ago

The first step is to discuss the data integration requirements rather than moving straight to the platform / Operate First. We will need to clarify:

MichaelTiemannOSC commented 2 years ago

For the June demo, we used the PUDL data to demonstrate two common portfolio alignment use cases: exclusion based on fuel type/fuel mix and exclusion based on CO2(e) intensity. In the first case, investors want to avoid holding utilities that burn a certain type of fuel (typically coal or nuclear). In the second case, investors want to limit exposure to high-intensity emitters (Mt CO2(e) per MWh generated). This was a stretch goal from Allianz, which we satisfied. A target has not yet been set for COP26, but it's very easy to hit with the right data (and very easy to do in Jupyter Notebooks).
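
For illustration, a minimal pandas sketch of the two screens (the DataFrame and column names here are hypothetical, not the actual PUDL schema):

import pandas as pd

def exclude_by_fuel(plants: pd.DataFrame, excluded=frozenset({"coal", "nuclear"})) -> pd.DataFrame:
    # Screen 1: drop plants burning any fuel on the exclusion list.
    return plants[~plants["fuel_type_code"].isin(excluded)]

def exclude_by_intensity(plants: pd.DataFrame, max_t_per_mwh: float) -> pd.DataFrame:
    # Screen 2: drop plants whose CO2(e) intensity (tonnes per MWh) exceeds the threshold.
    intensity = plants["co2e_tonnes"] / plants["net_generation_mwh"]
    return plants[intensity <= max_t_per_mwh]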

I collected snapshots of the PUDL data by building a PUDL environment and exporting the data I wanted (390MB) to s3://mtiemann-pudl-landing/pudl/. But from the PUDL GitHub site you can see there are many other ways to connect/federate it (follow "How do I access the data" from this page https://github.com/catalyst-cooperative/pudl to this link: https://sandbox.zenodo.org/record/764696#.YRk6S9NJHVk, which points to 6GB of data). More importantly, I think that when Zane Selvans better understands what we're doing and how it ties to what he's doing, he'll have better answers.

As for the maintenance of the data: as the PUDL people explain, they work principally from original regulatory data that is mostly stable and updated monthly. However, some of that data is retroactively edited (sometimes changing column definitions), so it requires ongoing management, which the PUDL team does as part of their project. That data is also notoriously inconsistent (due to poorly designed input forms, sloppy reporting procedures, etc.), so the PUDL team must clean, estimate, and sometimes impute values, which is also part of what they do. What we then need to do is track which versions we want to publish to the data commons and manage those as downstream of PUDL (which is itself downstream of EPA, EIA, and others).

The Utilities Pilot diagram in the June 15th Demo documentation (https://github.com/os-c/DataCommons/blob/main/docs/Utility%20Portfolio%20Diagram.pdf) shows a fairly minimal requirement from PUDL:

- EIA 860 data for utility capacity, ownership, location, and overall plant-level statistics
- EIA 923 data for specific generators (per-generator fuel types and usage, per-generator MWh of actual generation per month)
- CEMS summary data for emissions (we don't need the 100GB hourly dataset; the monthly summary is fine)
- data from other sources such as GLEIF (for LEI and corporate hierarchy), sample TPT or MiFID portfolios, and basic map data

With this data we can score plants based on fuel mix and/or intensity (sketched below), feeding the exclusion decision analysis.
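
A rough sketch of that scoring step, assuming monthly EIA 923 generation and CEMS emissions have already been loaded into DataFrames (column names are illustrative):

import pandas as pd

def plant_intensity(gen: pd.DataFrame, cems: pd.DataFrame) -> pd.DataFrame:
    # Join per-plant monthly generation (EIA 923) with monthly emissions (CEMS),
    # then aggregate to plant level and compute tonnes CO2 per MWh generated.
    merged = gen.merge(cems, on=["plant_id", "report_month"], how="inner")
    totals = merged.groupby("plant_id")[["net_generation_mwh", "co2_mass_tons"]].sum()
    totals["intensity_t_per_mwh"] = totals["co2_mass_tons"] / totals["net_generation_mwh"]
    return totals.sort_values("intensity_t_per_mwh", ascending=False)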

But of course part of this exercise is simply defining best practices for sourcing tidy data from a model open source project which, itself, takes great pains to collect, correct, and manage regulatory data for downstream use.

MichaelTiemannOSC commented 2 years ago

Yesterday the PUDL team released version 0.4.0, which has been in the works for the past 1.5 years. We should evaluate resetting our baseline to that release before onboarding into the new Data Commons platform: https://github.com/catalyst-cooperative/pudl/releases/tag/v0.4.0

I found many inconsistencies in the earlier PUDL data (which is still very good), and those inconsistencies resulted in gross disparities between plant-level and generator-level generation, among other problems. There were also inconsistencies related to solar and wind-powered generation that greatly undercounted such green generation among the brown, undermining any realistic evaluation of green capacity and generation.

MichaelTiemannOSC commented 2 years ago

This morning I downloaded https://zenodo.org/record/5214231/files/pudl-v0.4.0-2021-08-17.tgz?download=1 (6.37GB) to a local drive. It expands into:

├── LICENSE.txt
├── README.md            <- The top-level README for developers using this project.
├── docker-compose.yml
├── environment.yml
├── pudl-jupyter.tar     <- 3.5GB of data that populates the shared/ directory of the notebook instance
├── notebooks            <- the example notebooks from pudl-examples
│   ├── 01-pudl-database.ipynb
│   ├── 02-pudl-outputs.ipynb
│   ├── 03-pudl-parquet.ipynb
│   ├── dask-worker-space
├── pudl_data            <- local copy of the data from pudl-jupyter.tar
│   ├── data
│   |   ├── eia861 (...)
│   |   ├── ferc714 (...)
│   |   ├── tmp (empty)
│   ├── datapkg (empty)
│   ├── parquet
│   |   ├── epacems (...)
│   ├── settings
│   |   ├── etl_fast.yml
│   |   ├── etl_full.yml
│   |   ├── ferc1_solo_test.yml
│   ├── sqlite
│   |   ├── censusdp1tract.sqlite
│   |   ├── ferc1.sqlite
│   |   ├── pudl.sqlite   <- The meat of the PUDL datasets
├── user_data (empty)

To realize the PUDL data in a set of Jupyter notebooks, after landing the .tgz file and expanding the archive, the following two commands (run from the directory created) suffice:

docker load -i pudl-jupyter.tar
docker-compose up
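
For those who prefer to script the landing step, a minimal Python sketch of the download-and-extract portion (the docker commands above still follow):

import tarfile
import urllib.request

URL = "https://zenodo.org/record/5214231/files/pudl-v0.4.0-2021-08-17.tgz?download=1"
ARCHIVE = "pudl-v0.4.0-2021-08-17.tgz"

urllib.request.urlretrieve(URL, ARCHIVE)   # ~6.37GB download
with tarfile.open(ARCHIVE, "r:gz") as tgz:
    tgz.extractall(".")                    # produces the tree shown above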

To onboard this data we need to translate this to our Data Pipeline Architecture.

I would like to get a credential to create a pipeline instance, plus instructions on how to convert from Docker-world to OpenShift-world (owning the fact that I'm not very Docker-savvy). Is this my starting point: https://docs.openshift.com/container-platform/4.8/openshift_images/create-images.html ?

caldeirav commented 2 years ago

@MichaelTiemannOSC From what I can see, the Zenodo archive provides the data set directly as parquet files. So instead of translating whatever the pipeline does to run on OpenShift, it should be much faster to just load the parquet files into our bucket and load them into Trino to connect them to our data catalog. Then we can build automation to do this as and when new data sets are published.
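
A minimal sketch of that registration step using the trino Python client, assuming a 'hive' catalog pointed at our bucket (the host, user, and abbreviated epacems column list are all illustrative; the real schema is much wider):

import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical endpoint
    port=8080,
    user="pudl-ingest",        # hypothetical user
    catalog="hive",
    schema="pudl",
)
cur = conn.cursor()
# Register the landed parquet files as an external table in the data catalog.
cur.execute("""
    CREATE TABLE IF NOT EXISTS hive.pudl.epacems (
        plant_id_eia BIGINT,
        operating_datetime_utc TIMESTAMP,
        co2_mass_tons DOUBLE
    )
    WITH (
        external_location = 's3a://dev-osc-physical-landing-309737258120/pudl-0.4.0/pudl_data/parquet/epacems/',
        format = 'PARQUET'
    )
""")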

MichaelTiemannOSC commented 2 years ago

I also have the original tgz file in an S3 bucket: s3://dev-osc-physical-landing-309737258120/pudl-v0.4.0-2021-08-17.tgz

But I believe the Parquet files now uploaded are also complete and available here: s3://dev-osc-physical-landing-309737258120/pudl-0.4.0/pudl_data/parquet/

caldeirav commented 2 years ago

@redmikhail as discussed with you, we are still facing issues accessing arn:aws:s3://dev-osc-physical-landing-309737258120/, which Michael / Madhu Bussa are looking at.

MichaelTiemannOSC commented 2 years ago

I believe I have fixed the problem so that a power user can copy from the physical risk landing zone to the Red Hat data world:

aws s3 cp --profile 647521352890_AWSPowerUserAccess s3://dev-osc-physical-landing-309737258120/datacommon-physicalrisk-landing/wri-pp/wri-pp.tsv s3://ocp-odh-os-demo-s3/data/WRI-upload/

Please let me know if you have any further problems.

caldeirav commented 2 years ago

It is now working. Output as follows:

aws s3 ls --profile 647521352890_AWSPowerUserAccess s3://dev-osc-physical-landing-309737258120
                           PRE athena_result/
                           PRE csv/
                           PRE datacommon-physicalrisk-landing/
                           PRE dev-osc-physicalrisk-upload/
                           PRE pudl-0.4.0/
2021-09-05 07:13:23 6367065260 pudl-v0.4.0-2021-08-17.tgz

caldeirav commented 2 years ago

With the latest commit on https://github.com/os-climate/data-platform-demo/blob/master/notebooks/pudl_ingestion_sample.ipynb we have now validated the proposed ingestion approach for PUDL data.

@MichaelTiemannOSC I propose we should now create a dedicated repo for PUDL data ingestion and start to manage issues at a more detailed level there.

caldeirav commented 2 years ago

Closed as discussed in the sprint meeting.