singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Update PUDL Dependencies #310

Closed grgmiller closed 6 months ago

grgmiller commented 7 months ago

This issue replaces https://github.com/singularity-energy/open-grid-emissions/issues/284, which has become out of date since it was written in February.

The OGE pipeline depends on data from PUDL, and it currently also uses some of the code from the pudl.analysis module. However, since the last OGE update in 2022, several big changes have been made or are underway in PUDL that affect our dependency on the project:

  1. Previously, we could only access PUDL data by downloading the data that Catalyst posted to Zenodo. However, Catalyst is now publishing nightly builds of the data on AWS, which are based on the most up-to-date version of their dev branch. We will need to update download_data.py to add an option to download the pudl.sqlite file from the nightly build. Ideally, we will want to use a stable, versioned file for our inputs so that it can be cited/referenced in the final data; otherwise, we will need to archive the version of the database that we download and use for our outputs.
  2. pudl_out is being deprecated, and all of the tables it produced will instead be pre-compiled and saved in pudl.sqlite. Thus, instead of having to use pudl_out, we can now read these tables directly from the database using pd.read_sql().
  3. catalystcoop.pudl is no longer going to be published as a software package, which I think means that we will no longer be able to import and use the pudl.analysis functions in our pipeline. We'll have to see how much this affects our pipeline. For example, in data_cleaning.clean_eia923(), we use pudl.analysis.allocate_net_gen.allocate_gen_fuel_by_generator_energy_source() to load the data, and then we do our own transformations on the data before running pudl.analysis.allocate_net_gen.agg_by_generator(). This may just mean that we need to copy this second function into the OGE codebase, or try to merge some of our data cleaning into the PUDL codebase.
  4. Many of the tables that we use, and their columns, are being renamed, so we will need to go through the codebase and update those references.
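
To illustrate item 2, here is a minimal sketch of reading a pre-compiled table directly from a SQLite database with pd.read_sql(). The helper name load_table, the table name example_generation, and its columns are hypothetical, and an in-memory database stands in for pudl.sqlite so that the snippet runs on its own; the real table and column names would need to be taken from the PUDL documentation.

```python
import sqlite3

import pandas as pd


def load_table(conn: sqlite3.Connection, table: str) -> pd.DataFrame:
    """Read an entire table from a SQLite database into a DataFrame.

    Hypothetical helper, not part of OGE or PUDL.
    """
    # Table names cannot be passed as bound SQL parameters, so check the name
    # against the tables and views that exist before building the query.
    existing = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
        )
    }
    if table not in existing:
        raise ValueError(f"Table {table!r} not found in database")
    return pd.read_sql(f'SELECT * FROM "{table}"', conn)


# Stand-in for pudl.sqlite: an in-memory database with one made-up table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_generation ("
    "plant_id_eia INTEGER, report_year INTEGER, net_generation_mwh REAL)"
)
conn.executemany(
    "INSERT INTO example_generation VALUES (?, ?, ?)",
    [(1, 2021, 100.0), (1, 2022, 120.0), (2, 2022, 80.0)],
)
conn.commit()

df = load_table(conn, "example_generation")
print(df)
```

In the pipeline, the connection would instead point at the downloaded file, e.g. sqlite3.connect(path_to_pudl_sqlite), where the path is whatever download_data.py saves to.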

One of the additional benefits of this change is that the output tables that we will read from the database will contain all historical years, and thus will have better backfilling of missing data. When we were running pudl_out before, we were only loading data using a single year, which means that many of the backfill functions for things like BA codes would not work as effectively.
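
As a rough illustration of why loading all years helps, the sketch below fills missing BA codes for each plant from that plant's other reporting years. The column names (plant_id_eia, report_year, ba_code) and the data are made up for the example, and the forward/backward fill shown here is only a stand-in for whatever backfilling PUDL actually performs.

```python
import pandas as pd

# Made-up multi-year plant records; None marks a missing BA code.
plants = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 2],
        "report_year": [2020, 2021, 2022, 2021, 2022],
        "ba_code": [None, "CISO", None, None, None],
    }
).sort_values(["plant_id_eia", "report_year"])

# Within each plant, carry known codes forward in time, then backward,
# so that any year with a reported code can fill the other years.
plants["ba_code_filled"] = plants.groupby("plant_id_eia")["ba_code"].transform(
    lambda s: s.ffill().bfill()
)

# With only 2022 loaded, there is nothing to fill from for either plant.
is_2022 = plants["report_year"] == 2022
missing_single_year = int(plants.loc[is_2022, "ba_code"].isna().sum())

# With all years loaded, plant 1's 2022 record is filled from 2021.
missing_all_years = int(plants.loc[is_2022, "ba_code_filled"].isna().sum())

print(missing_single_year, missing_all_years)
```

Plant 2 stays unfilled in both cases because no year reports a code for it, which is the kind of gap that multi-year data alone cannot close.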

I believe that all of the information about these updates from the PUDL side is documented in the following places:

To do:

General steps:

grgmiller commented 6 months ago

Addressed by https://github.com/singularity-energy/open-grid-emissions/pull/318