Update PUDL Dependency to v2023.12.01

grgmiller commented 7 months ago

Purpose

This PR updates the OGE pipeline to work with the latest version of the pudl databases and code. As summarized in https://github.com/singularity-energy/open-grid-emissions/issues/310, there have been many major changes to pudl in the past year.

Fixes CAR-2966, CAR-2969, CAR-2902

Closes #310

What the code is doing

In pipfile:

Updates to python version 3.11 and removes dependency on python-snappy to work with pudl
Updates the dependency to a new fork of pudl that lives at singularity-energy/pudl. This was forked from pudl/main based on the 2023.12.01 release version of pudl.

In download_data:

download_pudl_data() was updated to allow pudl database files to be downloaded either from zenodo or aws.
aws-sourced files are now saved to data/downloads/pudl (which is where the zenodo files were previously downloaded), and zenodo files are downloaded to data/downloads/pudl_zenodo
However, zenodo-sourced files are no longer supported in the pipeline, at least until they are updated to match the structure of the aws-sourced data. See future work section below.
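
As a rough illustration of the directory convention described above, the mapping from data source to download folder could be sketched like this. The helper name `pudl_download_dir` and its `source` argument are hypothetical and are not the actual signature of download_pudl_data(); only the two folder names come from this PR.

```python
from pathlib import Path


def pudl_download_dir(source: str, root: str = "data/downloads") -> Path:
    """Return the folder where PUDL files from `source` would be saved.

    Hypothetical helper: only the folder names are taken from the PR text.
    """
    folders = {
        "aws": "pudl",  # aws files now use the folder zenodo files used before
        "zenodo": "pudl_zenodo",  # zenodo files move to their own folder
    }
    if source not in folders:
        raise ValueError(f"source must be one of {sorted(folders)}, got {source!r}")
    return Path(root) / folders[source]


aws_dir = pudl_download_dir("aws")
zenodo_dir = pudl_download_dir("zenodo")
```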

In load_data:

CEMS data is now stored in a single parquet file, so it is read with a single pd.read_parquet() call instead of loading and concatenating each state-year file separately.
Remove old load_pudl_table() and replace initialize_pudl_out() with a new load_pudl_table() function. We no longer have to use a pudl_out object to extract tables from the pudl database, since all tables are now saved in final form in the database. This function now reads the data directly with various options for which years and columns to load.
Removes functions for loading the EIA-860 boiler control association and boiler design parameter tables, since these have now been integrated into PUDL.
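
To make the idea of reading tables directly from the database concrete, here is a minimal sketch. It is not the actual load_pudl_table() implementation: the function name `load_table_sketch`, its parameters, the demo table, and the assumption that the table has a `report_date` column stored as ISO-formatted text in a SQLite database are all illustrative.

```python
import sqlite3

import pandas as pd


def load_table_sketch(conn, table_name, start_year=None, end_year=None, columns=None):
    """Read one table, optionally limited to certain years and columns.

    Illustrative only; assumes a `report_date` column holding ISO date strings.
    """
    cols = ", ".join(f'"{c}"' for c in columns) if columns else "*"
    query = f'SELECT {cols} FROM "{table_name}"'
    params = []
    if start_year is not None:
        # With no end_year, load only the single requested year.
        end_year = start_year if end_year is None else end_year
        query += " WHERE report_date >= ? AND report_date <= ?"
        params = [f"{start_year}-01-01", f"{end_year}-12-31"]
    return pd.read_sql_query(query, conn, params=params)


# Small in-memory database standing in for the real pudl database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE generators_demo (plant_id_eia INTEGER, report_date TEXT, capacity_mw REAL)"
)
conn.executemany(
    "INSERT INTO generators_demo VALUES (?, ?, ?)",
    [(1, "2019-01-01", 10.0), (1, "2020-01-01", 12.0), (2, "2020-01-01", 50.0)],
)

demo = load_table_sketch(
    conn, "generators_demo", start_year=2020, columns=["plant_id_eia", "capacity_mw"]
)
```

Filtering years and columns in the query keeps memory use down compared with loading the full table and subsetting afterwards.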

In data_cleaning:

Instead of running pudl.analysis.allocate_gen_fuel.allocate_gen_fuel_by_generator_energy_source() in our pipeline, we now load the output of this function from the pudl database.
The rest of the changes are column name updates, formatting updates, and updates to how we load the pudl data.
Because of an update to the allocated EIA-923 file format, certain fuels were being dropped during the plant primary fuel identification process. This PR fixes that issue by updating how the generator primary fuel is calculated so that these fuels are no longer dropped.

In emissions:

Fixes a merge error by updating the order in which nox and so2 emission factors are aggregated. When calculating generator-specific nox and so2 factors, we had been averaging together the factors for each boiler associated with each generator before standardizing the units of the emission factors, which was resulting in duplicate EFs for a single generator. We now do this averaging once the factors have been standardized to avoid the duplication.
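
The effect of the ordering can be shown with a toy example. The column names, the two units, and the conversion constant below are made up for illustration and do not reflect the real emission factor tables; the point is only that averaging before unit standardization leaves one row per generator and unit, while standardizing first leaves one row per generator.

```python
import pandas as pd

LB_PER_KG = 2.20462  # used only for this toy unit conversion

# Two boilers feed generator G1 and report factors in different units.
boiler_efs = pd.DataFrame(
    {
        "generator_id": ["G1", "G1", "G2"],
        "boiler_id": ["B1", "B2", "B3"],
        "emission_factor": [0.4, 0.2, 0.6],
        "unit": ["lb_per_mmbtu", "kg_per_mmbtu", "lb_per_mmbtu"],
    }
)


def standardize_units(df: pd.DataFrame) -> pd.DataFrame:
    """Convert every factor to lb_per_mmbtu (toy conversion)."""
    out = df.copy()
    is_kg = out["unit"] == "kg_per_mmbtu"
    out.loc[is_kg, "emission_factor"] = out.loc[is_kg, "emission_factor"] * LB_PER_KG
    out["unit"] = "lb_per_mmbtu"
    return out


# Old order: average first, so the unit has to stay in the grouping keys
# and G1 ends up with two rows (one per unit).
averaged_first = boiler_efs.groupby(["generator_id", "unit"], as_index=False)[
    "emission_factor"
].mean()

# New order: standardize first, then average, giving one row per generator.
standardized_first = (
    standardize_units(boiler_efs)
    .groupby("generator_id", as_index=False)["emission_factor"]
    .mean()
)
```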

Testing

Ran the entire data pipeline for 2020.

Where to look

The most substantial changes are in download_data and load_data. data_cleaning and emissions also changed significantly.

Otherwise, most of the changes in other files are just changing all of the pudl_out references to the new load_pudl_table() function.

The manual data files containing emissions factors for nox and so2 were updated to use pudl-formatted firing types and to replace "N/A" values in the table with "none" to help with matching.

Ignore sandbox.ipynb

In column_checks.py, several column names were updated and the dtypes dict was sorted alphabetically by key.

Usage Example/Visuals

N/A

Review estimate

20-30 min

Future work

Once there is a 2023.12.01 data release on zenodo, check to make sure that the database format is updated and potentially switch back to the zenodo version of the downloaded files.
There are new prime movers that are flagged for missing NOx emission factors. This seems to be a result of switching the boiler bottom codes and boiler fuel codes from "N/A" to "none". Not sure if these were being ignored before or if the warning just needs to be updated.

Checklist

[x] Update the documentation to reflect changes made in this PR
[x] Format all updated python files using black
[x] Clear outputs from all notebooks modified
[x] Add docstrings and type hints to any new functions created

grgmiller commented 6 months ago

@rouille Ok the pipeline successfully ran on my computer, so we should be good to go.

HOWEVER, I just learned from the folks at Catalyst that the table names/columns are going to change again next week with a big rename PR that will be going through. See the rename PR and the PUDL data dictionary from the rename PR docs build.

So this PR will make things current to the 2023.12.01 pudl data release, but not the ultimate format that will be released next week.

I'm going to propose that we go ahead and merge this PR so that we can keep working on other issues, and then I can start a new PR after next week to re-update the PUDL dependency based on the big rename. (Unless you think we should take an alternate approach)

rouille commented 6 months ago

I ran the pipeline successfully. Several warnings were thrown by pandas, which I list here for reference:

data_cleaning.py:844: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  gen_capacity.groupby(agg_keys, dropna=False)["capacity_mw"].transform(max)
data_cleaning.py:737: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  primary_fuel_calc.groupby(agg_keys, dropna=False)[source].transform(max)
data_cleaning.py:737: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  primary_fuel_calc.groupby(agg_keys, dropna=False)[source].transform(max)
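
As the message itself suggests, these warnings should go away if the string "max" is passed instead of the built-in `max`. A minimal sketch with made-up data (only `agg_keys` and `capacity_mw` come from the traceback above; the `max_capacity_mw` column is hypothetical):

```python
import pandas as pd

gen_capacity = pd.DataFrame(
    {"plant_id_eia": [1, 1, 2], "capacity_mw": [10.0, 25.0, 5.0]}
)
agg_keys = ["plant_id_eia"]

# Passing the string "max" keeps the current SeriesGroupBy.max behavior
# and does not depend on how pandas treats the built-in callable.
gen_capacity["max_capacity_mw"] = gen_capacity.groupby(agg_keys, dropna=False)[
    "capacity_mw"
].transform("max")
```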

load_data.py:223: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '<DatetimeArray>
['2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00',
 ...
 '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00']
Length: 8760, dtype: datetime64[ns]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.

gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
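
This warning is raised when a grouping key is categorical, so presumably `data_source` (or `plant_id_eia`) is a categorical column here. Passing `observed` explicitly silences it. The sketch below uses invented data and a hypothetical `gtn_ratio` column to show the difference between the two settings; which one the pipeline wants is a judgment call for the maintainers.

```python
import pandas as pd

factors_to_use = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "data_source": pd.Categorical(
            ["cems", "eia", "cems"], categories=["cems", "eia", "default"]
        ),
        "gtn_ratio": [0.95, 0.93, 0.90],
    }
)
keys = ["plant_id_eia", "data_source"]

# observed=True keeps only key combinations that actually appear in the data.
observed_only = factors_to_use.groupby(keys, dropna=False, observed=True)[
    "gtn_ratio"
].mean()

# observed=False retains the current default, which also reports
# unobserved category combinations.
all_combos = factors_to_use.groupby(keys, dropna=False, observed=False)[
    "gtn_ratio"
].mean()
```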

impute_hourly_profiles.py:233: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'residual_profile_filtered' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
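
Both "incompatible dtype" warnings look like cases where a column was first created as float64 (for example, filled with NaN) and then assigned strings or datetimes through `.loc`. I have not checked the exact lines, so the following is only a sketch of the general fix with hypothetical column names: give the column a compatible dtype before assigning into it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"net_generation_mwh": [1.0, 2.0]})

# A column initialized with NaN is float64, so assigning a string into it
# via .loc would trigger the FutureWarning. Cast to object first.
df["profile_method"] = np.nan
df["profile_method"] = df["profile_method"].astype(object)
df.loc[0, "profile_method"] = "residual_profile_filtered"

# For dates, create the column with a datetime dtype from the start
# instead of starting from a float NaN column.
df["report_date"] = pd.Series(pd.NaT, index=df.index, dtype="datetime64[ns]")
df.loc[df["net_generation_mwh"] > 1.0, "report_date"] = pd.Timestamp("2021-12-01")
```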
