singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Update PUDL Dependency to v2023.12.01 #318

Closed grgmiller closed 6 months ago

grgmiller commented 7 months ago

Purpose

This PR updates the OGE pipeline to work with the latest version of the pudl databases and code. As summarized in https://github.com/singularity-energy/open-grid-emissions/issues/310, there have been many major changes to pudl in the past year.

Fixes CAR-2966, CAR-2969, CAR-2902

Closes #310

What the code is doing

In pipfile:

In download_data:

In load_data:

In data_cleaning:

In emissions:

Testing

Ran the entire data pipeline for 2020.

Where to look

The most substantial changes are in download_data and load_data. data_cleaning and emissions also changed significantly.

Most of the changes in the other files simply replace pudl_out references with calls to the new load_pudl_table() function.
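
For reviewers unfamiliar with the pattern, here is a minimal sketch of reading a single table straight from a SQLite database with pandas, rather than going through an output object. The real signature and behavior of load_pudl_table() are not shown in this description, so `load_table_sketch`, the `generators` table, and its columns are all hypothetical stand-ins, not the actual OGE code.

```python
import sqlite3

import pandas as pd


def load_table_sketch(conn, table_name, columns=None):
    """Hypothetical stand-in for a table loader: read one table into a DataFrame.

    Assumes `table_name` and `columns` come from trusted code, not user input.
    """
    col_sql = "*" if columns is None else ", ".join(f'"{c}"' for c in columns)
    return pd.read_sql_query(f'SELECT {col_sql} FROM "{table_name}"', conn)


# Build a throwaway in-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"plant_id_eia": [1, 2], "capacity_mw": [10.0, 20.0]}
).to_sql("generators", conn, index=False)

# Load only the columns that are needed.
gens = load_table_sketch(conn, "generators", columns=["plant_id_eia"])
```

Loading only the required columns per call keeps memory use down compared with materializing a full output object.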

The manual data files containing emissions factors for nox and so2 were updated to use pudl-formatted firing types and to replace "N/A" values in the table with "none" to help with matching.
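
One plausible reason the "N/A" values interfere with matching (an assumption; it is not stated above) is that pandas treats the string "N/A" as a missing value by default when reading a CSV, whereas "none" is kept as an ordinary string. The column names and values below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row emission factor table; column names are illustrative only.
csv_text = (
    "prime_mover_code,boiler_firing_type,emission_factor\n"
    "ST,N/A,1.5\n"
    "ST,none,2.5\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# "N/A" is in pandas' default list of NA strings, so it becomes NaN on load,
# while "none" survives as a regular string that can be matched on directly.
n_missing = int(df["boiler_firing_type"].isna().sum())
kept = df["boiler_firing_type"].dropna().tolist()
```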

Ignore sandbox.ipynb

In column_checks.py, several column names were updated and the dtypes dict was sorted alphabetically by key.

Usage Example/Visuals

N/A

Review estimate

20-30 min

Future work

Checklist

grgmiller commented 6 months ago

@rouille Ok the pipeline successfully ran on my computer, so we should be good to go.

HOWEVER, I just learned from the folks at Catalyst that the table names/columns are going to change again next week with a big rename PR that will be going through. See the rename PR and the PUDL data dictionary from the rename PR docs build.

So this PR will make things current to the 2023.12.01 pudl data release, but not the ultimate format that will be released next week.

I'm going to propose that we go ahead and merge this PR so that we can keep working on other issues, and then I can start a new PR after next week to re-update the PUDL dependency based on the big rename. (Unless you think we should take an alternate approach)

rouille commented 6 months ago

I ran the pipeline successfully. pandas threw several warnings, which I list here for reference:

data_cleaning.py:844: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  gen_capacity.groupby(agg_keys, dropna=False)["capacity_mw"].transform(max)
data_cleaning.py:737: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  primary_fuel_calc.groupby(agg_keys, dropna=False)[source].transform(max)
data_cleaning.py:737: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  primary_fuel_calc.groupby(agg_keys, dropna=False)[source].transform(max)
load_data.py:223: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '<DatetimeArray>
['2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00',
 ...
 '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00']
Length: 8760, dtype: datetime64[ns]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
impute_hourly_profiles.py:233: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'residual_profile_filtered' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
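
For reference, a minimal sketch of the changes these warnings ask for, using a toy DataFrame. The data and the `profile_method` column are hypothetical; only the groupby keys and `capacity_mw` mirror names from the warnings above. This is not the actual pipeline code.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "plant_id_eia": pd.Categorical([1, 1, 2]),
        "data_source": ["a", "a", "b"],
        "capacity_mw": [10.0, 30.0, 20.0],
    }
)

# 1. Pass the string "max" instead of the builtin max to transform().
max_capacity = df.groupby("data_source", dropna=False)["capacity_mw"].transform("max")

# 2. State observed= explicitly when a grouping key is categorical.
#    observed=False retains the current behavior; observed=True adopts the future default.
counts = df.groupby(
    ["plant_id_eia", "data_source"], dropna=False, observed=False
).size()

# 3. Cast a column to a compatible dtype before assigning values of another type.
#    Here a float64 column of NaN is cast to object before a string is written into it.
profiles = pd.DataFrame({"profile_method": [float("nan"), float("nan")]})
profiles["profile_method"] = profiles["profile_method"].astype(object)
profiles.loc[0, "profile_method"] = "residual_profile_filtered"
```

The same cast-first approach should apply to the datetime case in load_data.py, with the column converted to a datetime dtype before assignment.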