Closed grgmiller closed 6 months ago
@rouille Ok the pipeline successfully ran on my computer, so we should be good to go.
HOWEVER, I just learned from the folks at Catalyst that the table names/columns are going to change again next week with a big rename PR that will be going through. See: the rename PR, and the PUDL data dictionary from the rename PR docs build.
So this PR will make things current to the 2023.12.01 pudl data release, but not the ultimate format that will be released next week.
I'm going to propose that we go ahead and merge this PR so that we can keep working on other issues, and then I can start a new PR after next week to re-update the PUDL dependency based on the big rename. (Unless you think we should take an alternate approach)
I ran the pipeline successfully. pandas threw several warnings, which I list here for reference:
data_cleaning.py:844: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
gen_capacity.groupby(agg_keys, dropna=False)["capacity_mw"].transform(max)
data_cleaning.py:737: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
primary_fuel_calc.groupby(agg_keys, dropna=False)[source].transform(max)
data_cleaning.py:737: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
primary_fuel_calc.groupby(agg_keys, dropna=False)[source].transform(max)
load_data.py:223: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '<DatetimeArray>
['2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 00:00:00',
...
'2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00', '2021-12-01 00:00:00']
Length: 8760, dtype: datetime64[ns]' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
gross_to_net_generation.py:518: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
factors_to_use.groupby(["plant_id_eia", "data_source"], dropna=False)
impute_hourly_profiles.py:233: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'residual_profile_filtered' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
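For reference, the three kinds of warning above can each be silenced with a small change at the call site. The following is only a minimal sketch on made-up toy data, not the pipeline code: the `profile_method` column and all values are hypothetical, and whether `observed=False` or `observed=True` is right for `gross_to_net_generation.py` is a judgment call left to the authors.

```python
import pandas as pd

# Hypothetical toy data; column names echo the warnings, values are invented.
gen_capacity = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "data_source": pd.Categorical(["eia", "cems", "eia"]),
        "capacity_mw": [10.0, 25.0, 5.0],
    }
)

# 1) transform(max) -> transform("max"): pass the string name of the aggregation.
max_capacity = gen_capacity.groupby("plant_id_eia", dropna=False)[
    "capacity_mw"
].transform("max")

# 2) State observed= explicitly when a grouping key is categorical.
#    observed=False keeps the current behavior (unobserved combinations are kept);
#    observed=True keeps only combinations that actually occur in the data.
sizes_all = gen_capacity.groupby(
    ["plant_id_eia", "data_source"], dropna=False, observed=False
).size()
sizes_observed = gen_capacity.groupby(
    ["plant_id_eia", "data_source"], dropna=False, observed=True
).size()

# 3) Cast the column to a compatible dtype before assigning values of another type.
#    "profile_method" is a hypothetical column that starts out as float64 (all NaN).
profiles = pd.DataFrame({"profile_method": [float("nan"), float("nan")]})
profiles["profile_method"] = profiles["profile_method"].astype(object)
profiles.loc[0, "profile_method"] = "residual_profile_filtered"
```

The same idea applies to the `load_data.py:223` warning: convert the target column to a datetime dtype before writing datetime values into it.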
Purpose
This PR updates the OGE pipeline to work with the latest version of the pudl databases and code. As summarized in https://github.com/singularity-energy/open-grid-emissions/issues/310, there have been many major changes to pudl in the past year.
Fixes CAR-2966, CAR-2969, CAR-2902
Closes #310
What the code is doing
In pipfile: singularity-energy/pudl. This was forked from pudl/main based on the 2023.12.01 release version of pudl.
In download_data: download_pudl_data() was updated to allow pudl database files to be downloaded either from zenodo or aws. data/downloads/pudl (which is where the zenodo files were previously downloaded), and zenodo files are downloaded to data/downloads/pudl_zenodo
In load_data: pd.read_parquet() instead of having to load and concat each state-year file separately. load_pudl_table() and replace initialize_pudl_out() with a new load_pudl_table() function. We no longer have to use a pudl_out object to extract tables from the pudl database, since all tables are now saved in final form in the database. This function now reads the data directly with various options for which years and columns to load.
In data_cleaning: pudl.analysis.allocate_gen_fuel.allocate_gen_fuel_by_generator_energy_source() in our pipeline, we now load the output of this function from the pudl database.
In emissions:
Testing
Ran the entire data pipeline for 2020.
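For anyone who wants to try the direct-read pattern in isolation, here is a rough sketch of what "read a table straight from the database, filtered by year and column" can look like. It is an illustration only and not the implementation in this PR: `load_table_sketch`, the `generators` table, and the assumption that rows carry a `report_date` text column are all hypothetical, and it uses an in-memory SQLite database rather than the real pudl file.

```python
import sqlite3

import pandas as pd


def load_table_sketch(conn, table_name, year=None, columns=None, end_year=None):
    """Hypothetical stand-in for load_pudl_table(); not the PR's implementation."""
    # Table and column names cannot be bound as SQL parameters, so this sketch
    # interpolates them and assumes they come from trusted code, not user input.
    select = ", ".join(columns) if columns else "*"
    query = f"SELECT {select} FROM {table_name}"
    params = ()
    if year is not None:
        last_year = end_year if end_year is not None else year
        # ISO-formatted date strings compare correctly as text.
        query += " WHERE report_date >= ? AND report_date <= ?"
        params = (f"{year}-01-01", f"{last_year}-12-31")
    return pd.read_sql_query(query, conn, params=params)


# Build a tiny in-memory database with invented rows to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE generators (plant_id_eia INTEGER, report_date TEXT, capacity_mw REAL)"
)
conn.executemany(
    "INSERT INTO generators VALUES (?, ?, ?)",
    [(1, "2019-01-01", 10.0), (1, "2020-01-01", 12.0), (2, "2020-01-01", 5.0)],
)
conn.commit()

# Load only the 2020 rows and only two of the columns.
gens_2020 = load_table_sketch(
    conn, "generators", year=2020, columns=["plant_id_eia", "capacity_mw"]
)
```

Filtering in the query means only the requested years and columns are ever loaded into memory, which is the main benefit over building a full `pudl_out` object first.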
Where to look
The most substantial changes occur in download_data and load_data. data_cleaning and emissions also got substantial changes.
Otherwise, most of the changes in other files just replace the pudl_out references with the new load_pudl_table() function.
The manual data files containing emissions factors for nox and so2 were updated to use pudl-formatted firing types, and to replace "N/A" values in the table with "none" to help with matching.
Ignore sandbox.ipynb
In column_checks.py, several column names were updated and the dtypes dict was sorted alphabetically by key.
Usage Example/Visuals
N/A
Review estimate
20-30 min
Future work
Checklist
black