Enforce static subplant_id across years

grgmiller commented 3 months ago

Purpose

As discovered in https://github.com/singularity-energy/open-grid-emissions/pull/351, the subplant_id assigned to each (plant_id_eia, generator_id) does not remain static across each year of OGE data. This is an issue because we would like to use subplant_id as a primary key to compare data across multiple years.

This PR updates the process of creating subplant IDs to try to enforce static subplant IDs. The changes in this PR enforce static subplant_ids within a single data release version of OGE, although the subplant_ids may still change from version to version (these version-to-version changes represent small edge cases).

Fixes CAR-3872

What the code is doing

How do we enforce static subplant_id within each version? In the past, subplant ids were generated independently for each data year, using CEMS IDs for (data year - 5 -> data year) (eg 2017-2022), and using EIA generator data from the single data year (ie 2022). In this PR, we now use a fixed set of years to load data from, regardless of the data year. We introduce two new constants: earliest_data_year, which equals 2005, and latest_validated_year, which in the case of v0.3.x is 2022. Now, regardless of the data year being run, we always load CEMS IDs that existed in the reported data from 2005-2022, and the EIA generators that existed in the data from 2005-2022. Although data exists in PUDL going back to 2001, the data prior to 2004 in EIA-860 was in a different format, and it appears that a different set of generator_ids was used in this early data which is not consistent with the generator_ids used in later data. These IDs from 2005 on appear to be consistent. Thus, 2005 represents the earliest historical year for which we would ever attempt to calculate OGE data if/when we expand historical coverage.
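As a minimal sketch of the difference between the old rolling window and the new fixed window: the two constants use the names and values given above, while the helper functions `rolling_year_window` and `static_year_window` are hypothetical names used only for illustration and are not taken from the OGE codebase.

```python
# Constants named in this PR (values are those stated for v0.3.x)
earliest_data_year = 2005
latest_validated_year = 2022


def rolling_year_window(data_year: int) -> list[int]:
    """Old behavior (illustrative): years depend on the data year being run."""
    return list(range(data_year - 5, data_year + 1))


def static_year_window() -> list[int]:
    """New behavior (illustrative): the same years regardless of the data year."""
    return list(range(earliest_data_year, latest_validated_year + 1))


# The rolling window changes with the data year, so the inputs to the
# subplant mapping (and therefore the IDs) can differ between years.
window_for_2019 = rolling_year_window(2019)
window_for_2022 = rolling_year_window(2022)

# The static window is identical for every run within a release.
fixed_window = static_year_window()
```

Because every data year now sees the same input years, the subplant map only needs to change when `latest_validated_year` changes, which happens between releases rather than between data years.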

How do we determine a non-changing subplant order? In the previous version, we assigned subplant_id based on sorting all generators by ["plant_id_eia", "generator_id"]. The problem with this approach is that new generators are not always built in numerical order. For example, a plant may have generators 1 and 2, and then decide to build a generator they call 1A. The existing sorting regime would have changed the subplant_id for generator 2. We now attempt to mitigate this by sorting by various dates, which should remain static. I experimented with several sorting schemes, and the one that seemed to be the most stable was to (after sorting by plant_id_eia) sort by the year that data was first reported to EIA (so that generators that didn't report data until recently always get sorted last), then by the generator's commercial operating date (so that newer generators get sorted last), and then by generator ID. This means that generator "1" may not always get subplant ID 1 if it did not report data until recently or if it was built after generator "2".
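The 1, 2, 1A example can be sketched with pandas as follows. This is a simplified illustration, not the pipeline code: it treats every generator as its own subplant, numbers IDs from 1 in both cases so that only the ordering differs, and uses hypothetical column names (`first_reported_year`, `operating_date`) and a hypothetical helper `assign_ids`.

```python
import pandas as pd

# Toy data: generator "1A" is added to a plant that already has "1" and "2".
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1],
        "generator_id": ["1", "2", "1A"],
        "first_reported_year": [2005, 2005, 2015],  # hypothetical column
        "operating_date": pd.to_datetime(
            ["1990-01-01", "1995-06-01", "2015-03-01"]
        ),  # hypothetical column
    }
)


def assign_ids(df: pd.DataFrame, sort_cols: list[str]) -> pd.Series:
    """Number generators within each plant in the order given by sort_cols."""
    ordered = df.sort_values(sort_cols, kind="stable")
    ids = ordered.groupby("plant_id_eia", sort=False).cumcount() + 1
    return ordered.assign(subplant_id=ids).set_index("generator_id")["subplant_id"]


old_cols = ["plant_id_eia", "generator_id"]
new_cols = ["plant_id_eia", "first_reported_year", "operating_date", "generator_id"]

before = gens[gens["generator_id"] != "1A"]

# Old scheme: "1A" sorts between "1" and "2", so generator "2" is renumbered.
old_before = assign_ids(before, old_cols)
old_after = assign_ids(gens, old_cols)

# Date-based scheme: "1A" sorts last, so existing IDs are unchanged.
new_before = assign_ids(before, new_cols)
new_after = assign_ids(gens, new_cols)
```

In this toy case generator "2" keeps its ID under the date-based sort, while the ID-based sort moves it when "1A" appears.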

Dealing with proposed plants: Because construction timelines for proposed plants shift over time, proposed plants sometimes get cancelled, and proposed generators sometimes get a different plant_id_eia assigned once they are actually built, proposed generators have the greatest risk of having non-static subplant_id across versions. While excluding proposed plants may result in more static subplant_ids, sometimes proposed generators start reporting data before their official COD, or they have come online between the most recent data year and the current date. Thus, we need to include some proposed generators. To mitigate the risk of changing subplant_ids, we only include proposed generators that are currently under construction, but not generators under earlier phases of approvals.
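A rough sketch of that filter is below. The column name `status_code` is hypothetical, and the set of codes treated as "under construction" is an assumption based on my reading of the EIA-860 proposed-generator status codes (U and V for under construction, TS for construction complete but not yet in commercial operation); the codes actually used in this PR may differ.

```python
import pandas as pd

# ASSUMPTION: which EIA-860 status codes count as "under construction".
under_construction_codes = {"U", "V", "TS"}

# Toy table of proposed generators; "status_code" is a hypothetical column name.
proposed = pd.DataFrame(
    {
        "plant_id_eia": [100, 100, 101, 102],
        "generator_id": ["A", "B", "C", "D"],
        "status_code": ["U", "P", "TS", "L"],
    }
)

# Keep only generators that are far enough along to be likely to report data.
# Earlier approval phases (here "P" and "L") are dropped.
proposed_to_keep = proposed[proposed["status_code"].isin(under_construction_codes)]
```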

Why won't subplant_ids be static across versions? There are several reasons for this. The first is that we are still working within an incomplete mapping of combined cycle plant parts, and as our understanding of these linkages improves over time, we will need to update the subplant_ids (e.g. if we learn, with an improved power sector data crosswalk, that two generators we previously thought to be unconnected are actually connected). Another reason is that over time, generators that exist but never reported to EIA may start reporting data, a generator that was under construction may get abandoned/cancelled, or the timeline for construction of multiple units may get switched. The inconsistencies due to this latter reason affect only a tiny portion of the data, but

Consistency of subplant_id with pudl: PUDL also assigns a subplant_id to data, but these are not meant to be static, and we have also introduced additional mapping steps. Over time, perhaps we could make these two methods consistent with each other.

zero indexing: While previously, all subplant_id were zero indexed (ie started with 0 instead of 1), this PR no longer zero-indexes these IDs and instead starts with 1. This is partially to distinguish these subplant_id from the ones in PUDL, and also because zero-indexed numbers are less intuitive for customer-facing applications.

Testing

Development of this updated pipeline occurred in the sandbox.ipynb notebook, and I tested each piece step by step. It was through this testing that I uncovered issues such as:

The existing plant_parts_eia table we were using from pudl does not contain a complete set of the unit_id_pudl that we were using for part of the subplant identification, resulting in incomplete subplant linkages
Some generators that have existed for a long time did not start reporting data until a couple of years ago, which threw off the original subplant sorting by start date, like (1128, 6), (63580, CAT15)
Our existing connect_ids code had some bugs that were affecting the mapping
When using groupby, we need to include the sort=False argument to keep the data in index order, otherwise groupby sorts by the groupby keys
pd.factorize() is a better way to assign unique numerical IDs to groups rather than .ngroup()
For units that had a missing generator_id (for example because the data exists only in EPA but not EIA), these were getting assigned a single subplant_id, because the code was treating the NAs as a single value
The pudl tables that we were loading are missing some generator operating dates and unit codes (used to create unit_id_pudl), so we need to supplement with data loaded from the raw EIA-860 tables
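Three of the pandas behaviors in the list above can be reproduced on toy data. This is only an illustration of the general behavior, not the code in this PR; `emissions_unit_id_epa` is a hypothetical column name, and the fallback key used for missing generator IDs is one possible workaround rather than necessarily the fix applied here.

```python
import pandas as pd

df = pd.DataFrame({"group_key": ["b", "a", "b", "a"]})

# Default groupby sorts by the keys, so "a" becomes group 0 even though
# "b" appears first in the data.
sorted_ids = df.groupby("group_key").ngroup()

# sort=False numbers the groups in order of first appearance instead.
appearance_ids = df.groupby("group_key", sort=False).ngroup()

# pd.factorize also numbers values in order of first appearance.
codes, uniques = pd.factorize(df["group_key"])

# Missing keys: factorize gives every NA the same sentinel code (-1),
# so unrelated units with no generator_id would all be lumped together.
units = pd.DataFrame(
    {
        "emissions_unit_id_epa": ["U1", "U2", "U3"],  # hypothetical column
        "generator_id": ["G1", None, None],
    }
)
na_codes, _ = pd.factorize(units["generator_id"])

# One possible workaround (an assumption, not necessarily the fix in this PR):
# fall back to a key built from the EPA unit ID so each unit stays distinct.
fallback_key = units["generator_id"].fillna("epa_" + units["emissions_unit_id_epa"])
filled_codes, _ = pd.factorize(fallback_key)
```

Adding 1 to the factorized codes would give IDs that start at 1 rather than 0, matching the indexing change described above.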

To evaluate how static the IDs are across versions, I ran the subplant pipeline as if the latest_validated_year was 2019, and again as if it were 2022, and compared the results. Out of the ~37,500 records in the subplant map, only 24 records had subplant IDs that changed between the two runs. Most of these changes were due to new combined cycle linkage data becoming available.
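A comparison of this kind can be sketched as below. The two toy maps stand in for the outputs of the two runs; the variable names and values are made up for illustration and are not the actual results.

```python
import pandas as pd

keys = ["plant_id_eia", "generator_id"]

# Toy subplant maps standing in for the two runs (values are illustrative only).
map_2019 = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "generator_id": ["1", "2", "A"],
        "subplant_id": [1, 2, 1],
    }
)
map_2022 = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 2],
        "generator_id": ["1", "2", "A", "B"],
        "subplant_id": [1, 1, 1, 2],  # generators 1 and 2 are now linked
    }
)

# Compare only generators present in both runs; generators that are new in
# the later run cannot have a "changed" ID.
compared = map_2019.merge(
    map_2022, on=keys, how="inner", suffixes=("_2019", "_2022")
)
changed = compared[compared["subplant_id_2019"] != compared["subplant_id_2022"]]
n_changed = len(changed)
```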

[ ] Test run the pipeline for multiple years

Where to look

load_data and data_cleaning have most of the substantive changes for this subplant pipeline in them
I did create a new work_in_progress notebook to hold some in progress code that had previously been in gross_to_net_generation. This should not be reviewed.

Usage Example/Visuals

How the code can be used and/or images of any graphs, tables or other visuals (not always applicable).

Review estimate

30+ min

Future work

Before Merging this PR, I need to:

[ ] Move the subplant code to its own module (I wanted to wait until this underwent a first round of review to make it easy to compare changes)

Checklist

[x] Update the documentation to reflect changes made in this PR
[ ] Format all updated python files using black
[ ] Clear outputs from all notebooks modified
[ ] Add docstrings and type hints to any new functions created

grgmiller commented 3 months ago

So after running the full pipeline for all years over the weekend, I found that:

The subplant crosswalks are now identical for all years (as expected)
There are some places where the check for complete subplant_id coverage is now failing in years < 2022. I am looking into this and will have a fix ready soon

rouille commented 3 months ago

> So after running the full pipeline for all years over the weekend, I found that:
>
> * The subplant crosswalks are now identical for all years (as expected)
> * There are some places where the check for complete subplant_id coverage is now failing in years < 2022. I am looking into this and will have a fix ready soon

Is that the same list for all years < 2022?

grgmiller commented 3 months ago

> Is that the same list for all years < 2022?

No, this is a slightly different list for each year. It looks like there were three primary causes for these warnings:

The generator has an incorrect retirement date introduced in later years. For example, plant 10466 GEN3 was on standby operation for a number of years, but then suddenly in 2020 was listed as retired with a retirement date of 1999. This meant the generator was being excluded from our subplant crosswalk by the code. I have introduced a fix for this
There are plants that are under construction in 2019, but then are cancelled or disappear from the data sometime between 2019 and 2022. We could fix this by including any plants that are not cancelled as of earliest_data_year.
Some of these warnings were flagging generators that were intentionally not included in the subplant crosswalk because they were never built, but were included in the dataframe we were checking because we failed to filter them out. I've introduced some fixes for this that only check whether the subplant ID is missing for generators where data exists in our main dataframe.
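The idea behind the last fix can be sketched like this: start from the generators that actually have data and left-merge the crosswalk onto them, so generators that were never built never enter the check. The dataframe names and contents are hypothetical and only illustrate the approach.

```python
import pandas as pd

keys = ["plant_id_eia", "generator_id"]

# Generators that have data in the main dataframe (toy values).
main_data = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "generator_id": ["1", "2", "A"],
    }
)

# Subplant crosswalk (toy values). Generator (2, "A") is missing on purpose.
# Generator (3, "X") was never built and has no data, so it is never checked.
crosswalk = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "generator_id": ["1", "2"],
        "subplant_id": [1, 2],
    }
)

# Left-merge from the main data so the check covers exactly the generators
# that report data, then flag the ones without a subplant_id.
checked = main_data.merge(crosswalk, on=keys, how="left")
missing_subplant = checked[checked["subplant_id"].isna()]
```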
