singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Enforce static subplant_id across years #353

Closed grgmiller closed 3 months ago

grgmiller commented 3 months ago

Purpose

As discovered in https://github.com/singularity-energy/open-grid-emissions/pull/351, the subplant_id assigned to each (plant_id_eia, generator_id) does not remain static across each year of OGE data. This is an issue because we would like to use subplant_id as a primary key to compare data across multiple years.

This PR updates the process of creating subplant IDs to try to enforce static subplant IDs. The changes in this PR enforce static subplant_ids within a single data release version of OGE. The subplant_ids may still change from version to version, although these changes represent small edge cases.

Fixes CAR-3872

What the code is doing

How do we enforce static subplant_id within each version? In the past, subplant IDs were generated independently for each data year, using CEMS IDs from (data year - 5 -> data year) (e.g. 2017-2022) and EIA generator data from the single data year (i.e. 2022). In this PR, we now load data from a fixed set of years, regardless of the data year. We introduce two new constants: earliest_data_year, which equals 2005, and latest_validated_year, which in the case of v0.3.x is 2022. Now, regardless of the data year being run, we always load the CEMS IDs and the EIA generators that existed in the reported data from 2005-2022. Although data exists in PUDL going back to 2001, the data prior to 2004 in EIA-860 was in a different format, and this early data appears to use a different set of generator_ids that is not consistent with the generator_ids used in later data. The IDs from 2005 on appear to be consistent. Thus, 2005 represents the earliest historical year for which we would ever attempt to calculate OGE data if/when we expand historical coverage.
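A minimal sketch of the difference between the old and new year windows, in plain Python. The function names and the upper-case constant names are hypothetical; only the two constants and their values come from the description above.

```python
# Values taken from the PR description; the upper-case spelling is an assumption.
EARLIEST_DATA_YEAR = 2005
LATEST_VALIDATED_YEAR = 2022  # value for v0.3.x


def years_to_load_old(data_year: int) -> dict:
    """Previous behaviour: the window depends on the data year being run."""
    return {
        "cems": list(range(data_year - 5, data_year + 1)),
        "eia": [data_year],
    }


def years_to_load_new(data_year: int) -> dict:
    """New behaviour: the same fixed window, whatever data_year is passed."""
    fixed = list(range(EARLIEST_DATA_YEAR, LATEST_VALIDATED_YEAR + 1))
    return {"cems": fixed, "eia": fixed}
```

Because `years_to_load_new` ignores `data_year`, every data year sees the same set of input IDs, which is what makes an identical crosswalk across years possible.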

How do we determine a non-changing subplant order? In the previous version, we assigned subplant_id based on sorting all generators by ["plant_id_eia", "generator_id"]. The problem with this approach is that new generators are not always built in numerical order. For example, a plant may have generators 1 and 2, and then decide to build a generator they call 1A. The existing sorting regime would have changed the subplant_id for generator 2. We now attempt to mitigate this by sorting by various dates, which should remain static. I experimented with several sorting schemes, and the one that seemed to be the most stable was to (after sorting by plant_id_eia) sort by the year that data was first reported to EIA (so that generators that didn't report data until recently always get sorted last), then by the generator's commercial operating date (so that newer generators get sorted last), and then by generator ID. This means that generator "1" may not always get subplant ID 1 if it did not report data until recently or if it was built after generator "2".
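The generator 1 / 2 / 1A example can be sketched as follows. This is an illustration in plain Python, not the pipeline code: the field names `first_reported_year` and `operating_date` are assumptions, the records are made up, and each generator is treated as its own subplant (the real crosswalk also groups linked units).

```python
from datetime import date

# Made-up records for one plant; only plant_id_eia and generator_id are
# names used in the PR, the other two field names are hypothetical.
generators = [
    {"plant_id_eia": 1, "generator_id": "1",
     "first_reported_year": 2005, "operating_date": date(1990, 1, 1)},
    {"plant_id_eia": 1, "generator_id": "2",
     "first_reported_year": 2005, "operating_date": date(1995, 1, 1)},
    {"plant_id_eia": 1, "generator_id": "1A",
     "first_reported_year": 2015, "operating_date": date(2015, 6, 1)},
]

OLD_SORT = ("plant_id_eia", "generator_id")
NEW_SORT = ("plant_id_eia", "first_reported_year", "operating_date", "generator_id")


def assign_ids(gens, sort_fields):
    """Number generators within each plant, starting at 1, in sorted order."""
    ordered = sorted(gens, key=lambda g: tuple(g[f] for f in sort_fields))
    next_id = {}
    ids = {}
    for g in ordered:
        plant = g["plant_id_eia"]
        next_id[plant] = next_id.get(plant, 0) + 1
        ids[(plant, g["generator_id"])] = next_id[plant]
    return ids


before_1a = generators[:2]  # the plant before generator 1A exists
old_before = assign_ids(before_1a, OLD_SORT)
old_after = assign_ids(generators, OLD_SORT)
new_before = assign_ids(before_1a, NEW_SORT)
new_after = assign_ids(generators, NEW_SORT)
```

Under the old sort, adding 1A moves generator 2 from ID 2 to ID 3. Under the date-based sort, generator 2 keeps ID 2 and 1A is appended as ID 3.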

Dealing with proposed plants: Because construction timelines for proposed plants shift over time, or sometimes proposed plants get cancelled, or sometimes proposed generators get a different plant_id_eia assigned once they are actually built, proposed generators have the greatest risk of having non-static subplant_id across versions. While excluding proposed plants may result in more static subplant_ids, sometimes proposed generators start reporting data before their official COD, or they have come online between the most recent data year and the current date. Thus, we need to include some proposed generators. To mitigate the risk of changing subplant_ids, we only include proposed generators that are currently under construction, but not generators under earlier phases of approvals.
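A rough sketch of that filter in plain Python. The `status` and `proposed_phase` fields and their string values are placeholders invented for this example; they are not the EIA-860 status codes or the column names used in the pipeline.

```python
# Placeholder phase labels (assumptions, not real EIA-860 codes).
INCLUDED_PROPOSED_PHASES = {"under_construction"}


def select_generators(gens):
    """Keep all non-proposed generators, plus proposed ones under construction."""
    kept = []
    for g in gens:
        if g["status"] != "proposed":
            kept.append(g)
        elif g.get("proposed_phase") in INCLUDED_PROPOSED_PHASES:
            kept.append(g)
    return kept


example = [
    {"generator_id": "1", "status": "existing"},
    {"generator_id": "2", "status": "proposed", "proposed_phase": "under_construction"},
    {"generator_id": "3", "status": "proposed", "proposed_phase": "approvals_pending"},
]
kept_ids = [g["generator_id"] for g in select_generators(example)]
```

Here generator 3, which is still in an approvals phase, is dropped, while the under-construction generator 2 is retained.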

Why won't subplant_ids be static across versions? There are several reasons for this. The first is that we are still working with an incomplete mapping of combined cycle plant parts, and as our understanding of these linkages improves over time, we will need to update the subplant_ids (e.g. if an improved power sector data crosswalk shows that two generators we previously thought to be unconnected are actually connected). Another reason is that over time, generators that exist but never reported to EIA may start reporting data, a generator that was under construction may get abandoned/cancelled, or the timeline for construction of multiple units may get switched. The inconsistencies due to this latter reason affect only a tiny portion of the data, but

Consistency of subplant_id with PUDL: PUDL also assigns a subplant_id to data, but these are not meant to be static, and we have also introduced additional mapping steps. Over time, perhaps we could make these two methods consistent with each other.

Zero indexing: Previously, all subplant_ids were zero-indexed (i.e. started with 0 instead of 1). This PR no longer zero-indexes these IDs and instead starts with 1. This is partially to distinguish these subplant_ids from the ones in PUDL, and also because zero-indexed numbers are less intuitive for customer-facing applications.

Testing

Development of this updated pipeline occurred in the sandbox.ipynb notebook, and I tested each piece step by step. It was through this testing that I learned of issues such as:

To evaluate how static the IDs are across versions, I ran the subplant pipeline as if the latest_validated_year was 2019, and again as if it were 2022, and compared the results. Out of the ~37,500 records in the subplant map, only 24 records had subplant IDs that changed between the two runs. Most of these changes were due to new combined cycle linkage data becoming available.

image
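A comparison of this kind could be sketched as below, assuming each run's subplant map is reduced to a dictionary keyed by (plant_id_eia, generator_id). The function name and the toy data are hypothetical and do not reproduce the actual results.

```python
def changed_ids(map_a, map_b):
    """Return keys present in both maps whose subplant_id differs."""
    shared = map_a.keys() & map_b.keys()
    return {
        key: (map_a[key], map_b[key])
        for key in sorted(shared)
        if map_a[key] != map_b[key]
    }


# Toy maps standing in for the two pipeline runs.
run_2019 = {(1, "1"): 1, (1, "2"): 2, (2, "CT1"): 1, (2, "ST1"): 2}
run_2022 = {(1, "1"): 1, (1, "2"): 2, (2, "CT1"): 1, (2, "ST1"): 1, (3, "1"): 1}

changed = changed_ids(run_2019, run_2022)
```

In this toy case the only change is (2, "ST1"), which moves from subplant 2 to subplant 1, mimicking a newly discovered combined cycle linkage. Generators that appear in only one run, such as (3, "1"), are not counted as changes.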

Where to look

Usage Example/Visuals


Review estimate

30+ min

Future work

Before Merging this PR, I need to:

Checklist

grgmiller commented 3 months ago

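So after running the full pipeline for all years over the weekend, I found that:

* The subplant crosswalks are now identical for all years (as expected)

* There are some places where the check for complete subplant_id coverage is now failing in years < 2022. I am looking into this and will have a fix ready soon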

image

rouille commented 3 months ago

So after running the full pipeline for all years over the weekend, I found that:

* The subplant crosswalks are now identical for all years (as expected)

* There are some places where the check for complete subplant_id coverage is now failing in years < 2022. I am looking into this and will have a fix ready soon
image

Is that the same list for all years < 2022?

grgmiller commented 3 months ago

Is that the same list for all years < 2022?

No, this is a slightly different list for each year. It looks like there were three primary causes for these warnings:

  1. The generator has an incorrect retirement date introduced in later years. For example, plant 10466 GEN3 was on standby operation for a number of years, but then suddenly in 2020 was listed as retired with a retirement date of 1999. This meant the generator was being excluded from our subplant crosswalk by the code. I have introduced a fix for this.
  2. There are plants that are under construction in 2019, but are then cancelled or disappear from the data sometime between 2019 and 2022. We could fix this by including any plants that are not cancelled as of earliest_data_year.
  3. Some of these warnings were flagging generators that were intentionally not included in the subplant crosswalk because they were never built, but were included in the dataframe we were checking because we failed to filter them out. I've introduced some fixes for this that only check whether the subplant ID is missing for generators where data exists in our main dataframe.
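The idea behind the third fix can be sketched in plain Python: restrict the coverage check to generators that actually have data. The function name, the set-based representation, and the sample IDs are all assumptions made for illustration.

```python
def missing_subplant_ids(generators_to_check, subplant_map):
    """List the (plant_id_eia, generator_id) keys that lack a subplant_id."""
    return sorted(
        key for key in generators_to_check if subplant_map.get(key) is None
    )


subplant_map = {(10, "1"): 1, (10, "2"): 2}

# Generators that have data in the main dataframe (hypothetical IDs).
generators_with_data = {(10, "1"), (10, "2"), (10, "3")}
# The broader list also contains a generator that was never built.
all_listed = generators_with_data | {(10, "NB1")}

too_broad = missing_subplant_ids(all_listed, subplant_map)
restricted = missing_subplant_ids(generators_with_data, subplant_map)
```

Checking `all_listed` flags the never-built generator as a false positive, while checking only `generators_with_data` reports just (10, "3"), the one genuine gap.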