Closed grgmiller closed 3 months ago
So after running the full pipeline for all years over the weekend, I found that:
* The subplant crosswalks are now identical for all years (as expected)
* There are some places where the check for complete subplant_id coverage is now failing in years < 2022. I am looking into this and will have a fix ready soon.
Is that the same list for all years < 2022?
No, this is a slightly different list for each year. It looks like there were three primary causes for these warnings:
Purpose
As discovered in https://github.com/singularity-energy/open-grid-emissions/pull/351, the `subplant_id` assigned to each (plant_id_eia, generator_id) does not remain static across each year of OGE data. This is an issue because we would like to use subplant_id as a primary key to compare data across multiple years.
This PR updates the process of creating subplant IDs to try to enforce static subplant IDs. The changes in this PR enforce static subplant_ids within a single data release version of OGE, although the subplant_ids may still change from version to version (these version-to-version changes represent small edge cases).
Fixes CAR-3872
What the code is doing
How do we enforce static subplant_id within each version? In the past, subplant IDs were generated independently for each data year, using CEMS IDs from (data year - 5) through the data year (e.g. 2017-2022) and EIA generator data from the single data year (i.e. 2022). In this PR, we now load data from a fixed set of years, regardless of the data year. We introduce two new constants: `earliest_data_year`, which equals 2005, and `latest_validated_year`, which in the case of v0.3.x is 2022. Now, regardless of the data year being run, we always load the CEMS IDs and the EIA generators that existed in the reported data from 2005-2022. Although data exists in PUDL going back to 2001, the EIA-860 data prior to 2004 was in a different format, and it appears to use a different set of generator_ids that is not consistent with the generator_ids used in later data. The IDs from 2005 on appear to be consistent. Thus, 2005 represents the earliest historical year for which we would ever attempt to calculate OGE data if/when we expand historical coverage.
How do we determine a non-changing subplant order? In the previous version, we assigned subplant_id based on sorting all generators by `["plant_id_eia", "generator_id"]`. The problem with this approach is that new generators are not always built in numerical order. For example, a plant may have generators 1 and 2, and then decide to build a generator they call 1A. The existing sorting regime would have changed the subplant_id for generator 2. We now attempt to mitigate this by sorting by various dates, which should remain static. I experimented with several sorting schemes, and the one that seemed to be the most stable was to (after sorting by plant_id_eia) sort by the year that data was first reported to EIA (so that generators that didn't report data until recently always get sorted last), then by the generator's commercial operating date (so that newer generators get sorted last), and then by generator ID. This means that generator "1" may not always get subplant ID 1 if it did not report data until recently or if it was built after generator "2".
Dealing with proposed plants: Because construction timelines for proposed plants shift over time, proposed plants sometimes get cancelled, and proposed generators sometimes get a different plant_id_eia assigned once they are actually built, proposed generators have the greatest risk of having non-static subplant_id across versions. While excluding proposed plants may result in more static subplant_ids, sometimes proposed generators start reporting data before their official COD, or they have come online between the most recent data year and the current date. Thus, we need to include some proposed generators. To mitigate the risk of changing subplant_ids, we only include proposed generators that are currently under construction, but not generators in earlier phases of approval.
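The sorting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the column names `first_report_year` and `operating_date`, the helper `assign_subplant_order`, and the sample data are all assumptions made for the example, and it treats every generator as its own subplant (the real pipeline also groups linked generators and units).

```python
import pandas as pd

# Hypothetical generator table. Only plant_id_eia and generator_id are names
# taken from the PR; the other columns and all values are made up.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 2],
        "generator_id": ["1", "1A", "2", "B", "A"],
        "first_report_year": [2005, 2015, 2005, 2005, 2005],
        "operating_date": pd.to_datetime(
            ["1990-01-01", "2015-06-01", "1995-01-01", "2001-01-01", "2003-01-01"]
        ),
    }
)


def assign_subplant_order(df: pd.DataFrame) -> pd.DataFrame:
    """Number generators within each plant using a date-based sort order."""
    sorted_df = df.sort_values(
        ["plant_id_eia", "first_report_year", "operating_date", "generator_id"]
    ).copy()
    # cumcount numbers rows within each plant in sorted order; +1 makes the
    # IDs start at 1 instead of 0
    sorted_df["subplant_id"] = sorted_df.groupby("plant_id_eia").cumcount() + 1
    return sorted_df


result = assign_subplant_order(gens)
ids = result.set_index(["plant_id_eia", "generator_id"])["subplant_id"].to_dict()
```

In this toy data, generator "1A" at plant 1 first reported in 2015, so it sorts after generator "2" and generator "2" keeps ID 2. At plant 2, generator "B" was built before generator "A", so "B" gets ID 1 even though "A" sorts first alphabetically.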
Why won't subplant_ids be static across versions? There are several reasons for this. The first is that we are still working with an incomplete mapping of combined cycle plant parts, and as our understanding of these linkages improves over time, we will need to update the subplant_ids (e.g. if an improved power sector data crosswalk shows that two generators we previously thought to be unconnected are actually connected). Another reason is that over time, generators that exist but never reported to EIA may start reporting data, a generator that was under construction may get abandoned or cancelled, or the construction timeline for multiple units may get switched. The inconsistencies due to this latter reason affect only a tiny portion of the data, but
Consistency of subplant_id with pudl: PUDL also assigns a subplant_id to data, but these are not meant to be static, and we have also introduced additional mapping steps. Over time, perhaps we could make these two methods consistent with each other.
Zero indexing: Previously, all subplant_ids were zero-indexed (i.e. started with 0 instead of 1). This PR no longer zero-indexes these IDs and instead starts with 1. This is partly to distinguish these subplant_ids from the ones in PUDL, and also because zero-indexed numbers are less intuitive for customer-facing applications.
Testing
Development of this updated pipeline occurred in the `sandbox.ipynb` notebook, and I tested each piece step by step. It was through this testing that I learned things such as:
* groupby needs the `sort=False` argument to keep the data in index order; otherwise groupby sorts by the groupby keys
* `pd.factorize()` is a better way to assign unique numerical IDs to groups than `.ngroup()`
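A small illustration of the two pandas behaviors above, using a made-up `group_key` column (an assumption for this example, not a column from the pipeline). It only shows how the numbering differs between the approaches; it is not the code used in this PR.

```python
import pandas as pd

# Toy data: the keys appear in the order b, a, b, c
df = pd.DataFrame({"group_key": ["b", "a", "b", "c"]})

# Default groupby (sort=True): groups are numbered in sorted key order,
# so "a" gets 0 even though "b" appears first in the data
df["ngroup_sorted"] = df.groupby("group_key").ngroup()

# sort=False: groups are numbered in order of first appearance
df["ngroup_unsorted"] = df.groupby("group_key", sort=False).ngroup()

# pd.factorize also numbers values in order of first appearance;
# adding 1 makes the IDs start at 1 instead of 0
codes, uniques = pd.factorize(df["group_key"])
df["factorized"] = codes + 1
```

Here `ngroup_sorted` is `[1, 0, 1, 2]`, `ngroup_unsorted` is `[0, 1, 0, 2]`, and `factorized` is `[1, 2, 1, 3]`.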
To evaluate how static the IDs are across versions, I ran the subplant pipeline as if the latest_validated_year was 2019, and again as if it were 2022, and compared the results. Out of the ~37,500 records in the subplant map, only 24 records had subplant IDs that changed between the two runs. Most of these changes were due to new combined cycle linkage data becoming available.
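A comparison like this could be done along the following lines. This is a hedged sketch rather than the actual test code: the helper `find_changed_subplant_ids` and the two small sample maps are hypothetical, and it assumes each map has one row per (plant_id_eia, generator_id).

```python
import pandas as pd


def find_changed_subplant_ids(map_a: pd.DataFrame, map_b: pd.DataFrame) -> pd.DataFrame:
    """Return generators present in both maps whose subplant_id differs."""
    keys = ["plant_id_eia", "generator_id"]
    # The inner merge keeps only generators that exist in both runs
    merged = map_a.merge(map_b, on=keys, how="inner", suffixes=("_a", "_b"))
    return merged[merged["subplant_id_a"] != merged["subplant_id_b"]]


# Made-up subplant maps standing in for the two pipeline runs
map_2019 = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "generator_id": ["1", "2", "A"],
        "subplant_id": [1, 2, 1],
    }
)
map_2022 = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 3],
        "generator_id": ["1", "2", "A", "X"],
        "subplant_id": [1, 1, 1, 1],
    }
)

changed = find_changed_subplant_ids(map_2019, map_2022)
```

In this toy case only generator "2" at plant 1 changes (for example, because it was newly linked to generator "1"), and generator "X" is ignored because it only exists in the later run.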
Where to look
* `load_data` and `data_cleaning` have most of the substantive changes for this subplant pipeline in them
* `gross_to_net_generation`: this should not be reviewed.
Usage Example/Visuals
How the code can be used and/or images of any graphs, tables or other visuals (not always applicable).
Review estimate
30+ min
Future work
Before Merging this PR, I need to:
Checklist
black