singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Update subplant mapping #239

Closed grgmiller closed 1 year ago

grgmiller commented 2 years ago

This PR closes #49 and advances #230.

Background

The primary purpose of this PR is to fix several issues related to subplant_ids:

These two general issues were leading to specific downstream issues in the data pipeline:

How does this PR fix these issues?

One way to fix these issues would be to update the pudl analysis module that creates the subplant_id using graph analysis. That is not possible until the next versioned release of pudl is available. Instead, we post-process the assigned subplant_ids in our own code to expand and correct them.

  1. Ensure that all generator_id and unitid are included in the crosswalk table. Previously, the subplant crosswalk table only included units/generators for which there was an identified match in the EPA power sector crosswalk table. However, there are plenty of plants that don't report to CEMS that could have a subplant_id assigned. We now include observations for all units that report to CEMS and all generators that report to EIA.
  2. Use boiler-generator associations to complete/update subplant_id mappings. Similar to the graph analysis pudl uses to identify subplant_id, pudl also uses graph analysis to identify clusters of boilers and generators, to which it assigns a unit_id_pudl. We merge these unit_id_pudl into the subplant crosswalk and use them to update the subplant_id.

This is the description of what the update_subplant_id() function does:

Data Preparation
Because the existing subplant_id crosswalk was only meant to map CAMD units to EIA generators, it
is missing a large number of subplant_ids for generators that do not report to CEMS. Before applying this
function to the subplant crosswalk, the crosswalk must be completed with all generators by outer
merging in the complete list of generators from EIA-860 (specifically the gens_eia860 table from pudl).
This dataframe also contains the complete list of `unit_id_pudl` mappings that will be necessary.
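
The data preparation step can be sketched roughly as follows. This is a minimal illustration with made-up data, not the code in this PR: the dataframe contents, the `plant_id_eia` key column, and the variable names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical existing crosswalk: only generators matched to a CEMS unit appear.
crosswalk = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "unitid": ["U1", "U2"],
        "generator_id": ["A", "B"],
        "subplant_id": [0, 1],
    }
)

# Hypothetical stand-in for the gens_eia860 table: every EIA generator,
# with unit_id_pudl where a boiler-generator association exists.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2],
        "generator_id": ["A", "B", "C", "D"],
        "unit_id_pudl": pd.array([1, 1, None, None], dtype="Int64"),
    }
)

# The outer merge keeps every row from both tables. Generators that never
# appeared in the crosswalk come through with a missing subplant_id.
complete = crosswalk.merge(
    gens,
    how="outer",
    on=["plant_id_eia", "generator_id"],
    validate="m:1",  # each generator should appear once in the EIA list
)

print(complete)
```

Generators C and D, which do not report to CEMS, now have rows in the crosswalk with a missing subplant_id, which is the gap the function described here fills.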

This function follows several steps:
1. Because the current subplant_id code does not take boiler-generator associations into account,
        there may be instances where the code assigns generators to different subplants when in fact, according
        to the boiler-generator association table, these generators are grouped into a single unit based on their
        boiler associations. The first step of this function is thus to identify if multiple subplant_id have
        been assigned to a single unit_id_pudl. If so, we replace the existing subplant_ids with a single subplant_id.
        For example, if a generator A was assigned subplant_id 0 and generator B was assigned subplant_id 1, but
        both generators A and B are part of unit_id_pudl 1, we would re-assign the subplant_id to both generators to
        0 (we always use the lowest-numbered subplant_id in each unit_id_pudl group). This may result in some subplant_id
        values being skipped (e.g. if there were also a generator C with subplant_id 2, there would now be no
        subplant_id 1 at the plant), but this is okay because we will later renumber all subplant_ids.
        Likewise, sometimes multiple unit_id_pudl are connected to a single subplant_id, so we also correct the
        unit_id_pudl based on these connections.
2. The second issue is that there are many NA subplant_id that we should fill. To do this, we first look at
        unit_id_pudl. If a group of generators is assigned a unit_id_pudl but has NA subplant_ids, we assign a single
        new subplant_id to this group of generators. If there are still generators at a plant that have both NA subplant_id
        and NA unit_id_pudl, for now we assume that each of these generators constitutes its own subplant. We thus assign
        each of these generators a new subplant_id that is distinct from any subplant_id already at the plant.
        In the case that there are multiple unitid at a plant that are not matched to any other identifiers (generator_id,
        unit_id_pudl, or subplant_id), as is the case when there are units that report to CEMS but which do not exist in the EIA
        data, we assign these units to a single subplant.
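
The two steps above can be sketched in pandas as follows. This is a simplified illustration, not the actual `update_subplant_id()` implementation: the function name `update_subplant_ids_sketch`, the `plant_id_eia` column, the example data, and the choice to renumber from 0 are all assumptions, and the correction of `unit_id_pudl` mentioned at the end of step 1 is omitted.

```python
import numpy as np
import pandas as pd


def update_subplant_ids_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of steps 1 and 2; not the function in this PR."""
    df = df.copy()

    # Step 1: if a unit_id_pudl spans several subplant_ids, use the lowest one.
    has_unit = df["unit_id_pudl"].notna()
    unit_min = (
        df[has_unit]
        .groupby(["plant_id_eia", "unit_id_pudl"])["subplant_id"]
        .transform("min")
    )
    df.loc[has_unit, "subplant_id"] = unit_min

    # Step 2: label each row that still lacks a subplant_id by how it is grouped.
    missing = df["subplant_id"].isna()
    label = pd.Series("cems_only", index=df.index)  # units with no EIA match
    has_gen = df["generator_id"].notna()
    label.loc[has_gen] = "gen_" + df.loc[has_gen, "generator_id"].astype(str)
    label.loc[has_unit] = "unit_" + df.loc[has_unit, "unit_id_pudl"].astype(str)

    # New ids start after the highest existing id at the plant (or at 0).
    next_id = df.groupby("plant_id_eia")["subplant_id"].transform("max").fillna(-1) + 1
    offset = (
        label[missing]
        .groupby(df.loc[missing, "plant_id_eia"])
        .transform(lambda s: pd.Series(pd.factorize(s)[0], index=s.index))
    )
    df.loc[missing, "subplant_id"] = next_id[missing] + offset

    # Renumber so ids are consecutive within each plant (assumed to start at 0).
    df["subplant_id"] = (
        df.groupby("plant_id_eia")["subplant_id"].rank(method="dense").astype(int) - 1
    )
    return df


# Made-up example plant covering each case described above.
example = pd.DataFrame(
    {
        "plant_id_eia": [1] * 9,
        "generator_id": ["A", "B", "C", "D", "E", "F", "G", None, None],
        "unitid": [None, None, None, None, None, None, None, "X1", "X2"],
        "unit_id_pudl": [1, 1, np.nan, 2, 2, np.nan, np.nan, np.nan, np.nan],
        "subplant_id": [0, 1, 2, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    }
)
result = update_subplant_ids_sketch(example)
print(result[["generator_id", "unitid", "unit_id_pudl", "subplant_id"]])
```

In this example, generators A and B collapse into subplant 0, C is renumbered from 2 to 1, D and E share one new subplant, F and G each get their own, and the two CEMS-only units X1 and X2 share a final subplant.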

Other updates

TODO

grgmiller commented 2 years ago

I also added a notebook to start exploring issue #240