Update partial CEMS shaping for mixed fuel plants

singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids

MIT License


Update partial CEMS shaping for mixed fuel plants #238

Closed gailin-p closed 2 years ago

gailin-p commented 2 years ago

Summary of changes

Close #230 by changing hourly data source identification methods:

- A renewable generator (hydro, wind, solar, nuclear, geothermal) won't use the partial_cems_plant hourly data source methodology.
- A subplant that contains generators of mixed fuel types will choose the hourly data source of the generator with the largest generation. This is relevant for subplants with subplant_id=NaN, which can have generators of mixed fuel types; see comments on #49.
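
The two rules above can be sketched in plain Python. This is only an illustration, not the code in this PR: the dict keys (`fuel_category`, `net_generation_mwh`, and so on) are hypothetical, and falling back to `"eia"` for the excluded fuel types is an assumption.

```python
from collections import defaultdict

# Fuel types the PR description lists as never using partial_cems_plant.
EXCLUDED_FUELS = {"hydro", "wind", "solar", "nuclear", "geothermal"}


def assign_hourly_data_source(generators):
    """Map each (plant_id, subplant_id) to a single hourly data source.

    `generators` is a list of dicts with hypothetical keys: plant_id,
    subplant_id, fuel_category, net_generation_mwh, hourly_data_source.
    """
    by_subplant = defaultdict(list)
    for gen in generators:
        source = gen["hourly_data_source"]
        # Rule 1: the listed fuel types never use partial_cems_plant.
        # Falling back to "eia" is an assumption made for this sketch.
        if gen["fuel_category"] in EXCLUDED_FUELS and source == "partial_cems_plant":
            source = "eia"
        key = (gen["plant_id"], gen["subplant_id"])
        by_subplant[key].append((gen["net_generation_mwh"], source))

    # Rule 2: the generator with the largest generation decides for the subplant.
    return {
        key: max(rows, key=lambda row: row[0])[1]
        for key, rows in by_subplant.items()
    }
```

With a subplant that mixes a large nuclear generator and a small diesel generator, the nuclear generator's source wins, so the subplant is no longer shaped with the diesel unit's partial CEMS profile.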

The post-fix hourly emission rate for PJM is below. The abrupt drops to zero (seen in #230) are gone.

The PJM nuclear emission rate, below, shows the three hours where the generator is on as higher-than-zero emission rates:

Open question: subplants in hourly `plant_data` results

Plant 2410, the nuclear plant originally responsible for the PJM data issues, has one subplant made up of a CEMS-reporting diesel generator and one subplant with its nuclear generation. Generation from the CEMS-reporting subplant is in plant_data/hourly/individual_plant_data.csv, while generation from the nuclear subplant is in shaped_fleet_data.csv. This is confusing, since a user looking only at individual_plant_data.csv would think that plant 2410 had only ~10 MWh of generation in 2020, but if they looked at the annual data, they'd see annual generation of 16,000,000 MWh.

Proposed solutions:

1. Exclude plants from individual_plant_data.csv if one or more of their subplants has hourly_data_source=eia
2. Make a note in the documentation that some plants in individual_plant_data.csv may not contain all subplants, and indicate how users can identify whether a plant has complete hourly data

I think option 1 is the better solution, but it does mean removing some hourly data that we're actually pretty confident in (e.g., for 2410, we do know that the diesel generator ran for those 3 hours).
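
Both proposals depend on knowing which plants have only part of their hourly data in individual_plant_data.csv. A rough sketch of that check follows; the tuple layout and function names are hypothetical, not taken from the pipeline.

```python
from collections import defaultdict


def plants_with_incomplete_hourly_data(subplant_sources):
    """Return plant ids where some, but not all, subplants use the "eia" source.

    `subplant_sources` is an iterable of (plant_id, subplant_id,
    hourly_data_source) tuples; this flat layout is hypothetical.
    """
    sources_by_plant = defaultdict(set)
    for plant_id, _subplant_id, source in subplant_sources:
        sources_by_plant[plant_id].add(source)
    return {
        plant_id
        for plant_id, sources in sources_by_plant.items()
        if "eia" in sources and len(sources) > 1
    }


def drop_incomplete_plants(hourly_rows, incomplete_plants):
    """Proposal 1: keep only rows for plants whose hourly data is complete."""
    return [row for row in hourly_rows if row["plant_id"] not in incomplete_plants]
```

The first proposal would apply `drop_incomplete_plants` before writing the file; the second would leave the rows in place and document how to derive the same set of plant ids.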

gailin-p commented 2 years ago

Other changes:

There are some notebook changes included here that are not directly related to #230:

- hourly_validation.ipynb and map_visualization.ipynb are updated to reflect the final changes I made before publishing the visualizations in the two blog posts (the announcement and real-time validation posts).
- work_in_progress/clean_cems_outliers.ipynb is a notebook I pulled out of the old/outdated gailin/clean_cems branch. This notebook is background research to advance #50. Moving the notebook into work_in_progress will allow us to delete the outdated gailin/clean_cems branch.

grgmiller commented 2 years ago

> Proposed solutions:
>
> 1. Exclude plants from individual_plant_data.csv if one or more of their subplants has hourly_data_source=eia
> 2. Make a note in the documentation that some plants in individual_plant_data.csv may not contain all subplants, and indicate how users can identify whether a plant has complete hourly data

So the original partial plant methodology was designed so that different parts of a plant couldn't end up in separate files: if all subplants had hourly_data_source=="eia", all of the plant data would end up in shaped_fleet_data, but if only some subplants had eia as the data source, we would use one of the partial cems methods, and then all of the data for that plant would end up in individual_plant_data (or at least that was the intent - if you've found that's not how it was working, then we would want to fix that bug).

However, with this issue, we've discovered that we don't always want to shape an entire plant with partial cems data, because of the spike issue. This PR now splits up the data so that (in the case of 2410, for example) the nuclear portion ends up in the shaped fleet data and the diesel portion ends up in the individual data.

If we went with option 1, do you know how much CEMS data we'd be ignoring (ie is it a handful of backup generators with negligible generation, or would this affect a wider set of generation)?

I'm kind of leaning toward 2 as the simpler fix for now.

There are also a couple of other options that we could consider:

- Instead of using "NA" subplant ids, use the unit_id_pudl and/or generator id to assign a non-NA value to these subplants
- Publish all of the data as individual plant data

gailin-p commented 2 years ago

> If we went with option 1, do you know how much CEMS data we'd be ignoring (ie is it a handful of backup generators with negligible generation, or would this affect a wider set of generation)?

The new methodology proposed in this PR results in 106 plant-months (over 12 unique plants) with split EIA and CEMS/partial_CEMS hourly data sources. The generation from these plants is split pretty evenly over CEMS and EIA hourly data sources, with about 21,000,000 MWh generation in CEMS subplants and 29,000,000 in EIA subplants.

The affected plants are: [557, 621, 645, 1355, 2410, 2707, 2953, 6074, 8223, 10029, 10823, 58236]. (clarification note: these are plants with split methodologies between subplants, which is a different set of plants than the plants with NaN-id subplants with split methodologies within a subplant.)
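
For reference, a count like the one above can be reproduced with a small grouping step. The sketch below is not the query used to produce those numbers: the row keys are hypothetical, and it treats every non-`eia` source as CEMS-derived.

```python
from collections import defaultdict


def source_kind(row):
    """Collapse data sources into "eia" vs. "cems" (CEMS or partial CEMS)."""
    return "eia" if row["hourly_data_source"] == "eia" else "cems"


def summarize_split_plant_months(rows):
    """Summarize plant-months whose subplants use both EIA and CEMS sources.

    `rows` are subplant-month dicts with hypothetical keys: plant_id,
    report_month, hourly_data_source, net_generation_mwh.
    """
    kinds = defaultdict(set)
    for row in rows:
        kinds[(row["plant_id"], row["report_month"])].add(source_kind(row))
    split = {key for key, found in kinds.items() if found == {"eia", "cems"}}

    generation = {"eia": 0.0, "cems": 0.0}
    for row in rows:
        if (row["plant_id"], row["report_month"]) in split:
            generation[source_kind(row)] += row["net_generation_mwh"]

    return {
        "plant_months": len(split),
        "plants": sorted({plant_id for plant_id, _month in split}),
        "generation_mwh": generation,
    }
```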

gailin-p commented 2 years ago

725e39a added a new hourly plant-level output file, partial_plant_data.csv, which contains the CEMS and partial CEMS plant data for plants with one or more subplants without hourly data.

grgmiller commented 2 years ago

The more I think about it, I'm thinking that perhaps we should revert https://github.com/singularity-energy/open-grid-emissions/pull/238/commits/725e39a76c234d9876d491817160d58968a5afb5 and just keep the outputs split among the two files with a disclaimer in the documentation (and maybe even on the download page) that data might be split between two files. Here's my thought process:

1. I think that in the next release we may actually want to move to publishing all of the hourly plant-level data for individual plants, instead of splitting it into two files. In talking to some of the early adopters of the OGE dataset, it seems that there is interest in the individual plant data at the hourly resolution. In the past, we haven't done this partly due to memory issues, but I think there may be a way around this: if we shape the individual plant data in chunks only during export, and do not actually store a full version of the shaped plant-level data in memory, I think this could work. We'd want to export the data in chunks (eg for each BA, or each state) so that each file isn't huge. Once we export the hourly plant-level data, we could apply the existing aggregated shaping method, since that's all we need for the subsequent power sector and consumed outputs.
2. For a patch release, I'm not sure that we want to be adding a whole new output to the dataset, especially if it is a temporary fix that we will probably not keep in the next major release.
3. It seems like the fix for outputting data into a separate file involves re-organizing the data pipeline a bit, and I'd want to review this a bit more carefully before implementing.
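
The chunked-export idea in the first point could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the OGE pipeline: shaping is reduced to scaling a per-BA profile, and the function names, tuple layout, and column names are all hypothetical.

```python
import csv
from pathlib import Path


def shape_to_profile(total_mwh, profile):
    """Spread a total over hours in proportion to a profile (simplified shaping)."""
    profile_sum = sum(profile)
    return [total_mwh * value / profile_sum for value in profile]


def export_hourly_by_ba(plants, profiles, out_dir):
    """Write one CSV per balancing authority, shaping one plant at a time.

    plants: list of (ba_code, plant_id, total_mwh) tuples (hypothetical layout).
    profiles: dict mapping ba_code to a list of hourly profile values.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for ba_code in sorted({ba for ba, _plant_id, _mwh in plants}):
        path = out_dir / f"{ba_code}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["plant_id", "hour", "net_generation_mwh"])
            for ba, plant_id, total_mwh in plants:
                if ba != ba_code:
                    continue
                # Only this plant's hourly series is held in memory at once.
                hourly = shape_to_profile(total_mwh, profiles[ba_code])
                writer.writerows(
                    (plant_id, hour, mwh) for hour, mwh in enumerate(hourly)
                )
        paths.append(path)
    return paths
```

Because each plant is shaped and written immediately, the full shaped plant-level table never has to exist in memory, and each output file is bounded by the size of one BA.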

gailin-p commented 2 years ago

> It seems like the fix for outputting data into a separate file involves re-organizing the data pipeline a bit, and I'd want to review this a bit more carefully before implementing.

Most of the reorganization is actually unrelated -- I just moved the writing of eia923 to just after the data is finished because I found it confusing for it to be exported later in the pipeline when it hasn't actually been modified since line 172. The only other pipeline change is to not delete it so it can be passed to the plant writing function.

> For a patch release, I'm not sure that we want to be adding a whole new output to the dataset, especially if it is a temporary fix that we will probably not keep in the next major release.

I see your point here, though it doesn't feel too problematic to me (it'll still get zipped up with the rest of the hourly plant data).

I think if we want to bundle the partial hourly plant data with the regular plant data, we should write a new data_quality_metrics file that reports the plant-months for which hourly plant data is partial. There's no other way to get that info from the output files alone. (We should maybe also copy that file to the plant_data/hourly folders so users get that information without having to search out and download another zip.)

grgmiller commented 2 years ago

> I found it confusing for it to be exported later in the pipeline when it hasn't actually been modified since line 172

Oh sorry about that - I realize now that in a previous iteration of the code the eia923_allocated df was modified by the partial CEMS shaping process, but that is no longer the case, so your move makes total sense!

> we should write a new data_quality_metrics file that reports the plant-months for which hourly plant data is partial

Doesn't results/2020/plant_data/plant_metadata.csv already do this? For each subplant-month, it identifies the source of the data and the source of the hourly profile. Without looking at a specific plant example, I'm not sure if this output needs to be further modified to meet these needs, but we should be able to use this output file.

gailin-p commented 2 years ago

Ok, I like the idea of re-combining the hourly plant-level output files and directing users to plant_metadata to identify which plants have hourly data split between the plant file and the synthetic plant file.

> Doesn't results/2020/plant_data/plant_metadata.csv already do this? For each subplant-month, it identifies the source of the data and the source of the hourly profile. Without looking at a specific plant example, I'm not sure if this output needs to be further modified to meet these needs, but we should be able to use this output file.

Great point! This seems like a good spot for this data. However, plant_metadata.csv will need to be modified, because it doesn't currently contain rows for EIA-only plants (only rows for the synthetic plants aggregated from many EIA plants). For example, using the classic example plant for this issue, 2410 (the PJM nuclear plant with the diesel backup): we see metadata rows for the three months when the diesel subplant reported to CEMS, but no plant_metadata.csv row for the nuclear subplant -- it's just grouped in the synthetic plant row.

Two options for fixing:

1. Add rows only for shaped subplants of plants with some individual plant data (ie, the plant-months currently getting written to partial_plant_data.csv)
2. Add rows for all subplant-months. This would make for a much larger file, but it guarantees that a user will be able to look up where the hourly data is for any plant they're interested in

grgmiller commented 2 years ago

I think either option 1 or 2 would work, and we should do whatever is easiest to implement at this point (based on my educated guess, 2 might be easier since it involves less segmenting of the data, but I could be mistaken). The plant_static_attributes file also contains the mapping of plant id to shaped plant id, so in the worst case, a user would have to cross reference the two tables to figure out the metadata for a plant that had been aggregated.
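
The cross-reference described above might look like this from a user's side. It is a sketch only: the row keys and the shape of the plant-to-shaped-plant mapping are assumptions, not the actual columns of plant_metadata.csv or plant_static_attributes.

```python
def locate_hourly_data(plant_id, metadata_rows, shaped_plant_ids):
    """Return the (file name, id to look up) pairs holding a plant's hourly data.

    metadata_rows: dicts with a hypothetical "plant_id" key, one per
        subplant-month that has individual hourly data.
    shaped_plant_ids: dict mapping plant_id to shaped plant id, as could be
        built from plant_static_attributes (hypothetical layout).
    """
    locations = set()
    if any(row["plant_id"] == plant_id for row in metadata_rows):
        locations.add(("individual_plant_data.csv", plant_id))
    if plant_id in shaped_plant_ids:
        locations.add(("shaped_fleet_data.csv", shaped_plant_ids[plant_id]))
    return locations
```

A plant with split methodologies returns two entries, which tells the user that neither file alone contains all of that plant's generation.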

gailin-p commented 2 years ago

07853b7:

- Removed separated full and partial plant-level hourly outputs
- Renamed metadata id to "<See shaped plant ID>"
