Closed gailin-p closed 2 years ago
Ok so I went digging specifically into the PJM data and this is what I found:
There are some pretty big spikes in the nuclear net generation data in PJM:
So how is this getting introduced? Through the residual profile shaping. This is because the CEMS profile for nuclear in PJM includes some large spikes in generation that are getting subtracted from the nuclear generation in EIA-930 (although I thought that we had implemented a check that prevents such shifted profiles from being used).
So how is there nuclear data in CEMS? Nuclear plants don't report data to CEMS! What's happening is that the PSEG Salem Generation Station in New Jersey (plant ID 2410) is a nuclear plant that has an oil-fired generator (likely some sort of backup generator) that does report data to CEMS, and it seems to have been briefly fired up three times in 2020. When we aggregate the CEMS data for use in the residual calculation, we do this by the plant primary fuel, and not the generator-specific fuel, which means this intermittent petroleum generation at this nuclear plant is treated as nuclear generation.
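To make the mislabeling concrete, here is a minimal sketch in plain Python (not the project's actual aggregation code). The record layout, the MWh values, and plant 9999 are all invented for illustration; only plant ID 2410 and its fuels come from the discussion above.

```python
from collections import defaultdict

# Hypothetical hourly CEMS records (values are made up):
# (plant_id, plant_primary_fuel, generator_fuel, hour, net_generation_mwh)
cems_records = [
    (2410, "nuclear", "petroleum", 100, 5.0),
    (2410, "nuclear", "petroleum", 2500, 9.0),
    (9999, "natural_gas", "natural_gas", 100, 300.0),  # hypothetical plant
]


def aggregate_cems(records, fuel_field):
    """Sum hourly generation by (fuel, hour), labeling each record with
    either the plant's primary fuel or the generator's own fuel."""
    index = {"plant_primary_fuel": 1, "generator_fuel": 2}[fuel_field]
    totals = defaultdict(float)
    for record in records:
        fuel, hour, mwh = record[index], record[3], record[4]
        totals[(fuel, hour)] += mwh
    return dict(totals)


# Labeling by plant primary fuel turns the backup generator into "nuclear".
by_primary = aggregate_cems(cems_records, "plant_primary_fuel")
# Labeling by generator fuel keeps it as petroleum.
by_generator = aggregate_cems(cems_records, "generator_fuel")
```

With primary-fuel labeling, a nuclear CEMS profile exists at all only because of the backup generator; with generator-level labeling there is no nuclear CEMS profile to subtract.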
It seems like this issue might also be relevant to the Turkey Point nuclear plant (id 621) in FPL. This may also affect other fuel types that don't report to CEMS, but may have a backup generator that does (solar, wind, hydro, nuclear).
So how do we fix this? The option that first comes to mind is that for fuel types that don't report to CEMS (solar, wind, hydro, nuclear), we set the CEMS profiles to zero before calculating the residual profile. This would prevent any backup generators from influencing the residual profile. However, we'd want to think carefully about this approach to ensure that we are not double-counting or undercounting any data.
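A rough sketch of that option, assuming the residual is simply the EIA-930 hourly value minus the CEMS hourly value for the same fuel. The function name and the list-based profiles are hypothetical, not the pipeline's real interface.

```python
# Fuel types that are not expected to report to CEMS (from the list above).
NON_CEMS_FUELS = {"solar", "wind", "hydro", "nuclear"}


def residual_profile(eia930_mwh, cems_mwh, fuel):
    """Return EIA-930 minus CEMS for each hour.

    For fuels that do not report to CEMS, the CEMS profile is treated as
    zero. Only a local replacement is made; the caller's CEMS data is
    left untouched.
    """
    if fuel in NON_CEMS_FUELS:
        cems_mwh = [0.0] * len(eia930_mwh)
    return [e - c for e, c in zip(eia930_mwh, cems_mwh)]


# Invented numbers: a flat nuclear profile and one hour of backup generation.
eia930 = [1000.0, 1000.0, 1000.0]
cems = [0.0, 5.0, 0.0]
nuclear_residual = residual_profile(eia930, cems, "nuclear")
gas_residual = residual_profile(eia930, cems, "natural_gas")
```

Here `nuclear_residual` stays flat, while the same inputs labeled `natural_gas` still have the CEMS generation subtracted.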
It actually looks like there are a number of plants with a primary clean fuel, but which have at least one generator that runs on a non-clean fuel:
> I thought that we had implemented a check that prevents such shifted profiles from being used
In `impute_hourly_profiles.select_best_available_profile`, it looks like we 1) check residual profiles for negative values and 2) check the shifted profile to not use it if it's greater than EIA-930. It seems like that check isn't working as expected here.
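For reference, the two checks described above can be sketched as follows. This is a simplified, hypothetical stand-in and not the actual body of `select_best_available_profile`; in particular, the way the shifted profile is built (raising the residual until its minimum is zero) and the fallback order are assumptions.

```python
def select_profile(eia930_profile, residual_profile):
    """Pick a profile using the two checks described in this thread.

    Returns a (name, profile) tuple.
    """
    # Check 1: a residual profile with no negative hours can be used as-is.
    if min(residual_profile) >= 0:
        return "residual_profile", residual_profile

    # Assumed construction: shift the residual up so its minimum is zero.
    shift = -min(residual_profile)
    shifted = [r + shift for r in residual_profile]

    # Check 2: reject the shifted profile if it exceeds EIA-930 in any hour.
    if all(s <= e for s, e in zip(shifted, eia930_profile)):
        return "shifted_residual_profile", shifted

    # Otherwise fall back to the EIA-930 profile itself.
    return "eia930_profile", eia930_profile


# Invented numbers: one hour where a CEMS spike drives the residual negative.
name, _ = select_profile([100.0, 100.0, 100.0], [100.0, -900.0, 100.0])
```

In this toy case the shifted profile would sit far above EIA-930 in the non-spike hours, so the sketch falls back to `eia930_profile`.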
> we set the CEMS profiles to zero before calculating the residual profile.
I think this would work, and it shouldn't introduce issues with double counting or not counting data, since the residual profile is not used directly as data, only to shape data from EIA-923. We would just have to be sure not to delete the data from the actual CEMS table, only from the version passed to the residual profile calculation. While this would work, I think we should only do it if the problem persists after fixing the above bug.
I think another root issue here that I'm just realizing is that even if the CEMS data included some small amount of backup generation from a diesel generator, the magnitude of that should be pretty small and not affect the residual profile that much. Looking at the data here though, the diesel generation spikes are many orders of magnitude higher than the total reported nuclear generation.
This raises the possibility that the source of this issue is actually outlier data in CEMS (https://github.com/singularity-energy/open-grid-emissions/issues/50)
Zooming in on the eia930 profile for nuclear, I'm also noticing something else funny, which is that the nuclear profile seems to be pretty variable:
However, looking at the reported EIA-930 data for nuclear in January 2020 reveals what we would expect: that nuclear should be pretty flat:
I'm thinking that this is likely a result of the physics reconciliation code, and makes me think that we need to adjust the weighting on the reconciliation so that certain fuel types like nuclear can't be adjusted in this way, or go back to using the raw generation data for the residual calculation (https://github.com/singularity-energy/open-grid-emissions/issues/104).
It turns out that this is not an issue with shaping at all -- our filters work correctly. Although the `cems_profile`, `shifted_residual_profile`, and `scaled_residual_profile` are all broken by the extreme CEMS values in the three problem hours in PJM, the selected profile in those hours is `eia930_profile`. While this profile still has the issues @grgmiller noted above (`gridemissions` physics-based cleaning has introduced unrealistic variability into the 930 profile for nuclear), it does not have the extreme spikes. The `profile` column in `outputs/2020/hourly_profiles_2020.csv` supports this conclusion, with no extreme spikes for PJM nuclear.
The spikes are instead introduced directly by the CEMS data from plant 2410. This is supported by the following state of current output and result data:
`outputs/2020/shaped_eia923_data_2020.csv`
There are two possible solutions to this:
* CEMS data cleaning. A simple outlier filter would discard these three days. However, we would need to be careful to not discard correct fossil generation from renewable generators, which may still be outliers (if most hours have zero CEMS generation) but may represent real backup generation.
* Discard CEMS data from renewable generators. I think this is the incorrect choice, since the backup generation is real (see plants listed by @grgmiller above). Although I do think this backup generation requires more thought. eg: do our gross-to-net conversion methods work for backup generation from mostly renewable plants? Should net generation ever be positive from a backup generator?
It turns out that the spikes are actually introduced by scaling partial CEMS data; specifically, the `partial_cems_plant` category.
The original CEMS data for plant 2410 is fine, showing between 1 and 9 MWh of generation for each of the three hours in 2020 where the backup generator is on:
_Fig 1: CEMS generation (blue) for 2020 for plant 2410. A backup generator appears to be on and reporting to CEMS for a total of 3 hours. This data is from our `output/cems_2020.csv` intermediate output file._
However, in our final output, we've scaled all April, May, and June generation from 2410 into those three hours, resulting in the following profile:
_Fig 2: Plant 2410 generation from our `results/plant_data/individual_plant_data.csv` result file._
Plant 2410 is shaped using the `partial_cems_plant` methodology. This category of data includes plants where one or more subplants in a plant report to CEMS and one or more remaining subplants do not. In this case, we shape the non-reporters using the reporter shape. For plant 2410, all of the 2020 nuclear generation appears to be allocated to the three hours with non-zero CEMS generation for the one reporting subplant in 2410.
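A toy illustration of why this produces spikes, assuming the shaping is a simple proportional allocation (each hour gets the non-reporter total times that hour's share of the reporter's generation). The function name and all numbers are invented.

```python
def shape_with_reporter(total_mwh, reporter_profile_mwh):
    """Distribute a non-reporting subplant's total generation across hours
    in proportion to the reporting subplant's hourly CEMS generation."""
    reporter_total = sum(reporter_profile_mwh)
    if reporter_total == 0:
        raise ValueError("reporter profile has no generation to shape with")
    return [total_mwh * h / reporter_total for h in reporter_profile_mwh]


# Invented example: a backup generator that runs in only two of six hours.
backup_profile = [0.0, 0.0, 4.0, 0.0, 0.0, 2.0]
# Invented total for the subplants that do not report to CEMS.
shaped = shape_with_reporter(1200.0, backup_profile)
```

The total is preserved, so nothing is double-counted, but every MWh lands in the few hours when the backup generator happened to run.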
I think the fix here is to make `partial_cems_plant` sensitive to fuel type: subplants of one fuel type should not be shaped if the available CEMS data is from a different fuel type. This change should be made in `data_cleaning.identify_partial_cems_plants`.
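One way the fuel-sensitive rule could look, as a sketch only. The dictionary layout, the helper name, and the subplant IDs are hypothetical and do not reflect the real signature of `identify_partial_cems_plants`.

```python
def subplants_to_shape_with_cems(subplants):
    """Return IDs of non-reporting subplants that may be shaped with CEMS data.

    `subplants` is a list of dicts for a single plant, each with the
    (hypothetical) keys "subplant_id", "fuel", and "reports_to_cems".
    A non-reporting subplant qualifies only if some reporting subplant
    in the same plant burns the same fuel.
    """
    reporting_fuels = {s["fuel"] for s in subplants if s["reports_to_cems"]}
    return [
        s["subplant_id"]
        for s in subplants
        if not s["reports_to_cems"] and s["fuel"] in reporting_fuels
    ]


# A plant shaped like 2410: nuclear units plus one CEMS-reporting oil unit.
nuclear_plant = [
    {"subplant_id": 1, "fuel": "nuclear", "reports_to_cems": False},
    {"subplant_id": 2, "fuel": "nuclear", "reports_to_cems": False},
    {"subplant_id": 3, "fuel": "petroleum", "reports_to_cems": True},
]
# A gas plant where one unit reports and a same-fuel unit does not.
gas_plant = [
    {"subplant_id": 1, "fuel": "natural_gas", "reports_to_cems": True},
    {"subplant_id": 2, "fuel": "natural_gas", "reports_to_cems": False},
]
```

Under this rule the nuclear subplants are excluded from CEMS-based shaping and would need to fall back to another profile, while the gas plant is still treated as a partial CEMS plant.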
Side note: I confirmed that this issue does not appear to result in double-counting. After checking our nuclear generation totals against eGRID nuclear generation totals for PJM, it looks like generation from 2410 is not being double-counted.
> I think the fix here is to make `partial_cems_plant` sensitive to fuel type: subplants of one fuel type should not be shaped if the available CEMS data is from a different fuel type. This change should be made in `data_cleaning.identify_partial_cems_plants`.
Agreed! Let's definitely implement this
Is there a validation check that we could add to the pipeline to automatically screen for spikes like this and alert us?
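One possible screen, sketched under the assumption that a nameplate capacity in MW is available for each plant: hourly net generation in MWh should not exceed capacity times one hour by more than a small tolerance. The function name, tolerance, and numbers are illustrative only.

```python
def find_generation_spikes(hourly_mwh, capacity_mw, tolerance=1.1):
    """Return (hour_index, mwh) pairs where hourly generation exceeds
    what the plant could physically produce in one hour."""
    limit_mwh = capacity_mw * tolerance  # MW sustained for 1 hour = MWh
    return [
        (hour, mwh) for hour, mwh in enumerate(hourly_mwh) if mwh > limit_mwh
    ]


# Invented data: a 2,000 MW plant with one implausible hour.
spikes = find_generation_spikes(
    [1900.0, 1950.0, 500000.0, 1900.0], capacity_mw=2000.0
)
```

A non-empty result could be logged as a warning during the validation step rather than failing the run.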
Some BAs have negative spikes in emission rate data. (negative meaning they deviate negatively from the trend)
Eg, PJM:
WACM:
Both the above figures compare real-time (red) to OGE (blue). Both show adjusted emission rate (CO2/MWh). In the OGE data, the relevant variable is `generated_co2_rate_lb_per_mwh_for_electricity_adjusted` in `power_sector_data`.
In some small BAs, rate spikes are expected (eg, DEAA where there is only one `natural_gas` plant and so plant startup leads to enormous positive BA-level emission rate spikes). However, we don't expect spikes that are abrupt decreases in rate (eg WACM), and we don't expect spikes in large BAs (eg, PJM).
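A simple way to screen for abrupt decreases like these is to compare each hour against the median of its neighbors. This is only a sketch; the window size and drop threshold are arbitrary assumptions that would need tuning per BA, and the rates below are invented.

```python
from statistics import median


def find_rate_dips(rates, window=5, drop_fraction=0.5):
    """Return indices of hours whose emission rate falls more than
    `drop_fraction` below the median of up to `window` hours on each side."""
    dips = []
    for i, rate in enumerate(rates):
        start = max(0, i - window)
        stop = min(len(rates), i + window + 1)
        neighbors = rates[start:i] + rates[i + 1:stop]
        if neighbors and rate < (1 - drop_fraction) * median(neighbors):
            dips.append(i)
    return dips


# Invented hourly rates (lb CO2/MWh) with one abrupt negative spike.
dips = find_rate_dips([900.0, 910.0, 905.0, 300.0, 908.0, 902.0, 899.0])
```

Because only downward deviations are flagged, the expected positive startup spikes in small BAs such as DEAA would not trigger this check, as long as the series is long enough that a single spike does not dominate the neighborhood median.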