singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License
67 stars 4 forks source link

CEMS data cleaning is dropping some CEMS data #279

Closed grgmiller closed 1 year ago

grgmiller commented 1 year ago

When looking into BA-level discrepancies between eGRID2021 data and OGE data, I noticed that the total CO2 emissions for the SOCO balancing authority were about 20% lower than the total reported in eGRID. Comparing plant-level results within SOCO, I noticed that several plants, including "Barry" (ID 3) had substantially less emissions reported in OGE than eGRID. Comparing unit-level CO2 totals for Barry in eGRID2021 and our cleaned CEMS file revealed that our CO2 totals for some units were lower than those in eGRID. These data were both coming from CEMS, so it was puzzling why the totals would be different. I tried loading the "raw" CEMS data from pudl (ie before we apply any data cleaning steps in the OGE pipeline), and these totals matched those in eGRID, which suggests that there is an issue somewhere in data_cleaning.clean_cems() that may be dropping or altering data in an unexpected way.

After stepping through the data cleaning process, it appears that there may be several separate issues:

  1. During emissions.fill_cems_missing_co2() there are several unexpected things happening: 1) co2 emissions are being added to unit 1, which has 0 fuel consumption or generation. 2) The "filled" emissions for unit 7B are now lower than the original emissions. This function performs several operations based on the index of the dataframe, and I am wondering if the index values are shifting at some point.
  2. Then, data_cleaning.remove_cems_with_zero_monthly_data() is removing co2 data (but not generation or fuel data), likely from those records for which co2 was added for hours of zero fuel consumption in the previous step.
grgmiller commented 1 year ago

It turns out that when you merge two dataframes together, the original index of the left dataframe is not preserved unless you are merging on the index. In emissions.fill_cems_missing_co2(), when we were merging emisison factors into the missing_co2 dataframe, the index was being reset. Then when we updated the cems dataframe using the updated values, since pd.update matches based on the index, the co2 values were overwriting data for the incorrect plant/hour. This was likely affecting hundreds of thousands of rows of data (out of several million) so this bug was having a pretty substantial impact - basically overwriting hourly co2 data with co2 data from other plants and hours.

How can we prevent this from happening again? Whenever we are using pd.update or merging on indexes, we need to make sure that the index of the right dataframe has not been accidentally reset. One way to do this is flag if the right dataframe contains an index value of 0 and an index value of len(df)-1, which indicates that the index has been reset. More generally, in the CEMS data cleaning pipeline, we can run a check that ensures that the output co2 total value for each plant is >= the raw input value (the output value might be greater due to data filling).

grgmiller commented 1 year ago

This is what the original BA-level validation against eGRID looked like:

    net_generation_mwh  fuel_consumed_mmbtu fuel_consumed_for_electricity_mmbtu co2_mass_lb generated_co2_rate_lb_per_mwh   num_plants
SOCO    0.00    0.01    0.01    -0.19   -0.19   6.0
HST 0.01    0.00    0.00    -0.13   -0.14   0.0
PGE 0.23    0.09    0.09    0.12    -0.08   0.0
IID 0.00    0.00    0.00    -0.08   -0.08   0.0
HECO    0.19    0.18    0.17    0.14    -0.04   11.0
CISO    0.04    -0.00   -0.01   0.01    -0.03   7.0
PACW    0.00    -0.01   -0.01   -0.02   -0.02   0.0
FMPP    0.01    0.00    0.00    0.00    -0.01   0.0
PNM 0.01    0.00    0.00    0.00    -0.01   0.0
PACE    0.00    -0.01   -0.01   -0.00   -0.00   0.0
NBSO    0.00    0.00    -0.07   0.00    0.00    0.0
SCEG    0.01    0.01    -0.01   0.01    0.00    0.0
ERCO    0.01    0.01    -0.00   0.01    0.00    4.0
JEA 0.00    0.00    0.01    0.00    0.00    0.0
NYIS    0.01    0.01    -0.00   0.02    0.01    19.0
BANC    0.00    0.01    -0.01   0.01    0.01    0.0
WALC    0.00    0.00    0.00    0.01    0.01    0.0
CPLE    0.00    0.02    0.02    0.02    0.01    1.0
BPAT    0.00    0.00    0.00    0.01    0.01    0.0
ISNE    0.00    0.01    0.00    0.01    0.01    -14.0
PJM 0.01    0.01    0.01    0.02    0.02    3.0
FPC 0.00    0.00    0.00    0.02    0.02    0.0
DUK 0.00    0.01    0.01    0.02    0.02    2.0
MISO    0.01    0.02    0.00    0.03    0.02    10.0
NEVP    0.00    0.03    0.03    0.03    0.02    1.0
TEPC    -0.23   -0.27   -0.27   -0.21   0.03    0.0
TVA 0.00    -0.00   -0.00   0.03    0.03    0.0
TEC 0.01    0.06    0.06    0.06    0.04    0.0
PSEI    0.00    0.04    0.02    0.05    0.05    0.0
AZPS    0.00    0.07    0.07    0.05    0.05    0.0
SRP 0.07    0.17    0.17    0.15    0.08    3.0
TAL 0.00    0.09    0.09    0.09    0.09    0.0
AEC 0.00    0.00    0.00    0.27    0.27    0.0
AKMS    NaN NaN NaN NaN NaN NaN
HIMS    NaN NaN NaN NaN NaN NaN
RIMS    NaN NaN NaN NaN NaN NaN
Total   0.01    0.01    0.01    0.01    inf NaN