Handling annually-reported data in EIA-923

grgmiller commented 1 year ago

Each plant can report data to EIA at one of three frequencies:

respondent_frequency	Description
A	The respondent only provides an annual total(s) for this record via the EIA-923 annual survey form. Any monthly data in this record is estimated based on the respondent's reported annual total(s) and power plants with similar characteristics to this plant.
AM	The respondent provides monthly values for this record, but does so once per year via the EIA-923 annual survey form.
M	The respondent provides monthly values for this record and does so via the EIA-923 monthly survey form.

This means that data for plants that respond at the A frequency may not reflect the actual monthly value. This has implications for several parts of our data pipeline:

1. When allocating EIA-923 data using pudl: it appears that EIA imputes monthly values for data in the gf table, but not for data in the gen or bf tables. This could lead to incorrect allocations.
2. When calculating gross-to-net regression values (or monthly conversion factors), we would only want to use data for plants that report at the M or AM frequency.
3. When identifying the source of hourly data: we need to be careful when using partial-year data from EIA-923 for plants that report at the A frequency, since the monthly values for the months we are filling might not reflect the actual monthly data.
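As a minimal sketch of the gross-to-net point above, records could be filtered on `respondent_frequency` before fitting anything. The record layout, `plant_id`, `net_generation_mwh`, and the helper name are hypothetical; only the column name `respondent_frequency` and the codes A, AM, and M come from the table above.

```python
# Hypothetical records; only the frequency codes are taken from the EIA-923 description.
records = [
    {"plant_id": 1, "respondent_frequency": "M", "net_generation_mwh": 120.0},
    {"plant_id": 2, "respondent_frequency": "AM", "net_generation_mwh": 80.0},
    {"plant_id": 3, "respondent_frequency": "A", "net_generation_mwh": 45.0},
]

# Frequencies for which the respondent supplied the monthly values themselves.
MONTHLY_FREQUENCIES = {"M", "AM"}


def has_reported_monthly_values(record):
    """Return True if the monthly values were reported rather than estimated by EIA."""
    return record["respondent_frequency"] in MONTHLY_FREQUENCIES


# Keep only records whose monthly values are respondent-reported.
monthly_reported = [r for r in records if has_reported_monthly_values(r)]
print([r["plant_id"] for r in monthly_reported])
```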

Issue 3 is tricky, since it could lead to double-counting or under-counting of data for subplants that only report to CEMS for part of the year. One way to approach this would be to calculate what percent of the annual total reported in EIA-923 the partial-year CEMS data accounts for, and then evenly allocate the difference between the two across the months that are not reported in CEMS.
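The even-allocation idea could look roughly like the sketch below. This is not the pipeline's implementation: the function name and inputs are hypothetical, and clipping the remainder at zero when CEMS exceeds the EIA-923 annual total is an assumption of this sketch rather than something decided in this thread.

```python
def allocate_remainder_evenly(annual_total, cems_by_month):
    """Spread (annual EIA-923 total - partial-year CEMS total) over unreported months.

    annual_total: annual value reported in EIA-923 (hypothetical input)
    cems_by_month: dict of month number (1-12) -> CEMS value for reported months
    Returns (share of the annual total covered by CEMS, dict of filled months).
    """
    cems_total = sum(cems_by_month.values())
    missing_months = [m for m in range(1, 13) if m not in cems_by_month]
    cems_share = cems_total / annual_total if annual_total else float("nan")
    if not missing_months:
        return cems_share, {}
    # Assumption: never allocate negative values if CEMS exceeds the annual total.
    remainder = max(annual_total - cems_total, 0.0)
    per_month = remainder / len(missing_months)
    return cems_share, {m: per_month for m in missing_months}


# Illustrative numbers only: CEMS covers June through August.
share, filled = allocate_remainder_evenly(1200.0, {6: 100.0, 7: 100.0, 8: 100.0})
print(share, filled[1])
```

Because the filled months are derived from the annual total, the year sums to the EIA-923 value by construction, which is what avoids the double-counting or under-counting described above.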

grgmiller commented 1 year ago

Based on the 2020 EIA-923 data, net generation and fuel consumption reported at the A frequency account for 10% of the total reported generation and fuel.

grgmiller commented 1 year ago

After exploring this in more detail, here are some summary statistics about the annually reported data:

category (2020)	fuel consumption	net generation	co2 mass
annual as % of total reported EIA-923 data	9.7%	9.7%	3.1%
annual used in final results as % of total reported EIA-923	8.7%	8.9%	1.9%
annual used for partial year as % of total reported EIA-923	0.7%	0.5%	0.9%

So (from the perspective of fuel consumption) annually-reported data makes up about 10% of all EIA-923 data. However, because for some months we have CEMS data available, annually-reported EIA data makes up about 9% of our final results (whereas monthly EIA data makes up about 39% and CEMS data makes up about 52%).

This means about 9% of the final data may have lower quality temporal allocation due to the fact that EIA is imputing the month to which the annual data belongs. But on an annual level, there should be no issue with this data.

Any potential double-counting or under-counting of data would result from instances where we use CEMS data for part of the year and annually-reported EIA data for part of the year, if the annual data wasn't correctly allocated to each month. However, such data only makes up 0.7% of the final data.

Given that this last category makes up such a small percentage of the data, we will want to add this to the list of data caveats, but I'm not sure if we need to prioritize a fix before the initial public release. However, this should be a data quality metric that we track as part of the pipeline.
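A tracked metric of this kind could be as simple as the share of a final value attributed to each data source. The sketch below is only an illustration: the row layout, the source labels, and the function name are hypothetical, and the numbers are made up to mirror the rounded fuel consumption shares quoted above.

```python
def share_by_source(rows, value_key):
    """Return each source's percentage of the total of `value_key` across rows."""
    totals = {}
    for row in rows:
        totals[row["source"]] = totals.get(row["source"], 0.0) + row[value_key]
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("cannot compute shares of a zero total")
    return {source: 100 * value / grand_total for source, value in totals.items()}


# Made-up values chosen to reproduce the approximate 52% / 39% / 9% split.
final_results = [
    {"source": "cems", "fuel_consumed_mmbtu": 520.0},
    {"source": "eia_monthly", "fuel_consumed_mmbtu": 390.0},
    {"source": "eia_annual", "fuel_consumed_mmbtu": 90.0},
]

shares = share_by_source(final_results, "fuel_consumed_mmbtu")
print(shares)
```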

grgmiller commented 1 year ago

For the initial public release, I'm just adding output metrics that describe the data quality in this regard. In the future, we should determine how we want to handle this more thoroughly.

grgmiller commented 1 year ago

We may also want to consider re-imputing the monthly shape of annual data instead of trusting the monthly imputation done by EIA.

singularity-energy / open-grid-emissions

Handling annually-reported data in EIA-923 #170