singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Missing timestamps at beginning and end of consumed emissions output #193

Open grgmiller opened 2 years ago

grgmiller commented 2 years ago

A handful of hours at the beginning and end of each consumed emissions output file contain missing values.

This happens because the consumed emissions matrix uses UTC datetimes, while the input power_sector_data uses local time to define the year. As a result, there are timestamps at the beginning and end of the year where the consumed emissions matrix is missing some values and so can't be solved. Even if I used local time to define the start/end instead, there would still be some missing hours, because 2020 starts for east coast BAs before west coast BAs (and the reverse gap appears at the end of the year).
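
To make the mismatch concrete, here is a minimal sketch that counts the UTC hours in which only some BAs have data. It assumes fixed standard-time offsets (UTC-5 and UTC-8) and two illustrative BA labels; neither the names nor the helper come from the repository.

```python
import pandas as pd

YEAR = 2020
ONE_HOUR = pd.Timedelta(hours=1)

# Assumption: fixed standard-time offsets from UTC; the BA labels are illustrative.
UTC_OFFSETS = {"eastern_ba": -5, "western_ba": -8}


def utc_coverage(year, offset_hours):
    """Return the first and last UTC hour covered by one local-time year."""
    local_start = pd.Timestamp(year=year, month=1, day=1)
    local_end = pd.Timestamp(year=year, month=12, day=31, hour=23)
    shift = pd.Timedelta(hours=-offset_hours)  # convert local time to UTC
    return local_start + shift, local_end + shift


coverage = {ba: utc_coverage(YEAR, off) for ba, off in UTC_OFFSETS.items()}

# Span in which at least one BA has data, and span in which every BA has data.
union_start = min(start for start, _ in coverage.values())
union_end = max(end for _, end in coverage.values())
common_start = max(start for start, _ in coverage.values())
common_end = min(end for _, end in coverage.values())

all_hours = pd.date_range(union_start, union_end, freq=ONE_HOUR)
incomplete = all_hours[(all_hours < common_start) | (all_hours > common_end)]

print(len(incomplete))
print(incomplete[0], incomplete[-1])
```

Under these assumed offsets, six hours fall outside the common span: three at the start (only the eastern BA has data) and three at the end (only the western BA has data).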

Ideally, we want to ensure that we have a complete datetime series for all 8760/8784 hours of the year. If data is missing, the output will not be useful to data users.
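
One way to check for a complete series is sketched below. It assumes the output is an hourly pandas Series with a UTC index; `expected_hours`, `full_year_index`, and `missing_hours` are hypothetical helper names, not functions from the repository.

```python
import calendar

import pandas as pd


def expected_hours(year):
    """Number of hours in a calendar year (8784 in a leap year, else 8760)."""
    return 8784 if calendar.isleap(year) else 8760


def full_year_index(year):
    """Every UTC hour of `year` as a timezone-aware DatetimeIndex."""
    return pd.date_range(
        f"{year}-01-01 00:00",
        f"{year}-12-31 23:00",
        freq=pd.Timedelta(hours=1),
        tz="UTC",
    )


def missing_hours(series, year):
    """Return the UTC hours of `year` that are absent from `series` or NaN."""
    reindexed = series.reindex(full_year_index(year))
    return reindexed.index[reindexed.isna()]
```

An empty result from `missing_hours` would indicate that every hour of the year is present and non-null.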

Having an entire year of consumed emissions will require having 6 hours more than a year (3 extra hours at the beginning of the year for western BAs, 3 extra hours at the end of the year for eastern BAs) of produced emissions data fed into the consumed emissions code.

The code currently uses the power_sector_data files output by step 16 to get hourly generation and generated emission rates, so we would either need to output those extra hours to the power_sector_data files or rewrite the consumed emission code to take a dataframe from data_pipeline.py instead of using the outputs. I like reading from the files because it keeps the sections of the code more independent -- consumed.py isn't relying on a dataframe that might change. This could be solved in the future by rewriting the code to make more guarantees about data formats, but for now, it's much easier to write and debug consumed emission code that reads all its data from files.

I think this highlights an issue with cutting off the data based on local time -- someone interested in operations over the entire US won't have a whole year for the entire US; instead, they'll have 6 hours with incomplete data. I prefer using UTC times personally, since it simplifies analysis and you can always convert back to local time for visualization.

grgmiller commented 2 years ago

After looking into this, the challenge with adding additional data for year +/- 1 day is that during the EIA shaping process, we are shaping monthly data in local time - so even if we have hourly profiles for +/- 1 day, there is no monthly data for that month to shape.

Several options:

  1. Hardest: Load data for the 2 months surrounding the given year. I think that actually loading data for those edge hours will be difficult, especially since complete 2021 EIA data is not published yet (although incomplete monthly/early-release data is available). However, this seems like a lot of work for 6 hours of data (0.07% of the data).
  2. Easiest: Before solving the matrix, forward-fill and backfill any missing values for these edge hours (e.g. west coast BAs would have 3 hours of backfilled data at the beginning of the year, and east coast BAs would have 3 hours of forward-filled data at the end of the year). This assumes that Alaska and Hawaii are not included in the consumed emissions calculations? Within this option, there are different ways we could consider filling these hours. The first would be to use a simple ffill/bfill. We could also potentially just copy the hourly values from the next/preceding day and use those. There are potentially more complex methods, but I'm not sure they are worth it.
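
The two filling approaches in option 2 could look roughly like the sketch below. It assumes an hourly table with one column per BA and NaN where a BA has no data; the column names, the toy values, and both helper functions are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical 72-hour table: one column per BA, NaN where a BA has no data.
index = pd.date_range("2020-01-01 05:00", periods=72, freq=pd.Timedelta(hours=1))
rates = pd.DataFrame(
    {
        "eastern_ba": np.arange(72, dtype=float),
        "western_ba": np.arange(72, dtype=float),
    },
    index=index,
)
rates.iloc[:3, rates.columns.get_loc("western_ba")] = np.nan  # leading gap
rates.iloc[-3:, rates.columns.get_loc("eastern_ba")] = np.nan  # trailing gap


def fill_edge_hours(df, limit=3):
    """Backfill then forward-fill gaps of up to `limit` hours.

    Note: this also fills short interior gaps, not only the edge hours.
    """
    return df.bfill(limit=limit).ffill(limit=limit)


def fill_from_adjacent_day(df):
    """Fill gaps with the same hour of the next day, then of the previous day.

    Assumes a regular hourly index, so 24 rows equal one day.
    """
    return df.fillna(df.shift(-24)).fillna(df.shift(24))


simple = fill_edge_hours(rates)
adjacent = fill_from_adjacent_day(rates)
```

The simple fill repeats the nearest available value, while the adjacent-day fill preserves the hour-of-day pattern, which may matter if the rates have a strong daily shape.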

grgmiller commented 2 years ago

Advanced by https://github.com/singularity-energy/open-grid-emissions/pull/199