grgmiller opened this issue 2 years ago
After looking into this, the challenge with adding additional data for year +/- 1 day is that during the EIA shaping process, we are shaping monthly data in local time - so even if we have hourly profiles for +/- 1 day, there is no monthly data for that month to shape.
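A minimal sketch of why the shaping step breaks down for padding days (hypothetical numbers and a flat profile, not the actual EIA pipeline): monthly totals are distributed across hours using an hourly profile, so a +/- 1 day of extra hourly profile has no corresponding monthly total to scale against.

```python
import pandas as pd

# 48 hours spanning the year boundary: Dec 31, 2019 + Jan 1, 2020
hours = pd.date_range("2019-12-31", periods=48, freq="h")
profile = pd.Series(1.0, index=hours)   # flat hourly profile (illustrative)
monthly_totals = {(2020, 1): 720.0}     # only January 2020 is reported

# Shaping: distribute the monthly total across that month's hours
jan = profile[(hours.year == 2020) & (hours.month == 1)]
shaped_jan = jan / jan.sum() * monthly_totals[(2020, 1)]
print(shaped_jan.sum())  # 720.0 -- the January total is preserved

# The padding day (Dec 31, 2019) has an hourly profile but no monthly
# total to distribute, so it cannot be shaped the same way:
print((2019, 12) in monthly_totals)  # False
```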
Several options:
A handful of hours at the beginning and end of each consumed emissions output file contain missing values.
The consumed emissions matrix uses UTC datetimes, but the input power_sector_data uses local time to define the year, so there are timestamps at the beginning and end of the year where the consumed emissions matrix is missing values and therefore can't be solved. Even if I used local time to define the start/end instead, there would still be some missing hours, because 2020 starts for east coast BAs before it starts for west coast BAs (and vice versa at the end of the year).
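A small sketch of the mismatch (hypothetical dummy data, not project code): reindexing a local-year series onto a UTC-year index leaves NaNs at the edge of the UTC year, which is what shows up as unsolvable hours in the matrix.

```python
import numpy as np
import pandas as pd

# The UTC year 2020 vs. a Pacific-time BA's local year 2020
utc_year = pd.date_range("2020-01-01", periods=8784, freq="h", tz="UTC")
local_year = pd.date_range("2020-01-01", periods=8784, freq="h", tz="US/Pacific")

# Dummy hourly data reported over the BA's local year, aligned to UTC
data = pd.Series(np.ones(8784), index=local_year.tz_convert("UTC"))
aligned = data.reindex(utc_year)

# The BA's local year starts at 08:00 UTC on Jan 1 (UTC-8), so the
# first 8 hours of the UTC year have no data for this BA:
print(aligned.isna().sum())  # 8
```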
Ideally, we want to ensure that we have a complete datetime series for all 8760 (or 8784) hours of the year. A file with missing hours will not be useful to data users.
Producing an entire year of consumed emissions will require feeding 6 hours more than a year of produced emissions data into the consumed emissions code: 3 extra hours at the beginning of the year for western BAs, and 3 extra hours at the end of the year for eastern BAs.
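The offset arithmetic behind that figure can be sketched like this, using the four continental US timezones as hypothetical stand-ins for the actual BA timezone list:

```python
import pandas as pd

year = 2020
utc_starts = {
    tz: pd.Timestamp(f"{year}-01-01", tz=tz).tz_convert("UTC")
    for tz in ["US/Eastern", "US/Central", "US/Mountain", "US/Pacific"]
}
earliest = min(utc_starts.values())  # eastern BAs start their year first
latest = max(utc_starts.values())    # western BAs start 3 hours later

# To solve the matrix at an eastern BA's first local hour, western BAs
# must supply this much data from before their own local year starts
# (and eastern BAs the mirror image at the end of the year):
print(latest - earliest)  # 0 days 03:00:00
```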
The code currently uses the power_sector_data files output by step 16 to get hourly generation and generated emission rates, so we would either need to output those extra hours to the power_sector_data files, or rewrite the consumed emissions code to take a dataframe from data_pipeline.py instead of using the outputs. I like reading from the files because it keeps the sections of the code more independent: consumed.py isn't relying on a dataframe format that might change upstream. This could be solved in the future by rewriting the code to make more guarantees about data formats, but for now, it's much easier to write and debug consumed emissions code that reads all its data from files.
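The file-based decoupling described above could look roughly like this; the path layout and column names here are hypothetical, since the real power_sector_data schema may differ:

```python
import pandas as pd

def load_power_sector_data(ba: str, year: int, path: str = "outputs") -> pd.DataFrame:
    """Read one BA's hourly generation and emission rates from the
    step-16 output files (hypothetical path/columns for illustration).

    Reading from disk rather than accepting a DataFrame from
    data_pipeline.py keeps consumed.py independent of whatever
    in-memory format the upstream pipeline happens to use.
    """
    df = pd.read_csv(
        f"{path}/power_sector_data/{year}/{ba}.csv",
        parse_dates=["datetime_utc"],
    )
    return df.set_index("datetime_utc")
```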
I think this highlights an issue with cutting off the data based on local time -- someone interested in operations over the entire US won't have a whole year of data for the entire US; instead, they'll have 6 hours of incomplete data at the edges. I personally prefer using UTC times, since it simplifies analysis, and you can always convert back to local time for visualization.
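The UTC-first workflow is straightforward to sketch (dummy data, not project code): one UTC index covers the year identically for every BA, completeness is a simple reindex check, and local time is recovered at visualization time.

```python
import numpy as np
import pandas as pd

# One UTC index covers the full year for every BA simultaneously
idx = pd.date_range("2020-01-01", "2021-01-01", freq="h", tz="UTC", inclusive="left")
print(len(idx))  # 8784 (2020 is a leap year)

# Dummy hourly series; reindexing exposes any missing hours as NaN
data = pd.Series(np.ones(len(idx)), index=idx)
complete = not data.reindex(idx).isna().any()
print(complete)  # True

# Convert back to local time only when visualizing a single BA
local = data.tz_convert("US/Eastern")
```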