singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Final tasks pre public launch #188

Closed: grgmiller closed this issue 2 years ago

grgmiller commented 2 years ago

Tagging @miloknowles and @burkaman in case you have any thoughts/feedback on any of these issues (or if we are missing anything)

- Code / data validations
- Test the pipeline
- Repository configuration
- Launch (TODO by Monday 8/29)
- Pick a versioning system

gailin-p commented 2 years ago

For our repository, we could use semantic versioning or calendar versioning.

I would prefer semantic versioning for the code. My understanding is that we expect to keep improving the code, and not necessarily on the same schedule as new data years become available. Also, we've written the code to be year-agnostic, so this first release actually covers both 2019 and 2020. In the future we may extend release years backwards (#117) as well as forward, which would make year-based versioning confusing -- a version tagged with an older year could actually be newer code. Another potential issue is that someone looking for the best version for an older data year may mistakenly choose an older release, thinking it's the only version that will work for that year, when really we want them to use the latest version of the code with the --year option.

If we do go with calendar versioning, I think we should use release date, not data date, which would resolve the concerns above but could be confusing if users mistake it for a data date.

I also think we may want to treat the Zenodo archive as two separate repos: one for code, one for data. This is similar to what PUDL does: it has a software release and a data release. Assuming we upload code and data to Zenodo separately, we can have separate (but related) versioning schemes: e.g., if a dataset was created with v0.5.0 of the code and includes years 2019 and 2020, it could be versioned 0.5.0/2019.2020.
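For illustration, the combined data version could be composed from the code's semantic version and the data years it covers. This is a toy sketch only; nothing like this exists in the repo, and the function name is made up:

```python
def data_release_version(code_version: str, years: list[int]) -> str:
    """e.g. data_release_version("0.5.0", [2019, 2020]) -> "0.5.0/2019.2020" """
    return f"{code_version}/{'.'.join(str(y) for y in years)}"
```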

gailin-p commented 2 years ago

Go through all of the outputs/results files and make sure they look as expected

One small concern I have is that the power_sector_data outputs are a mix of wide and long format: different adjustments and pollutants are handled in wide format (different columns), but different fuel types are handled in long format (repeated rows for a single datetime). I always forget this when loading the data, but maybe that's just a me problem? My preference would be to have fuel types as additional columns. I know this makes it more difficult to scan through the columns, but it's more intuitive for me than the mix of formats.
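For anyone who does prefer fuel types as columns, a pivot like the following works as a post-processing step. This is only a sketch: the file path and the column names (datetime_utc, fuel_category, co2_mass_lb, etc.) are assumptions about the output schema, not the actual file layout.

```python
import pandas as pd

# Illustrative only: path and column names are assumptions about the
# power_sector_data output schema.
df = pd.read_csv(
    "outputs/power_sector_data/2020/CISO.csv", parse_dates=["datetime_utc"]
)

# Pivot the long fuel-type rows into wide columns while keeping the
# already-wide adjustment/pollutant columns as values.
wide = df.pivot(
    index="datetime_utc",
    columns="fuel_category",
    values=["net_generation_mwh", "co2_mass_lb", "co2_mass_lb_adjusted"],
)

# Flatten the resulting MultiIndex columns, e.g. "co2_mass_lb_natural_gas".
wide.columns = [f"{value}_{fuel}" for value, fuel in wide.columns]
```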

gailin-p commented 2 years ago

Additional pre-launch issues

gailin-p commented 2 years ago

In the consumed outputs, there are some blank timestamps at the beginning and end of each file... @gailin-p do you know what's going on here?

Yes, I ran the algorithm for every 2020 (or 2019) hour as defined by UTC 2020, but the power_sector_data uses local time to define the year, so there are timestamps at the beginning and end where the consumed emission matrix is missing some values and so can't be solved. Even if I used local time to define the start/end instead, there would still be some missing hours, because 2020 starts for east coast BAs before west coast BAs (and vice versa at the end of the year). I can fix the output by just dropping those incomplete hours.
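The trim could look something like this as a final output step (a minimal sketch; the path and the consumed_co2_rate_lb_per_mwh column name are placeholders, not the actual schema):

```python
import pandas as pd

# Minimal sketch: drop the blank timestamps at the start and end of a
# consumed-emissions file. Path and column name are placeholders.
consumed = pd.read_csv(
    "outputs/carbon_accounting/2020/CISO.csv", parse_dates=["datetime_utc"]
)

# Keep only the span between the first and last hours that were solved.
solved = consumed["consumed_co2_rate_lb_per_mwh"].notna()
first, last = solved.idxmax(), solved[::-1].idxmax()
consumed = consumed.loc[first:last]
```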

grgmiller commented 2 years ago

I think your comments make sense about using semantic versioning over calendar versioning. I like the idea of having a combined version number for the data on Zenodo.

Regarding the format of the power sector data: I kind of like the current format because it avoids having super wide data, and I think it probably makes it easier for people to use the data, instead of having to parse column names to get the data they want (although along those lines, maybe we should consider making the entire csv into long format?)

Maybe let's go ahead and keep only data that falls in the current year based on local time. I think that we want to ensure that we have a complete datetime series for all 8760/8784 hours of the year. Ideally we should figure out a way to ensure that all local-time hours are complete.
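A per-BA completeness check could look roughly like this (a sketch only; load_ba_results, the datetime_local column, and the timezone are hypothetical placeholders, not the pipeline's actual API):

```python
import pandas as pd

# Sketch of a completeness check for one BA's local-time year.
year, tz = 2020, "US/Pacific"
expected = pd.date_range(
    start=pd.Timestamp(f"{year}-01-01", tz=tz),
    end=pd.Timestamp(f"{year + 1}-01-01", tz=tz),
    freq="H",
    inclusive="left",
)
assert len(expected) in (8760, 8784)  # normal vs. leap year

ba_results = load_ba_results("CISO", year)  # hypothetical loader
missing = expected.difference(ba_results["datetime_local"])
print(f"{len(missing)} missing local-time hours")
```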

For the prime mover warning of 91 generators missing prime mover codes, what year is that for? 2020 only has 1 generator with missing codes. And which NOx and SO2 warnings were getting raised?

grgmiller commented 2 years ago

Thoughts on future branching structure

Related to our conversation about versioning, we should also figure out an approach to how we want to manage our branches in the future. My understanding is that whenever the main branch changes we would change the version number, so we might want to set up a dev branch that we would merge all of our feature branches into until we are ready to publish a new release?

gailin-p commented 2 years ago

I kind of like the current format because it avoids having super wide data, and I think it probably makes it easier for people to use the data, instead of having to parse column names to get the data they want (although along those lines, maybe we should consider making the entire csv into long format?)

Ok! I think it comes down to personal preference + what tools someone's using, so I think it's fine if this is how you like it. Just wanted to make sure we'd thought about it.

I think that we want to ensure that we have a complete datetime series for all 8760/8784 hours of the year. Ideally we should figure out a way to ensure that all local-time hours are complete.

Having an entire year of consumed emissions will require feeding the consumed emissions code 6 extra hours of produced emissions data beyond the year itself (3 extra hours at the beginning of the year for western BAs, 3 extra hours at the end of the year for eastern BAs). Does that seem like something we should try to do by next week?

grgmiller commented 2 years ago

Does that seem like something we should try to do by next week?

I think that anyone using the data for carbon accounting will need a complete time series, so this is probably something we need to fix before launch to make the data usable. I suppose the easiest method would just be to forward fill / back fill the missing data. However, it looks like in step 12 of the pipeline, in load_chalendar_for_pipeline() we are already loading data for the given year +/- 1 day (e.g., 12-31-2019 to 01-02-2021), so we could take the same approach in step 17 and then filter the data before outputting... does that seem like a pretty easy change?
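Concretely, something like the following at the end of the consumed-emissions step (a rough sketch of the idea, not the actual pipeline code; run_consumed_emissions and the datetime_local column are placeholders):

```python
import pandas as pd

# Rough sketch of "process a buffered window, then trim before output".
# run_consumed_emissions and the column names are placeholders.
year = 2020
start = pd.Timestamp(f"{year}-01-01", tz="UTC") - pd.Timedelta(days=1)
end = pd.Timestamp(f"{year + 1}-01-01", tz="UTC") + pd.Timedelta(days=1)

# Compute consumed emissions over the buffered UTC window.
results = run_consumed_emissions(start, end)

# For a single BA's output (one local timezone), keep only the local-time
# year before writing the file.
results = results[results["datetime_local"].dt.year == year]
```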

gailin-p commented 2 years ago

However, it looks like in step 12 of the pipeline, in load_chalendar_for_pipeline() we are already loading data for the given year +/- 1 day (e.g., 12-31-2019 to 01-02-2021), so we could take the same approach in step 17 and then filter the data before outputting... does that seem like a pretty easy change?

Unfortunately it's not that simple. The code currently uses the power_sector_data files output by step 16 to get hourly generation and generated emission rates, so we would either need to output those extra hours to the power_sector_data files or rewrite the consumed emission code to take a dataframe from data_pipeline.py instead of using the outputs. I like reading from the files because it keeps the sections of the code more independent -- consumed.py isn't relying on a dataframe that might change. This could be solved in the future by rewriting the code to make more guarantees about data formats, but for now, it's much easier to write and debug consumed emission code that reads all its data from files.

I think this highlights an issue with cutting off the data based on local time -- someone interested in operations over the entire US won't have a whole year for the entire US; instead, they'll have 6 hours with incomplete data. I prefer using UTC times personally, since it simplifies analysis and you can always convert back to local time for visualization.

grgmiller commented 2 years ago

Good point. I like the idea of outputting the extra hours to the power sector results for the reasons you raised.

However, I would advocate for keeping the carbon accounting data sorted by local time. While the user of the power sector data is more likely to be accessing the data programmatically and can easily concat regions and convert timezones, I think the average user of the carbon accounting data may be working with it in csv/excel format and only needs data for their local time. Even organizations that have load in multiple grid regions would only be working relative to the local time of each region.

That being said, I suppose that there are probably certain grid regions that span two different time zones, so an end user might have to convert the reported local timezone into the timezone where their load is located. In that case, they would have a missing hour at the beginning or end of their data, so maybe it would make sense to include some extra data at the beginning and end of the file to accommodate this. Thinking about this issue, maybe in the future, for an improved user experience, it would make sense to add multiple local datetime columns to the carbon accounting data if the region spans multiple timezones, and to name them based on the timezone so that it is more obvious for people who are not familiar with UTC offsets (e.g. datetime_central and datetime_eastern, or datetime_EPT).
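For example, an illustrative sketch of that idea (not a committed design; the example data and timezones are made up):

```python
import pandas as pd

# Illustrative sketch: a region spanning two timezones gets one clearly
# named local-time column per timezone.
df = pd.DataFrame(
    {"datetime_utc": pd.date_range("2020-01-01", periods=3, freq="H", tz="UTC")}
)
df["datetime_eastern"] = df["datetime_utc"].dt.tz_convert("US/Eastern")
df["datetime_central"] = df["datetime_utc"].dt.tz_convert("US/Central")
```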