jdechalendar / gridemissions

Tools for power sector emissions tracking
MIT License

Bulk Processed Files - dealing with duplicate periods across years #30

Open klo9klo9kloi opened 2 months ago

klo9klo9kloi commented 2 months ago

When looking at e.g. EIA930_2019_Jan_Jun_co2.csv, the first period is a few hours after 2019-01-01 00:00:00 rather than exactly at midnight, which I'm guessing has something to do with UTC shifting.

Sticking with that example, EIA930_2018_Jul_Dec_co2.csv also runs a few hours into 2019, so concatenating the two files produces some duplicate periods.

If I am aggregating emissions by year, what is the proper way to deal with these duplicate period rows? Aggregate? Take the one from the latest year?
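
To make the question concrete, here is a minimal pandas sketch of one possible way to inspect and collapse the overlap. It is not an answer confirmed by the maintainers. The toy frames, the `co2` column name, and the `combine_halves` helper are hypothetical stand-ins for the two half-year files loaded with a UTC timestamp index. Keeping the later file's row is only a safe choice if the duplicated rows carry the same values in both files, which the sketch checks first.

```python
import pandas as pd


def combine_halves(first: pd.DataFrame, second: pd.DataFrame):
    """Concatenate two timestamp-indexed frames and report their overlap.

    Returns the de-duplicated frame, the overlapping timestamps, and a flag
    saying whether the overlapping rows are identical in both inputs.
    """
    overlap = first.index.intersection(second.index)
    # If the duplicated periods agree, it does not matter which copy is kept.
    same = first.loc[overlap].equals(second.loc[overlap])
    combined = pd.concat([first, second])
    # Keep one row per timestamp (here: the copy from the later file).
    combined = combined[~combined.index.duplicated(keep="last")].sort_index()
    return combined, overlap, same


# Toy stand-ins for the two files: hourly data with a 3-hour overlap.
idx_2018 = pd.date_range("2018-12-31 22:00", periods=11, freq="h")
idx_2019 = pd.date_range("2019-01-01 06:00", periods=5, freq="h")
half_2018 = pd.DataFrame({"co2": idx_2018.hour.astype(float)}, index=idx_2018)
half_2019 = pd.DataFrame({"co2": idx_2019.hour.astype(float)}, index=idx_2019)

combined, overlap, same = combine_halves(half_2018, half_2019)
print(len(overlap), "overlapping periods; identical values:", same)

# Yearly aggregation after de-duplication counts each period once.
yearly = combined.groupby(combined.index.year)["co2"].sum()
print(yearly)
```

If the flag comes back `False`, the duplicated rows differ between files and dropping one copy would silently discard information, so that case would need a closer look before aggregating.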

jdechalendar commented 1 month ago

Did you check if the raw EIA files also had this overlap?

How many hours of overlap are you seeing?

I think this code should be working in UTC only internally, converting only for exports/plotting.

klo9klo9kloi commented 1 month ago

Sifting through the bulk processed files again, it looks like the same hours always overlap between neighboring files: 20XX-01-01 06:00:00, 20XX-01-01 07:00:00, and 20XX-01-01 08:00:00 at the January boundary, and 20XX-07-01 05:00:00, 20XX-07-01 06:00:00, and 20XX-07-01 07:00:00 at the July boundary.

I just checked the raw EIA files and they do not have this problem; the next file always just picks up at the next hour.
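
As a side check on the UTC-shifting guess, the short sketch below converts the overlapping UTC hours to a few US time zones using only the standard library. The year 2019 is just one instance of 20XX, and the choice of zones and the `local_times` helper are assumptions made for illustration. It shows how the timestamps line up with local clocks and does not by itself establish why the processed files overlap.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Overlapping periods reported above, with 2019 standing in for 20XX.
OVERLAP_UTC = [
    "2019-01-01 06:00:00",
    "2019-01-01 07:00:00",
    "2019-01-01 08:00:00",
    "2019-07-01 05:00:00",
    "2019-07-01 06:00:00",
    "2019-07-01 07:00:00",
]

# Zones chosen for illustration only.
ZONES = ["America/Chicago", "America/Denver", "America/Los_Angeles"]


def local_times(stamp: str) -> dict:
    """Interpret `stamp` as UTC and return its wall-clock time in each zone."""
    utc = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        zone: utc.astimezone(ZoneInfo(zone)).strftime("%Y-%m-%d %H:%M")
        for zone in ZONES
    }


for stamp in OVERLAP_UTC:
    print(stamp, "UTC ->", local_times(stamp))
```

In both the January and July cases, the three overlapping hours fall on local midnight in the Central, Mountain, and Pacific zones respectively, with the one-hour difference between the two sets matching daylight saving time.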