singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Problems with raw EIA-923 data for 2020 #175

Closed miloknowles closed 1 year ago

miloknowles commented 1 year ago

When running the data pipeline, one of the steps is to download the raw EIA-923 and EIA-860 data for a specified year. For some reason, the zip file downloaded for the 2020 EIA-923 data is corrupted or unreadable.

Download URL: https://www.eia.gov/electricity/data/eia923/archive/xls/f923_2020.zip

I confirmed that a manually downloaded copy is also unreadable, and that the file is not empty (it contains about 20 MB of zipped data).

@grgmiller @gailin-p have you encountered this error before? I assume you've worked with the raw 2020 data successfully, so I must be doing something wrong.

For reference, this is my output when unzipping:

(hourly-egrid) milo.knowles@milo-osx eia923 % unzip f923_2020.zip
Archive:  f923_2020.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of f923_2020.zip or
        f923_2020.zip.zip, and cannot find f923_2020.zip.ZIP, period.
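The same check can be reproduced from Python, which may help narrow down whether the file is truncated or only partly damaged. The sketch below is illustrative and not code from the pipeline: `check_zip` is a hypothetical helper name, and it relies only on the standard-library `zipfile` module.

```python
import zipfile


def check_zip(path_or_file):
    """Return (ok, message) describing whether an archive looks readable.

    Hypothetical helper for illustration; accepts a path or a seekable
    binary file object.
    """
    # is_zipfile() searches for the end-of-central-directory record, the
    # same structure that `unzip` reported as missing above.
    if not zipfile.is_zipfile(path_or_file):
        return False, "not a zip file (no end-of-central-directory record)"
    with zipfile.ZipFile(path_or_file) as archive:
        # testzip() reads every member and returns the name of the first
        # one with a bad CRC or header, or None if all members are fine.
        bad_member = archive.testzip()
    if bad_member is not None:
        return False, f"corrupt member: {bad_member}"
    return True, "ok"
```

Calling `check_zip("f923_2020.zip")` on a file like the one above would be expected to fail at the first step, since the archive's trailing record cannot be found.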

grgmiller commented 1 year ago

Hi Milo - I haven't encountered this before. When I clicked the URL you shared, the downloaded zip file wouldn't open on my computer either. However, when I tried downloading it a second time, it did work! I've re-downloaded it multiple times since and it is working consistently now.

One thing I've found when working with zip files is that a temporary file containing partial data seems to download at the same time, and it isn't removed the instant the download finishes. I usually insert a short sleep() call after downloading a zip file to make sure that any temporary files are cleared before continuing. That might be the issue in this case?

LMK if this continues to be an issue, and maybe we can add some sort of error handling to re-try the download if this happens. Do you happen to know the name of the error that was raised when running the pipeline?
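As a rough sketch of what that retry handling could look like (an assumption, not the pipeline's actual code): download the bytes, validate them with `zipfile.is_zipfile`, and try again after a short wait if the check fails. The names `fetch_bytes` and `download_zip_with_retry`, and the default attempt count and wait time, are hypothetical. For reference, opening an unreadable archive with `zipfile.ZipFile` normally raises `zipfile.BadZipFile`.

```python
import io
import time
import urllib.request
import zipfile


def fetch_bytes(url, timeout=60):
    """Download a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_zip_with_retry(url, fetch=fetch_bytes, attempts=3, wait_seconds=5.0):
    """Return the bytes of a readable zip archive, retrying on bad downloads.

    Hypothetical helper: `fetch` is injectable so the retry logic can be
    exercised without network access.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            data = fetch(url)
            # Validate in memory before anything is written to disk, so a
            # partial download never replaces a good file.
            if zipfile.is_zipfile(io.BytesIO(data)):
                return data
            last_error = zipfile.BadZipFile(
                f"attempt {attempt}: got {len(data)} bytes, not a readable zip"
            )
        except OSError as error:  # urllib.error.URLError subclasses OSError
            last_error = error
        if attempt < attempts:
            time.sleep(wait_seconds)
    raise RuntimeError(
        f"no valid zip from {url} after {attempts} attempts"
    ) from last_error
```

A retry only helps if the server eventually returns a good file; if every attempt fails, the chained exception records what the last attempt saw.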

miloknowles commented 1 year ago

Still no success after manually downloading in my browser 4-5 more times. I'll keep trying periodically to see if I can get that 2020 data.