singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Compress OGE Outputs #366

Closed · grgmiller closed 1 month ago

grgmiller commented 1 month ago

Purpose

As we begin exporting historical data for almost two decades, it is becoming harder to store all of the files locally because of limited hard drive space. The outputs folder currently takes up about 15 GB per year, so running all 17 years would require over 250 GB of space.

To address this, this PR compresses all output files (files in the outputs folder) to .csv.zip files, which should reduce their size by about 85%.

This changes the file extension from .csv to .csv.zip, so existing functions in external projects that read these files will need to update the filename passed to pd.read_csv(). However, the advantage of this naming convention is that the default value of the pd.read_csv() "compression" argument is "infer", which means that pandas will automatically detect the zip format from the filename, and no other arguments need to change.
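For illustration, a minimal sketch of what this looks like for a downstream reader (the path below is a placeholder, not an actual OGE output name):

```python
import pandas as pd

# compression defaults to "infer", so pandas detects the zip archive
# from the ".csv.zip" extension; no other arguments need to change.
df = pd.read_csv("outputs/2020/example_output.csv.zip")

# Equivalent explicit form:
df = pd.read_csv("outputs/2020/example_output.csv.zip", compression="zip")
```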

Another advantage of this file extension is that it 1) makes it clear that the zipped file is a csv, and 2) lets the user open the file directly in Excel without decompressing it first (at least on a Windows machine).

Importantly, pd.read_csv() can read these zipped files directly from s3, so this new implementation is agnostic to whether the data is read locally or from s3.

Results:
- Pre-compression outputs folder: 15.8 GB total (14.8 GB of which were the non-EIA930 csvs)
- Post-compression: 2.4 GB (1.4 GB of which were the non-EIA930 csvs)

Closes CAR-4209

What the code is doing

In addition to the above, I made a small change in impute_hourly_profiles to merge only specific columns from the plant attributes table into CEMS, since I kept hitting a memory error on my computer at that merge step. This change seems to have fixed the memory error.
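As a rough sketch of that change (assuming cems and plant_attributes are already-loaded DataFrames; the column names here are illustrative, not the exact ones used in impute_hourly_profiles):

```python
# Merge only the columns CEMS actually needs from the plant attributes
# table, rather than the full table, to keep peak memory usage down.
KEEP_COLS = ["plant_id_eia", "ba_code", "fuel_category"]
cems = cems.merge(
    plant_attributes[KEEP_COLS],
    how="left",
    on="plant_id_eia",
)
```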

Testing

First, I created a zipped version of an existing output file locally:

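Roughly equivalent to the following (paths are placeholders; the "archive_name" option sets the name of the csv inside the zip, though to_csv would also infer zip compression from the ".csv.zip" extension on its own):

```python
import pandas as pd

# Round-trip an existing output through a zip archive.
df = pd.read_csv("outputs/2020/example_output.csv")
df.to_csv(
    "outputs/2020/example_output.csv.zip",
    index=False,
    compression={"method": "zip", "archive_name": "example_output.csv"},
)
```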

I was then able to read the file locally just by changing the file extension passed to pd.read_csv().


After uploading this test file to s3, I was also able to read it from there:

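For example (the bucket and key below are placeholders; reading from s3 with pandas requires s3fs to be installed):

```python
import pandas as pd

# The same call works against s3: pandas still infers zip compression
# from the filename, regardless of where the file lives.
df = pd.read_csv("s3://example-bucket/outputs/2020/example_output.csv.zip")
```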

Finally, I did a test run of the full pipeline for 2019.

Where to look

The main change is in output_data, since we have a single function that handles all data exports to the outputs folder.
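As a hypothetical sketch of the shape of that change (the function and argument names below are illustrative, not OGE's actual API):

```python
import pandas as pd

def output_to_csv(df: pd.DataFrame, output_folder: str, filename: str) -> None:
    # Appending ".zip" to the extension is enough: to_csv infers zip
    # compression from the filename, so every export goes through the
    # same compressed path.
    df.to_csv(f"{output_folder}/{filename}.csv.zip", index=False)
```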

I also had to update the file extension wherever these files are read.

Usage Example/Visuals

See the Testing section above.

Review estimate

10 min

Future work

All of the files in outputs/eia930 remain uncompressed, because compressing them would also require modifying the gridemissions code. This folder alone takes up about 1 GB, so we may want to compress it in the future. However, this folder is only populated for years after 2019, so it has a limited impact on the total size of all historical years.

Checklist