Crashes with out of memory on full run

gailin-p commented 2 years ago

On my 16GB ram mac, data_pipeline without --small crashes. The only error I get is "Killed: 9", indicating out of memory.

This problem appeared after https://github.com/singularity-energy/hourly-egrid/pull/84, which added back data that was being erroneously dropped during groupby (https://github.com/singularity-energy/hourly-egrid/issues/81)

grgmiller commented 2 years ago

Potential options for fixing

Specify / convert dtypes for each dataframe column to optimize memory, downcasting wherever possible (see also https://github.com/singularity-energy/hourly-egrid/issues/35)
Ensure that metadata columns are stored as categorical dtypes (includes columns that indicate data source, method used, etc)
Move metadata columns to separate dataframe
Many rows of data are zeros... is there a way to store this more efficiently without sacrificing data completeness?
Clear memory of dataframes once they are exported to csv and/or no longer needed in the data pipeline (consider using garbage collection)
Be careful when copying dataframes
Switch to using dask (if switching to dask, we would want to make sure that all of the pandas functionality we currently use is compatible with dask)

This issue mostly becomes salient when concatenating the three dataframes of hourly plant level data together into a new dataframe (cems, eia923_shaped, and partial_cems). This would be a good place to pay attention to this

gailin-p commented 2 years ago

I think the advantage of dask (if we can get it working) is that it should scale better than the other solutions, potentially allowing:

Analysis of multiple years
Users with smaller machines to run the pipeline
More or more complex analysis on the full dataset if we eventually want that

grgmiller commented 2 years ago

You also noted that shaped_eia_data and combined_plant_data are huge files. Part of the reason for this is that they currently include data at the subplant level rather than the plant level.

Part of the reason that I had split out the data by subplant was because this was the level at which we apply our data cleaning and imputation methodologies, so a single plant may use methods of different types.

However, there is probably a more efficient way to track these methodological inputs rather than keeping data at the subplant level in the final data. I suspect that if we output plant level data instead of subplant level data it would significantly reduce memory usage.

grgmiller commented 2 years ago

So after running the full data pipeline, I found that while outputs/cems_2020.csv is about 6.8 GB, the outputs/shaped_eia923_data_2020.csv file is 16.4 GB (!). Although the shaped EIA data accounts for < 10% of all emissions, it may contain a large number of small plants, making the file quite large.

This dataframe contains 78.9 million rows of data and 22 columns. Of these, about 14.1 million rows (17.9%) contain all zeros, and 1.9 million rows (2.4%) are NA. There are 9966 unique plant IDs in this data.

After thinking about this more, I am wondering if we should actually be assigning an hourly profile to each specific plant, or whether this provides a false sense of precision that may not be very helpful or accurate. For example, while each natural gas plant in this dataframe has different monthly data, the hourly shape assigned to all natural gas plants in the same BA will be the same.

Instead, it might be a better approach to aggregate all of the plants in monthly_eia_data_to_shape by fuel category and BA, so that we would only have hourly data for all of the non-CEMS reporting in accurate, while still representing the aggregate hourly profile of these generators. This should significantly reduce the size of this dataframe. One potential downside of this approach is that if only part of a plant reports to CEMS, then the plant-level data reported for that plant would be incomplete.

If taking this approach, we may want to come up with a synthetic plant id number so that these aggregated plants can consistently be identified. For example, we could create a 5-digit plant ID starting with 9 where the next two numbers identify the BA, and the final 2 numbers identify the fuel category. For example, if we said that ERCOT was 07 and natural gas was 01, then the synthetic plant ID would be 90701.

Another option which may help address this issue in the short term would be to not try and export plant-level data in the short run due to the size of the output data - we could potentially add this for a future release of the data.

Another option to reduce the size of the dataframe would be to limit the number of columns. We could potentially remove:

report date column
profile method - split into a separate dataframe as metadata (or code the methods as integers to reduce the size of this column)
Remove the five adjusted emissions totals, and keep only the total mass and for electricity mass columns.

In summary:

Consider shaping fuel-BA aggregated data instead of plant level data, otherwise:
Reduce the dimensionality of the plant-level dataframe by: 1) Removing certain columns 2) Removing plant-hours with all zero or NA data

gailin-p commented 2 years ago

I think it would be a good idea to report the plant-level data for non-CEMS plants in separate files, whether that eventually looks like individual profiles or BA-level aggregates, to reduce the possibility of confusion about the accuracy of that data.

For the first release, I think it's useful to report the plant-level hourly CEMS data, which is a small subset of the plants (only ~10% of plants report to CEMS) so shouldn't be a size issue. Sticking to just that data for the initial release would make it easier to communicate the certainty we have in the hourly profiles (pretty high for CEMS plants) and fix the size issue (with some refactoring to avoid making the non-CEMS plant hourly profiles into an in-memory dataframe), while still giving users access to a clean, net-generation version of CEMS data for plants where it's available.

One issue I see with relying on aggregates of non-CEMS plants to deal with size is that it removes options for other hourly profile allocation methods. If we want to explore methods like the load-based allocation you've proposed, we'll eventually need to handle hourly data for all the non-CEMS plants in the pipeline, even if we don't export it.

grgmiller commented 2 years ago

Ok so it sounds like what we might want to do is:

for results/plant_data:

[ ] write a file only for CEMS data (maybe aggregated to subplant instead of plant)
[ ] write a separate files for partial cems data (also aggregated at subplant)

I think for now, we may want to shape the monthly EIA data at the BA-fuel level. I agree that this closes options for other allocation methods, but we could re-implement plant-level shaping in future if needed. However, if we want to keep plant-level allocation, then we definitely need to implement dask at this step (and maybe we should just make the plunge and implement dask throughout the pipeline).

gailin-p commented 2 years ago

Closed by #91

singularity-energy / open-grid-emissions

Crashes with out of memory on full run #86