Closed rouille closed 4 months ago
At a high level, I'm wondering whether the proposed solution here is a bit over-engineered and adds unnecessary complexity. Maybe I don't fully understand what problem the in-memory caching is solving, but it seems like adding a BA code to these warning messages could have been handled with the simpler solution we discussed:
df = df.merge(create_plant_ba_table(year)[["plant_id_eia", "ba_code"]], how="left", on="plant_id_eia", validate="m:1")
If we want to avoid re-loading this dataframe multiple times, another option would be to load it once at the top of validation.py and have each validation function that uses it access it as a global variable.
This is all to say: if there is a compelling reason for caching in memory, let's do that, but that reason is not currently clear to me, and I'm hesitant to add more complexity to OGE if it's not needed.
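To make the two suggestions above concrete, here is a minimal sketch of loading the plant-to-BA table once at module level and merging it into a dataframe. It assumes `create_plant_ba_table(year)` returns one row per plant; the stand-in function, its toy data, the year value, and the `add_ba_code` helper name are all hypothetical and only there so the example runs on its own.

```python
import pandas as pd


def create_plant_ba_table(year: int) -> pd.DataFrame:
    """Hypothetical stand-in for the OGE function; returns toy data."""
    return pd.DataFrame(
        {
            "plant_id_eia": [1, 2, 3],
            "ba_code": ["CISO", "ERCO", "PJM"],
            "state": ["CA", "TX", "PA"],
        }
    )


# Loaded once at import time (e.g. at the top of validation.py) and reused
# by every validation function. The year is an illustrative value.
PLANT_BA = create_plant_ba_table(2022)[["plant_id_eia", "ba_code"]]


def add_ba_code(df: pd.DataFrame) -> pd.DataFrame:
    """Left-merge ba_code onto df; validate='m:1' raises if the lookup
    table has duplicate plant ids."""
    return df.merge(PLANT_BA, how="left", on="plant_id_eia", validate="m:1")


flagged = pd.DataFrame(
    {"plant_id_eia": [1, 1, 3, 4], "net_generation_mwh": [10.0, -5.0, 7.5, 2.0]}
)
result = add_ba_code(flagged)
```

Plants that are missing from the lookup table (plant 4 here) keep their row and get a missing `ba_code`, because the merge is a left join.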
I appreciate all the work to address the warning messages we've been seeing in pandas, as well as splitting out data_cleaning into smaller chunks by adding helpers (although see my comment about ensuring we are consistent about what functions end up where).
A couple other high-level comments:
* If we end up restructuring the code and/or adding new modules, we'll need to make sure that the readme is updated
* Whenever we are adding ba_codes to a dataframe, we should make sure that we are only ever adding this to "throwaway" dfs that are not passed further in the data pipeline; otherwise, we should be sure to drop the ba_code column from the df after the warning message.
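The "throwaway" pattern in the second bullet could look roughly like the sketch below. The check itself (`warn_negative_generation`), the `net_generation_mwh` column, and the `BA_LOOKUP` table are hypothetical and used only for illustration; the point is that `ba_code` is merged into a temporary dataframe used for the log message, while the dataframe returned to the pipeline is left unchanged.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

# Hypothetical lookup table; in OGE this would come from existing functions.
BA_LOOKUP = pd.DataFrame({"plant_id_eia": [1, 2], "ba_code": ["CISO", "ERCO"]})


def warn_negative_generation(df: pd.DataFrame) -> pd.DataFrame:
    """Log rows with negative generation, annotated with ba_code.

    The ba_code column is added only to `report`, a temporary dataframe,
    so it never travels further down the pipeline.
    """
    bad = df[df["net_generation_mwh"] < 0]
    if not bad.empty:
        report = bad.merge(
            BA_LOOKUP, how="left", on="plant_id_eia", validate="m:1"
        )
        logger.warning(
            "Negative generation found:\n%s", report.to_string(index=False)
        )
    return df


data = pd.DataFrame(
    {"plant_id_eia": [1, 2], "net_generation_mwh": [12.0, -3.0]}
)
out = warn_negative_generation(data)
```

If the annotated dataframe did have to be passed along, the alternative from the bullet would be to call `drop(columns="ba_code")` on it after logging.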
I will implement your solution, remove the oge.utility module, and update the README to list the new helper module.
Here is the logfile. It has been compared against the one from the release and it looks good. data_pipeline.log
Purpose
Add BA code to printouts.
What the code is doing
Whenever a plant-level data frame is printed out while running the pipeline, the BA code associated with each plant is retrieved and inserted into the printed data frame. Existing functions are used to build a helper function that returns a dictionary mapping the EIA plant ID to the BA code.
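As a rough illustration of the approach described above (not the actual code in this PR), a dictionary-based helper might look like the following. The stand-in `create_plant_ba_table`, the helper names `plant_to_ba_map` and `with_ba_code`, and the use of `functools.lru_cache` for in-memory caching are all assumptions made for this sketch.

```python
from functools import lru_cache

import pandas as pd


def create_plant_ba_table(year: int) -> pd.DataFrame:
    """Hypothetical stand-in for the existing OGE function; toy data."""
    return pd.DataFrame(
        {"plant_id_eia": [1, 2, 3], "ba_code": ["CISO", "ERCO", "PJM"]}
    )


@lru_cache(maxsize=None)
def plant_to_ba_map(year: int) -> dict:
    """Return {plant_id_eia: ba_code}; cached so the table is built once per year."""
    table = create_plant_ba_table(year)
    return dict(zip(table["plant_id_eia"], table["ba_code"]))


def with_ba_code(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Return a copy of df with ba_code inserted right after plant_id_eia."""
    out = df.copy()
    position = out.columns.get_loc("plant_id_eia") + 1
    out.insert(position, "ba_code", out["plant_id_eia"].map(plant_to_ba_map(year)))
    return out


plants = pd.DataFrame({"plant_id_eia": [2, 1], "co2_mass_lb": [100.0, 250.0]})
printed = with_ba_code(plants, 2022)  # the year is an illustrative value
```

Because `with_ba_code` works on a copy, the original data frame is not modified, which keeps the BA code out of anything passed further down the pipeline.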
Testing
Running the pipeline
Where to look
* oge.helpers module
* pandas related warning throughout the code
* if statement in the check_for_complete_monthly_timeseries in the oge.validation module
Usage Example/Visuals
Review estimate
20min
Future work
Flag anomalous values in CEMS input timeseries.
Checklist
black