singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Gailinpease/snakemake proof of concept #283

Open · gailin-p opened 1 year ago

gailin-p commented 1 year ago

This PR demonstrates how we might use Snakemake as a workflow manager by implementing a snakemake rule for the final step in the pipeline (calculating consumed emissions).

Why snakemake?

How does it work?

  1. Define rules that describe which inputs a step takes, what outputs it produces, and what to run to produce those outputs. (In this PR, I use a Python script, but the job can also be a command line script, a Jupyter notebook (including with a custom conda env), or in-line Python.)
  2. On the command line, ask snakemake to produce a result file: `snakemake -n data/results/base/2019/carbon_accounting` (`-n` signals a dry run, so snakemake won't actually produce anything).
  3. Snakemake tells you what rules it's running and why. Currently there's only one rule, but with more rules, snakemake will detect dependencies between input and output files and decide which jobs to run based on updates to input data or code (a sketch of what chained rules could look like follows below). [Screenshot: snakemake dry-run output listing the carbon_accounting job]
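
As a concrete sketch of what chained rules could look like, here are two hypothetical rules; the rule names, file paths, and scripts are illustrative assumptions, not the actual contents of workflow/Snakefile in this PR:

```snakemake
# Hypothetical sketch -- rule names, paths, and scripts are illustrative.
rule clean_eia930:
    input:
        "data/downloads/eia930/{year}.csv",
    output:
        "data/outputs/{run}/{year}/eia930_cleaned.csv",
    script:
        "scripts/clean_eia930.py"

rule carbon_accounting:
    input:
        "data/outputs/{run}/{year}/eia930_cleaned.csv",
    output:
        "data/results/{run}/{year}/carbon_accounting",
    script:
        "scripts/carbon_accounting.py"
```

With this layout, `snakemake -n data/results/base/2019/carbon_accounting` would schedule clean_eia930 (with run=base, year=2019) before carbon_accounting, because the first rule's output is the second rule's input.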

If we instead ask for `data/results/small/2019/carbon_accounting`, snakemake will detect, based on the input for the rule, that we need to use the downloaded 930 file instead of the outputs 930 file, replicating the logic currently in data_pipeline.py. [Screenshot: snakemake dry-run output for the small run]
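
One way this branching can be expressed is with an input function: a rule's input can be a Python function of the wildcards. The sketch below is illustrative (the function name and the 930 file paths are assumptions), but it shows the mechanism:

```snakemake
# Illustrative only -- the 930 file paths and names are assumptions.
def eia930_input(wildcards):
    if wildcards.run == "small":
        # the "small" run reads the raw downloaded 930 file
        return f"data/downloads/eia930/{wildcards.year}.csv"
    # other runs use the 930 file cleaned earlier in the pipeline
    return f"data/outputs/{wildcards.run}/{wildcards.year}/eia930_cleaned.csv"

rule carbon_accounting:
    input:
        eia930=eia930_input,
    output:
        "data/results/{run}/{year}/carbon_accounting",
    script:
        "scripts/carbon_accounting.py"
```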

Advantages of snakemake over data_pipeline.py

Disadvantages

  * Snakemake will likely not be able to detect code changes affecting individual jobs since most jobs depend on code in data_pipeline.py.
  * Because of how snakemake wildcards work, we'll need to put results and outputs in "results/base/{year}" if we also want to use the rules to generate "results/small/{year}".
  * Snakemake will not play nicely with skip_outputs (implemented in this PR as a config option, see config.yaml), since it deletes the output files for a job before running it. I think skip_outputs should be less necessary though if snakemake helps us avoid running the entire pipeline.

gailin-p commented 1 year ago

@miloknowles this may also be interesting to you -- this is a proof of concept for one option for moving away from a monolithic script.

miloknowles commented 1 year ago

This is awesome! Would love to see this implemented in OGE and used in other projects too.

grgmiller commented 1 year ago

This is really interesting! Thanks for looking into this. I agree that this sounds really exciting/promising.

A couple of reactions/questions:

> Snakemake will likely not be able to detect code changes affecting individual jobs since most jobs depend on code in data_pipeline.py.

What would be an example of this? What is the definition of a "job"?

> Because of how snakemake wildcards work, we'll need to put results and outputs in "results/base/{year}" if we also want to use the rules to generate "results/small/{year}".

> Snakemake will not play nicely with skip_outputs (implemented in this PR as a config option, see config.yaml), since it deletes the output files for a job before running it. I think skip_outputs should be less necessary though if snakemake helps us avoid running the entire pipeline.

I think with snakemake implemented, we could probably get rid of both the --small and --skip_outputs options, for the reasons you mentioned.

Other questions

  1. What implications would implementing this have for other OGE users who want to run the OGE pipeline locally on their computer (vs us running on a server where we have the room to store all of the inputs and outputs)? Would they still be able to do so?
  2. What implications would this have for the readability of our code? Is there a new snakemake format/language that would have to be learned to understand what is going on?
  3. What would the data pipeline end up looking like? Would we be breaking off chunks into separate scripts?

gailin-p commented 1 year ago

Code changes and re-running

A job is one piece of the pipeline that we decide is conceptually independent and choose to separate into a snakemake rule. In this PR, that's step 18 of the pipeline: calculating consumed emissions. The biggest practical difference from our current pipeline steps is that a job must read and write files (it can't receive in-memory inputs from another job or pass in-memory outputs to one).

Because each job reads its input data from files, if there have been no changes to the data in those files and no changes to the code that the job runs, we can be confident that the job would produce the same output file it created last time. Snakemake detects this (and performs the same check for every upstream job that produced this job's inputs) and doesn't re-run.

I'm not sure what algorithm snakemake uses to detect code changes. I suspect it's just file timestamps, so any change to data_cleaning.py would trigger all jobs that import it, instead of just the job using the function that changed. We could improve this by splitting data_cleaning.py into smaller files (which will be good for packaging anyway, and will help people new to the project find the code they need).
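
For what it's worth, the Snakemake CLI (version 7 at the time of writing) ships some introspection for this; as far as I know it tracks the rule body or script file rather than imported modules, which is consistent with the concern above. A sketch of the documented pattern, worth verifying on our workflow:

```bash
# List output files whose rule code has changed since they were produced
snakemake --list-code-changes

# Dry-run a forced rerun of just those outputs
snakemake -n -R $(snakemake --list-code-changes)
```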

Other questions

> What implications would implementing this have for other OGE users who want to run the OGE pipeline locally on their computer (vs us running on a server where we have the room to store all of the inputs and outputs)? Would they still be able to do so?

Yes, this should be fine. Most computers have far more disk space than memory, so writing files currently held in memory to disk shouldn't be a problem, though it would probably slow down the first run (later runs would be faster because many jobs would not have to re-run). Also, using snakemake means that users won't have to write, e.g., plant-level files if they only want BA-level outputs, as long as we make those jobs independent.

> What implications would this have for the readability of our code? Is there a new snakemake format/language that would have to be learned to understand what is going on?

Snakemake is a Python-based language. I think it's simpler than data_pipeline.py, because each rule declares its dependencies explicitly, while in data_pipeline.py you have to carefully track a variable through the code to understand where it's used and which steps change it. You can look at workflow/Snakefile in this PR to see what it looks like.

> What would the data pipeline end up looking like? Would we be breaking off chunks into separate scripts?

The logic would be split between workflow/Snakefile, which would have the name of each job and its dependencies (this declaration is called a "rule"), and a set of scripts, each of which would hold the logic for a single job. For very simple jobs (load a file, run a single function, write a file), logic can be written directly in the Snakefile rule, but it's not good practice to hold a lot of logic in a rule.
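
As a concrete (hypothetical) illustration of that split, a rule in workflow/Snakefile would delegate to a per-job script, and the script receives a global `snakemake` object describing its files and wildcards. The rule and script below are sketches; the names and paths are assumptions:

```snakemake
# workflow/Snakefile (illustrative rule -- names and paths are assumptions)
rule consumed_emissions:
    input:
        "data/outputs/{run}/{year}/hourly_emissions.csv",
    output:
        "data/results/{run}/{year}/consumed_emissions.csv",
    script:
        "scripts/consumed_emissions.py"
```

```python
# scripts/consumed_emissions.py (illustrative)
# Scripts run via the `script:` directive get a global `snakemake` object
# exposing the rule's input/output paths and wildcard values.
import pandas as pd

hourly = pd.read_csv(snakemake.input[0])
# ... the logic for this single pipeline step would go here ...
hourly.to_csv(snakemake.output[0], index=False)
```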