singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Gailinpease/snakemake proof of concept #283

Open · gailin-p opened 1 year ago

gailin-p commented 1 year ago

This PR demonstrates how we might use Snakemake as a workflow manager by implementing a snakemake rule for the final step in the pipeline (calculating consumed emissions).

Why snakemake?

How does it work?

  1. Define rules that describe which inputs a step takes, what outputs it produces, and what to run to produce those outputs. (In this PR, I use a Python script, but the job can also be a command line script, a Jupyter notebook (including with a custom conda env), or in-line Python.)
  2. On the command line, ask snakemake to produce a result file: `snakemake -n data/results/base/2019/carbon_accounting` (`-n` signals a dry run, so snakemake won't actually produce anything).
  3. Snakemake tells you what rules it's running and why. Currently there's only one rule, but with more rules, snakemake will detect dependencies between input and output files and decide which jobs to run based on updates to input data or code (a sketch of what chained rules could look like follows below). [Screenshot: snakemake dry-run output listing the carbon_accounting job]
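
As a concrete sketch of what chained rules could look like, here are two hypothetical rules; the rule names, file paths, and scripts are illustrative assumptions, not the actual contents of workflow/Snakefile in this PR:

```snakemake
# Hypothetical sketch -- rule names, paths, and scripts are illustrative.
rule clean_eia930:
    input:
        "data/downloads/eia930/{year}.csv",
    output:
        "data/outputs/{run}/{year}/eia930_cleaned.csv",
    script:
        "scripts/clean_eia930.py"

rule carbon_accounting:
    input:
        "data/outputs/{run}/{year}/eia930_cleaned.csv",
    output:
        "data/results/{run}/{year}/carbon_accounting",
    script:
        "scripts/carbon_accounting.py"
```

With this layout, `snakemake -n data/results/base/2019/carbon_accounting` would schedule clean_eia930 (with run=base, year=2019) before carbon_accounting, because the first rule's output is the second rule's input.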

If we instead ask for `data/results/small/2019/carbon_accounting`, snakemake will detect, based on the input for the rule, that we need to use the downloaded 930 file instead of the outputs 930 file, replicating the logic currently in data_pipeline.py. [Screenshot: snakemake dry-run output for the small run]
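
One way this branching can be expressed is with an input function: a rule's input can be a Python function of the wildcards. The sketch below is illustrative (the function name and the 930 file paths are assumptions), but it shows the mechanism:

```snakemake
# Illustrative only -- the 930 file paths and names are assumptions.
def eia930_input(wildcards):
    if wildcards.run == "small":
        # the "small" run reads the raw downloaded 930 file
        return f"data/downloads/eia930/{wildcards.year}.csv"
    # other runs use the 930 file cleaned earlier in the pipeline
    return f"data/outputs/{wildcards.run}/{wildcards.year}/eia930_cleaned.csv"

rule carbon_accounting:
    input:
        eia930=eia930_input,
    output:
        "data/results/{run}/{year}/carbon_accounting",
    script:
        "scripts/carbon_accounting.py"
```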

Advantages of snakemake over data_pipeline.py

Disadvantages

  * Snakemake will likely not be able to detect code changes affecting individual jobs since most jobs depend on code in data_pipeline.py.
  * Because of how snakemake wildcards work, we'll need to put results and outputs in "results/base/{year}" if we also want to use the rules to generate "results/small/{year}".
  * Snakemake will not play nicely with skip_outputs (implemented in this PR as a config option, see config.yaml), since it deletes the output files for a job before running it. I think skip_outputs should be less necessary though if snakemake helps us avoid running the entire pipeline.

gailin-p commented 1 year ago

@miloknowles this may also be interesting to you -- this is a proof of concept for one option for moving away from a monolithic script.

miloknowles commented 1 year ago

This is awesome! Would love to see this implemented in OGE and used in other projects too.

grgmiller commented 1 year ago

This is really interesting! Thanks for looking into this. I agree that this sounds really exciting/promising.

A couple of reactions/questions:

> Snakemake will likely not be able to detect code changes affecting individual jobs since most jobs depend on code in data_pipeline.py.

What would be an example of this? What is the definition of a "job"?

> Because of how snakemake wildcards work, we'll need to put results and outputs in "results/base/{year}" if we also want to use the rules to generate "results/small/{year}".

> Snakemake will not play nicely with skip_outputs (implemented in this PR as a config option, see config.yaml), since it deletes the output files for a job before running it. I think skip_outputs should be less necessary though if snakemake helps us avoid running the entire pipeline.

I think with snakemake implemented, we could probably get rid of both the --small and --skip_outputs options, for the reasons you mentioned.

Other questions

  1. What implications would implementing this have for other OGE users who want to run the OGE pipeline locally on their computer (vs us running on a server where we have the room to store all of the inputs and outputs)? Would they still be able to do so?
  2. What implications would this have for the readability of our code? Is there a new snakemake format/language that would have to be learned to understand what is going on?
  3. What would the data pipeline end up looking like? Would we be breaking off chunks into separate scripts?

gailin-p commented 1 year ago

Code changes and re-running

A job is one piece of the pipeline that we decide is conceptually independent and choose to separate into a snakemake rule. In this PR, that's step 18 of the pipeline: calculating consumed emissions. The biggest practical difference from our current pipeline steps is that a job must read and write files (it can't receive in-memory inputs from another job or pass in-memory outputs to one).

Because each job reads its input data from files, if there have been no changes to the data in those files and no changes to the code that the job runs, we can be confident that the job would produce the same output file it created last time. Snakemake detects this (and performs the same check for every upstream job that produced this job's inputs) and doesn't re-run.

I'm not sure what algorithm snakemake uses to detect code changes. I suspect it's just file timestamps, so any change to data_cleaning.py would trigger all jobs that import it, instead of just the job using the function that changed. We could improve this by splitting data_cleaning.py into smaller files (which will be good for packaging anyway, and will help people new to the project find the code they need).
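
For what it's worth, the Snakemake CLI (version 7 at the time of writing) ships some introspection for this; as far as I know it tracks the rule body or script file rather than imported modules, which is consistent with the concern above. A sketch of the documented pattern, worth verifying on our workflow:

```bash
# List output files whose rule code has changed since they were produced
snakemake --list-code-changes

# Dry-run a forced rerun of just those outputs
snakemake -n -R $(snakemake --list-code-changes)
```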

Other questions

> What implications would implementing this have for other OGE users who want to run the OGE pipeline locally on their computer (vs us running on a server where we have the room to store all of the inputs and outputs)? Would they still be able to do so?

Yes, this should be fine. Most computers have far more disk space than memory, so writing files currently held in memory to disk shouldn't be a problem, though it would probably slow down the first run (later runs would be faster because many jobs would not have to re-run). Also, using snakemake means that users won't have to write, e.g., plant-level files if they only want BA-level outputs, as long as we make those jobs independent.

> What implications would this have for the readability of our code? Is there a new snakemake format/language that would have to be learned to understand what is going on?

Snakemake is a Python-based language. I think it's simpler than data_pipeline.py, because each rule declares its dependencies explicitly, while in data_pipeline.py you have to carefully track a variable through the code to understand where it's used and which steps change it. You can look at workflow/Snakefile in this PR to see what it looks like.

> What would the data pipeline end up looking like? Would we be breaking off chunks into separate scripts?

The logic would be split between workflow/Snakefile, which would have the name of each job and its dependencies (this declaration is called a "rule"), and a set of scripts, each of which would hold the logic for a single job. For very simple jobs (load a file, run a single function, write a file), logic can be written directly in the Snakefile rule, but it's not good practice to hold a lot of logic in a rule.
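
As a concrete (hypothetical) illustration of that split, a rule in workflow/Snakefile would delegate to a per-job script, and the script receives a global `snakemake` object describing its files and wildcards. The rule and script below are sketches; the names and paths are assumptions:

```snakemake
# workflow/Snakefile (illustrative rule -- names and paths are assumptions)
rule consumed_emissions:
    input:
        "data/outputs/{run}/{year}/hourly_emissions.csv",
    output:
        "data/results/{run}/{year}/consumed_emissions.csv",
    script:
        "scripts/consumed_emissions.py"
```

```python
# scripts/consumed_emissions.py (illustrative)
# Scripts run via the `script:` directive get a global `snakemake` object
# exposing the rule's input/output paths and wildcard values.
import pandas as pd

hourly = pd.read_csv(snakemake.input[0])
# ... the logic for this single pipeline step would go here ...
hourly.to_csv(snakemake.output[0], index=False)
```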