metomi / rose

:rose: Rose is a toolkit for writing, editing and running application configurations.
https://metomi.github.io/rose/
GNU General Public License v3.0

Support configuration of Jupyter Notebook jobs #2229

Open sadielbartholomew opened 6 years ago

sadielbartholomew commented 6 years ago

Motivation

I recently read a really interesting blog post about workflow scheduling for data analytics at Netflix [1], along with its follow-up [2]. In short, they have recently begun adopting Jupyter Notebooks extensively as workloads to execute via a scheduler (NB Airflow is referenced in [2]). In particular, they use a library called Papermill [3], though alternative tools for managing & executing notebooks seem to be available & could be explored.

After further investigation & thought, it occurred to me that notebooks (with immutable input & output, as discussed in those posts), configured & executed as scheduled jobs in Cylc suites, could open up possibilities in several areas central to our overall future vision. This is more in the domain of Rose than Cylc, as it concerns jobs & the resources they need rather than the scheduling thereof. Namely (forgive me, the following is a bit of a brain dump):

  1. Modularity & granularity: quoting directly from the blog posts, "because notebooks describe a linear flow of execution, broken up by cells, we can map failure to particular cells". Cells within notebooks would function to an extent as sub-workflows.
  2. Abstraction, containerisation & black-box inputs & outputs (cf. e.g. cylc/cylc#2764): the result of executing a notebook from an immutable input file is an immutable output file, with internals isolated within the notebook server. There is native potential for execution in a containerised environment or in the cloud, e.g. S3 objects are supported.
  3. Flexibility: again quoting, "with different kernels, notebooks can support a wide range of languages and execution patterns" [2].
  4. Documentation & metadata: these sit alongside the code & can be formatted with Markdown &/or written as a workflow 'narrative'.
  5. In general: Jupyter Notebooks are, it seems, increasingly used in scientific &/or data-intensive computing applications. If we can support their use within Rose, it would set us up well for the future.
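
To make the first point above concrete, here is a minimal sketch of how a failure could be mapped to a particular cell. It assumes the executed output notebook records the failure as a standard nbformat v4 `error` output on the cell that raised. The function name `first_failed_cell` and the example notebook are hypothetical, written for illustration only, and only the standard library is used.

```python
import json


def first_failed_cell(notebook):
    """Return (index, ename, evalue) for the first code cell with an error output, or None."""
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                return index, output.get("ename"), output.get("evalue")
    return None


# Hypothetical, hand-written stand-in for an executed notebook (nbformat v4 layout).
example_notebook = json.loads("""
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": "Load data"},
    {"cell_type": "code", "metadata": {}, "execution_count": 1,
     "source": "x = 1", "outputs": []},
    {"cell_type": "code", "metadata": {}, "execution_count": 2,
     "source": "1 / 0",
     "outputs": [{"output_type": "error", "ename": "ZeroDivisionError",
                  "evalue": "division by zero", "traceback": []}]}
  ]
}
""")

print(first_failed_cell(example_notebook))
# → (2, 'ZeroDivisionError', 'division by zero')
```

In a real job the dict would come from `json.load` on the output `.ipynb` file, and the returned index could be reported in the job log so that a failed task points straight at the offending cell.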

Concept

By bundling a fairly lightweight dedicated library such as Papermill, we could (I believe - do chip in if you foresee complications) configure, manage & run Jupyter Notebooks as Cylc jobs by means of a trivial Rose application, as illustrated below.

Bare-bones implementation

With Papermill, notebooks can be executed either through its Python API or from the shell [3]. Python seems more intuitive for processing parameters, so I would be inclined to choose it, but I suspect a CLI approach could also work. For example, a generic solution would be a rose-app.conf as follows:

[command]
default=python run_papermill.py

[env]
# Define variables/parameters required as inputs to the notebook job:
PARAMETER_A=example_string
PARAMETER_B=0.1
# ... etc.
# ... (further parameters)

which would simply execute a Python file that processes the parameters and runs the notebook with these inputs using the papermill API:

import os

import papermill as pm

INPUT_NOTEBOOK_FILE_LOC = "<path>/input_example.ipynb"
OUTPUT_NOTEBOOK_FILE_LOC = "<path>/output_example.ipynb"

# Names of the variables defined under [env] in rose-app.conf
PARAMETER_NAMES = ("PARAMETER_A", "PARAMETER_B")


def convert_env_to_dict(names):
    """Convert environment variables from rose-app.conf [env] to a Python dict."""
    # Values are read as strings; any type conversion would go here.
    # papermill takes a dict such as dict(alpha=0.6, ratio=0.1) (example from its docs).
    return {name: os.environ[name] for name in names}


# Execute the notebook with the parameters as input
pm.execute_notebook(
    INPUT_NOTEBOOK_FILE_LOC,
    OUTPUT_NOTEBOOK_FILE_LOC,
    parameters=convert_env_to_dict(PARAMETER_NAMES),
)
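
Since every environment variable arrives as a string, the conversion step probably needs some type handling (e.g. PARAMETER_B=0.1 should presumably reach the notebook as a float). Below is a rough sketch of one way this might be done using `ast.literal_eval` from the standard library. The function name `convert_env_to_typed_dict` and its `environ` argument are hypothetical, not part of Rose or Papermill, and falling back to a plain string is just one possible policy.

```python
import ast
import os


def convert_env_to_typed_dict(names, environ=os.environ):
    """Build a parameters dict from environment variables, inferring simple types."""
    parameters = {}
    for name in names:
        raw = environ[name]
        try:
            # Parses values written as Python literals: numbers, True/False, None, lists...
            parameters[name] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a Python literal is kept as a plain string.
            parameters[name] = raw
    return parameters


# Stand-in for the environment that the [env] section of rose-app.conf would provide.
example_env = {"PARAMETER_A": "example_string", "PARAMETER_B": "0.1"}
print(convert_env_to_typed_dict(["PARAMETER_A", "PARAMETER_B"], environ=example_env))
# → {'PARAMETER_A': 'example_string', 'PARAMETER_B': 0.1}
```

The resulting dict could then be passed as the `parameters` argument of `pm.execute_notebook`. An explicit per-parameter type map would be stricter than inference, at the cost of more configuration.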

References

  1. Netflix Technology Blog: 'Beyond Interactive: Notebook Innovation at Netflix'
  2. Netflix Technology Blog: 'Scheduling Notebooks at Netflix'
  3. Papermill library codebase & docs
  4. JupyterCon 2018 talk: 'Scheduled notebooks: A means for manageable and traceable code execution' (slides)

hjoliver commented 5 years ago

(interesting idea @sadielbartholomew - somehow I missed this when you posted it... I'll come back to read more closely when I get a chance...)