Motivation
I recently read a really interesting pair of blog posts about workflow scheduling for data analytics at Netflix [1, & the follow-up, 2]. In short, they have recently begun adopting Jupyter Notebooks extensively as workloads to execute via a scheduler (NB: Airflow is referenced in [2]). In particular, they use a library called Papermill [3], though alternative tools for managing & executing notebooks seem to be available & could be explored.
After further investigation & thought, it occurred to me that notebooks (with immutable input & output, as discussed in those posts) configured & executed as scheduled jobs in Cylc suites could open up possibilities in several areas central to our overall future vision. This is more in the domain of Rose than Cylc, as it concerns jobs & the resources they need rather than the scheduling thereof. The areas I have in mind (forgive me, the following is a bit of a brain dump):
Modularity & granularity: quoting directly from the blog posts, "because notebooks describe a linear flow of execution, broken up by cells, we can map failure to particular cells". Cells within a notebook would thus function, to an extent, as sub-workflows.
Abstraction, containerisation & black-box inputs & outputs (cf. e.g. cylc/cylc#2764): the result of executing a notebook from an immutable input file is an immutable output file, with internals isolated within the notebook server. There is native potential for execution in a containerised environment or in the cloud, e.g. S3 objects are supported.
Flexibility: again quoting, "with different kernels, notebooks can support a wide range of languages and execution patterns" [2].
Documentation & metadata: these sit alongside the code & can be formatted with Markdown &/or written as a workflow 'narrative'.
In general: Jupyter Notebooks are, it seems, increasingly used in scientific &/or data-intensive computing applications. If we can support their use within Rose, it would set us up well for the future.
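On the point about mapping failure to particular cells: a notebook stored in the nbformat v4 JSON layout records an error as a cell output with output_type "error", so a failed cell can be located after the fact. Below is a minimal sketch, assuming that layout & assuming the executed notebook has been saved to disk. The names first_failed_cell & load_notebook are hypothetical, introduced only for illustration.

```python
import json


def load_notebook(path):
    """Read a notebook file (nbformat v4 JSON) into a plain dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def first_failed_cell(notebook):
    """Return (cell index, error name, error value) for the first failed code cell.

    Returns None if no code cell carries an error output.
    """
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # only code cells have outputs
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                return index, output.get("ename"), output.get("evalue")
    return None
```

The index is the cell's position in the notebook that was inspected, which may differ from its position in the input notebook if the execution tool adds cells of its own.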
Concept
By bundling a fairly lightweight dedicated library such as Papermill, we could (I believe; do chip in if you foresee complications) configure, manage & run Jupyter Notebooks as Cylc jobs by means of a trivial Rose application, as illustrated below.
Bare-bones implementation
According to its documentation [3], Papermill can execute notebooks either through its Python API or from the shell. Python seems more intuitive for processing parameters, so I would be inclined to choose it, though I suspect a CLI approach could also work. For example, a generic solution would be a rose-app.conf as follows:
    [command]
    default=python run_papermill.py

    [env]
    # Define variables/parameters required as inputs to the notebook job:
    PARAMETER_A=example_string
    PARAMETER_B=0.1
    # ... etc. (further parameters)
which would simply execute a Python file that processes the parameters and runs the notebook with these inputs using the papermill API:
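    import os

    import papermill as pm

    INPUT_NOTEBOOK_FILE_LOC = "<path>/input_example.ipynb"
    OUTPUT_NOTEBOOK_FILE_LOC = "<path>/output_example.ipynb"

    # Names of the variables set under [env] in rose-app.conf.
    PARAMETER_NAMES = ["PARAMETER_A", "PARAMETER_B"]


    def convert_env_to_dict(names):
        """Convert environment variables from rose-app.conf [env] to a Python dict."""
        # Placeholder conversion (an assumption): values are passed through as strings.
        # How to handle types (e.g. "0.1" -> 0.1) still needs to be decided.
        parameters_dict = {name: os.environ[name] for name in names}
        return parameters_dict  # e.g. dict(alpha=0.6, ratio=0.1), example from papermill docs


    # Execute the notebook with the given parameters as input.
    pm.execute_notebook(
        INPUT_NOTEBOOK_FILE_LOC,
        OUTPUT_NOTEBOOK_FILE_LOC,
        parameters=convert_env_to_dict(PARAMETER_NAMES),
    )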
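The conversion step is the main open detail in the script above, since environment variables always arrive as strings. As a minimal sketch, assuming values should be read as Python literals where possible & otherwise left as strings (an assumption, not something Rose or Papermill prescribes), it could look like the following. The helper parse_value & the environ argument are hypothetical, introduced only for illustration.

```python
import ast
import os


def parse_value(raw):
    """Interpret a raw environment string as a Python literal where possible."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal (e.g. a bare word): keep it as a string.
        return raw


def convert_env_to_dict(names, environ=None):
    """Build a parameters dict from the named environment variables."""
    environ = os.environ if environ is None else environ
    return {name: parse_value(environ[name]) for name in names}


if __name__ == "__main__":
    # Stand-in for the variables that rose-app.conf [env] would export.
    example_env = {"PARAMETER_A": "example_string", "PARAMETER_B": "0.1"}
    print(convert_env_to_dict(["PARAMETER_A", "PARAMETER_B"], example_env))
    # prints {'PARAMETER_A': 'example_string', 'PARAMETER_B': 0.1}
```

Using ast.literal_eval keeps the parsing restricted to literals, so nothing in the configuration gets executed as code. Whether the notebook expects the parameter names in this upper-case form is a separate choice that would need agreeing.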
References