nteract / papermill

📚 Parameterize, execute, and analyze notebooks
http://papermill.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Support for multiple 'parameters' cells? #328

Open renefritze opened 5 years ago

renefritze commented 5 years ago

I'm currently considering using papermill to generate notebooks as input for pytest testing via nbval. However, my parameterized 'source' notebook would be written in a literate programming style, and I think it might be awkward to collect all parameters in a single cell. As far as I can tell, papermill currently only finds the first cell tagged with 'parameters' and injects a cell right after that. Is this limitation to the first cell a hard design choice? Would you be open to extending it otherwise? Does it even seem feasible?

mbrio commented 5 years ago

I would love to weigh in on this too. I have begun working with papermill in conjunction with sparkmagic and have found the need to use multiple parameter cells. Because papermill doesn't support it, I've hacked around it so that one set of parameters is filled in using environment variables. My use case is that I want to pass in parameters separately for cells with a %spark magic line and cells with a %local magic line. The %local parameters set up all of the configuration, like the Livy URLs needed to start the %spark session locally, and the %spark parameters run externally on the Spark cluster. Currently I am forced to pass the %local parameters (since they are fewer) as environment variables.

MSeal commented 5 years ago

As context, the decision was keep-it-simple-stupid, as more complex or normalized options would be challenging to integrate with existing UIs / interfaces. It also means it's easy to identify everything the user supplied in one place (though it's also in the notebook metadata).

We're always open to proposals for improvement. My only constraint is that it stay simple and easy to use.

I'm curious about two aspects of multiple parameter cells.

First, when using the parameters cell today, what's causing the friction in collecting parameters in one place and referring to them as needed? Is it just noisy when there are many inputs? I use magics with parameters as well and just lean on magic arguments to pass the particular parameter through to sql/presto/spark/bash.
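
As a minimal illustration of that magic-argument pattern (a sketch only, assuming sparkmagic-style $-substitution; the URL and session name are placeholders):

# cell tagged 'parameters' - the single injection point papermill uses today
livy_server_url = "http://livy.example.org:8998"
livy_session_name = "papermill-session"

# a later cell passes the injected values through to the magic via $-substitution
%spark add -s $livy_session_name -l python -u $livy_server_url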

Second, how would you imagine specifying multiple parameter cells? How would you choose which cell to place them in, and how would the user express the differences? It sounded like there were some ideas about how this might look.

mbrio commented 5 years ago

Completely understand KISS, but I think you can solve the problem while still keeping it stupid. I tried to attach the ipynb but it won't let me, so below I'll try to write out what you would see in the cells:

Because I cannot pass two separate sets of parameters, I must pass in my Livy configuration via environment variables. These variables must be set locally so that I can use $-prefixed variables within the %spark add magic call down below. These parameters are used to set up the Spark environment and must be run locally.

Separate from these settings, I need parameters to be set for the script that is meant to run once the Spark session is set up via Livy with %spark add. You can see an example of these parameters below under the heading Parameters.

Looking at how the parameters are set in the pm.execute_notebook function, I think it would be possible to utilize kwargs:

pm.execute_notebook(
    notebook_path,
    out_path,
    parameters=parameters,
    parameters_spark=parameters_spark,
    parameters_some_other_tag=parameters_some_other_tag
)

In this case the kwargs and therefore the tags would be parameters, parameters_spark, parameters_some_other_tag.

IPYNB

This cell has the metadata tag parameters_spark and is a local cell, so it runs locally

%%local

import os

livy_server_url = os.environ['LIVY_SERVER_URL']
livy_session_name = os.environ['LIVY_SESSION_NAME']

runs locally

%load_ext sparkmagic.magics

runs locally

%%spark config

{
    "conf": {
        "spark.pyspark.driver.python": "python3.6",
        "spark.pyspark.python": "python3.6",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        "spark.sql.shuffle.partitions": "2000",
        "spark.default.parallelism": "2000",
        "es.index.auto.create": "true",
        "spark.yarn.executor.memoryOverhead": "6144M",
        "spark.executor.memory": "6144M",
        "spark.driver.memory": "6144M"
    }
}

runs locally

%spark add -s $livy_session_name -l python -u $livy_server_url

Parameters

This cell has the metadata tag parameters and is a pyspark cell, so it runs remotely

input_path = None
output_path = None

is a pyspark cell so this runs remotely

import pyspark.sql.functions as F
spark.read.parquet(input_path).write.parquet(output_path)

Cleanup

runs locally

%spark cleanup

mbrio commented 5 years ago

To speak to one of your previous statements, another reason why splitting parameters up might be helpful is to categorize and document many settings. For instance, I have some notebooks for automating tensorflow models, and some of these models have many hyperparameters; being able to split blocks of similar parameters up for categorization and documentation is handy. Admittedly this could still be done with the current implementation by assigning the parameters tag to the last cell, but I think it would make more intuitive sense if the categories were tagged separately.
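
A sketch of that single-cell workaround, with one cell tagged 'parameters' and the categories kept as comments (the hyperparameter names are illustrative only, not from an actual notebook):

# --- data ---
batch_size = 256
shuffle_buffer = 10_000

# --- model architecture ---
hidden_units = [512, 256]
dropout_rate = 0.3

# --- optimization ---
learning_rate = 1e-3
epochs = 50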

renefritze commented 5 years ago

First, when using the parameters cell today, what's causing the friction in collecting parameters in one place and referring to them as needed?

Our use case would be tutorial-like notebooks where we gently guide a new user through setting up some simulation, and where I feel it would be nicer if parameters show up in cells only once they're needed. However, we would also like to use those notebooks in tests where we exercise the script over some subset of the possible user-set inputs.

Second, how would you imagine specifying multiple parameter cells? How would you choose which cell to place them in, and how would the user express the differences? It sounded like there were some ideas about how this might look.

I had very naively thought the mechanism might not have to change from the outside. pm.execute_notebook would stay the same and the parameters argument would be consumed as necessary, i.e. you just specify all eventually needed inputs there and don't differentiate.

MSeal commented 5 years ago

The pattern suggested seems reasonable. I'm tied up in a few other threads atm, but I think we could continue the conversation around a PR if someone wants to take a stab at implementation.

mirekphd commented 4 years ago

The lack of support for multiple 'parameters' cells makes the injected payloads unnecessarily large under production loads in this otherwise very well designed package, which, as we found out, scales well to production-quality (and production-size...) ML modeling pipelines with hundreds of cells and tens of external custom functions.

One would wish to have at least two injection points for supplying new parameter values from an outside controller notebook to the input notebook, at two different stages of the modeling pipeline.

The current support for only a single injection point requires us to query the object storage in the external controller notebook (where execute_notebook() is run) and inject all possible parameters at once (with some of them modified): that's hundreds of parameters (including lists), instead of just a diff of 1-4 params (time period settings that change every time the archived model is simply re-fit on new data, without altering other settings of the pipeline).

MSeal commented 4 years ago

I'd still hesitate to add multiple parameterizations for the reasons listed earlier in the thread. That being said, how do you think it should look as an interface / reference within papermill? Are you able to write a papermill extension to meet your needs? If not, I do think we should at least make papermill able to register what's needed to perform more complicated patterns for specialized use-cases.
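
As a rough sketch of what such an extension could look like outside papermill itself, pre-injecting extra cells with nbformat before handing off to pm.execute_notebook (execute_notebook_multi, the intermediate file name, and the tag-to-kwarg convention are all hypothetical; only the nbformat and papermill calls are existing APIs):

import nbformat
import papermill as pm

def execute_notebook_multi(notebook_path, out_path, parameters=None, **tagged_parameters):
    # Insert one extra code cell after each cell carrying a matching tag
    # (e.g. parameters_spark={...} targets a cell tagged 'parameters_spark'),
    # then let papermill handle the usual 'parameters' tag on its own.
    nb = nbformat.read(notebook_path, as_version=4)
    for tag, params in tagged_parameters.items():
        source = "\n".join(f"{name} = {value!r}" for name, value in params.items())
        for idx, cell in enumerate(nb.cells):
            if tag in cell.metadata.get("tags", []):
                injected = nbformat.v4.new_code_cell(source=source)
                injected.metadata["tags"] = ["injected-parameters"]
                nb.cells.insert(idx + 1, injected)
                break
    tmp_path = str(notebook_path) + ".injected.ipynb"  # hypothetical intermediate file
    nbformat.write(nb, tmp_path)
    return pm.execute_notebook(tmp_path, out_path, parameters=parameters)

Something along these lines would leave the existing single-'parameters' behaviour untouched while letting a caller target additional tagged cells.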

In similar situations where I've used papermill for modeling, I've seen the model concerns split into a couple of notebooks that can be executed independently, passing the location of data results between them. It's a similar solution to what's done for very large ETL patterns where you need isolation of concerns and efficient, reproducible stages of execution. The cost is that you have to persist intermediate data, usually as an arrow buffer or some other efficiently serialized object, but the win is much more reliable and maintainable processes. Hopefully that helps you in this situation.

mirekphd commented 4 years ago

splitting the model concerns into a couple of notebooks that can be executed independently, passing the location of data results between them. It's a similar solution to what's done for very large ETL patterns where you need isolation of concerns and efficient, reproducible stages of execution. The cost is that you have to persist intermediate data, usually as an arrow buffer or some other efficiently serialized object, but the win is much more reliable and maintainable processes. Hopefully that helps you in this situation.

Such a split of the modeling pipeline into multiple parts like data ingestion, pre-processing, modeling, validation, and model deployment / persistence is probably the best solution for anyone who, like us, is running overcomplicated notebook pipelines that require injection of multiple parameters. We took the first step towards this goal and separated a lot of code into a custom python library, but it's still not radical enough. Splitting the monolithic notebook into several logical components would allow us to manage the workflow better and automate even more.

We had long been considering adopting Airflow or Luigi, but these tools seem too opinionated and biased towards batch processing of everything in .py files (as opposed to a mix of notebooks and py libraries). Papermill seems much closer to data scientists' practice and much friendlier to use than any of these frameworks, with master notebook(s) instead of a master server to log on to, and thus no extra maintenance required, and with jargon kept to a minimum (finally no DAGs and pretty colored graphs), so it was much faster to adopt, with immediate improvements in productivity visible from day two. Many thanks for open sourcing it!

In this particular scenario we managed with just the one injection point that papermill offers. It is sufficient to extract all pipeline settings in one go from the object storage (in our case minIO, where MLflow stores these artifacts during modeling pipeline runs, and where we can later access them from any python notebook using the minio client library). So from papermill's master notebook we query our model artifact store for a JSON file with all pipeline settings (for the particular run of the pipeline that produced the model we want to reproduce or improve), and then papermill injects all these settings, with appropriate changes to some of them, as desired.

The input notebook required virtually no adaptation, just a tagged "parameters" cell with defaults for the settings that will be changed. This injection of very complex, sometimes nested dictionaries of parameters works beautifully; we even inject a new key 'settings_injected_by_papermill' so that we can quickly inspect the injection's payload (the parameters that actually got changed) in the artifact storage as well (where all settings are preserved during the pipeline run). It is a quick way to document the changes made in a particular run, along with the reference point - the original run ID (uniquely allocated by mlflow to every pipeline run). In fact, having only one injection point in papermill made the process simpler and reduced the amount of changes needed in the data science ("input") notebooks, which do not have to contain any code for querying the artifact storage for settings of historical pipeline runs (that code has been moved to the master notebook(s) from which the papermill functions are executed). The limitation proved useful after all :).
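
A rough sketch of that master-notebook pattern (the bucket and object names, the settings schema, and the overridden keys are all hypothetical placeholders):

import json
import papermill as pm
from minio import Minio

# connect to the artifact store where MLflow archived the pipeline settings
client = Minio("minio.example.org:9000", access_key="...", secret_key="...", secure=True)

# fetch the full settings JSON saved for the historical run we want to reproduce
response = client.get_object("mlflow-artifacts", "runs/<run_id>/pipeline_settings.json")
settings = json.loads(response.read())
response.close()

# override only the handful of values that change for this re-fit
overrides = {"train_period_start": "2020-01-01", "train_period_end": "2020-03-31"}
settings.update(overrides)

# record the injected diff so it is preserved alongside the other run artifacts
settings["settings_injected_by_papermill"] = overrides

pm.execute_notebook(
    "model_pipeline.ipynb",
    "model_pipeline_output.ipynb",
    parameters=settings,
)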

MSeal commented 4 years ago

FYI, https://airflow.apache.org/docs/stable/howto/operator/papermill.html is used a lot to have DAGs execute individual notebooks, if you haven't seen it.

Glad the parameterization for papermill worked well for you. We do a similar injection pattern with a master notebook to run integration test suites against our scheduler at Netflix.

Papermill should extend easily from the direct-call pattern to true DAG orchestrators when you end up needing more complex DAG patterns or other orchestration concerns, by making each node a papermill call with all its relevant job parameters. This works surprisingly well and has scaled in use-cases without needing any changes to the notebooks in question over time.
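
For reference, the linked operator turns each notebook into a DAG node with roughly this shape (a sketch based on the linked docs; paths, schedule, and parameter names are placeholders, and the import path differs between Airflow versions):

from datetime import datetime

from airflow import DAG
from airflow.operators.papermill_operator import PapermillOperator

with DAG("notebook_pipeline", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    ingest = PapermillOperator(
        task_id="ingest",
        input_nb="notebooks/ingest.ipynb",
        output_nb="runs/ingest-{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )
    train = PapermillOperator(
        task_id="train",
        input_nb="notebooks/train.ipynb",
        output_nb="runs/train-{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )
    ingest >> train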

mirekphd commented 4 years ago

without needing any changes to the notebooks in question over time.

In our use case one type of change did turn out to be required: moving all parameters that are conditionally defined on the basis of other "base" parameters below papermill's injection point. A typical case was auto-selection of derived parameters, such as metrics, depending on the model type (regression vs. binary classification), designed to increase automation (reduce the number of manual parameter changes). Before the refactoring, any changes made to the conditioning parameter (here: model type) via papermill could not be propagated to the derived parameters (here: metrics), because the condition was executed earlier in the code flow, using the default values pre-existing in the input notebook, unmodified by papermill. What was needed was a separate section with all "auto-gen" params, placed below papermill's injection point; once all such "definitions" (which were in fact already programming code) were collected there, everything worked fine.
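
A sketch of the resulting cell layout (cell boundaries shown as comments; the parameter names are illustrative, not taken from the actual pipeline):

# --- cell tagged 'parameters' (papermill injects its cell right after this one) ---
model_type = "binary_classification"

# --- papermill's injected cell lands here, e.g. overriding model_type ---

# --- separate "auto-gen" cell placed below the injection point ---
# derived values are computed after injection, so they see the overridden model_type
if model_type == "regression":
    metrics = ["rmse", "mae", "r2"]
else:
    metrics = ["roc_auc", "f1", "log_loss"]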