nteract / papermill

📚 Parameterize, execute, and analyze notebooks
http://papermill.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Support for custom translators #412

Open bnaul opened 5 years ago

bnaul commented 5 years ago

Along the lines of #215, it seems like there are quite a few parameter types that would be desirable to pass as inputs (most notably pd.DataFrames) and that are simple enough to translate to/from JSON. Doing the transformation manually every time is a bit of a headache; is there any current way to register a custom translator that would handle this conversion automatically? Monkey-patching translate https://github.com/nteract/papermill/blob/718f39e9012bc2cc14a0706801e2f2e934b0c1b6/papermill/translators.py#L80-L99 seems like the best option at the moment, but being able to explicitly specify a translator at runtime seems a lot safer.
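
For concreteness, the manual round-trip being described looks roughly like this (a sketch; the df_json parameter name is illustrative, not part of papermill):

import pandas as pd
import papermill as pm

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
pm.execute_notebook(
    "input.ipynb",
    "output.ipynb",
    # serialize by hand on the way in
    parameters={"df_json": df.to_json()},
)

# ...and in the notebook's parameters cell, deserialize by hand:
# df = pd.read_json(df_json)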

To be clear, this is not a proposal to automatically convert pandas objects (though I could certainly see an argument for that as well 😉), just to expose a way for users to add their own serialization methods. dask's approach to allowing custom serializers is a nice, straightforward example of something similar: https://distributed.dask.org/en/latest/serialization.html#id3

MSeal commented 5 years ago

You can register new translators easily enough. Notice at the bottom of that file we have a bunch of registration calls.
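
Those registration calls are module-level, so your own code can reuse the same pattern. A minimal sketch (assuming the papermill_translators registry exposed by translators.py; MyTranslator is a hypothetical subclass):

from papermill.translators import PythonTranslator, papermill_translators

class MyTranslator(PythonTranslator):
    """Hypothetical subclass; override the translate_* hooks as needed."""

# Same pattern as the registration calls at the bottom of translators.py;
# re-registering "python" should replace the stock translator for Python kernels.
papermill_translators.register("python", MyTranslator)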

https://papermill.readthedocs.io/en/latest/extending-overview.html describes how one can register new engines and IO plugins. But I noticed we're missing the equivalent of https://github.com/nteract/papermill/blob/718f39e9012bc2cc14a0706801e2f2e934b0c1b6/papermill/engines.py#L32-L38 for translators. If you wanted to make a PR adding that equivalent code to translators, you could then do:

from setuptools import setup, find_packages
setup(
    # all the normal setup.py arguments...
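    # entry point format: "<language name>=<importable module>:<Translator subclass>"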
    entry_points={"papermill.translators": ["python=translators:PandasPythonTranslator"]},
)

in your project to register the translator.
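
For reference, PandasPythonTranslator above would live in your own translators module; papermill doesn't ship one. A rough sketch of what it might look like, assuming the Translator.translate classmethod dispatcher linked earlier:

import pandas as pd
from papermill.translators import PythonTranslator

class PandasPythonTranslator(PythonTranslator):
    @classmethod
    def translate(cls, val):
        # Emit source that rebuilds the DataFrame from a JSON literal; the
        # generated code assumes the notebook already does `import pandas as pd`.
        if isinstance(val, pd.DataFrame):
            return "pd.read_json({!r})".format(val.to_json())
        # Everything else falls back to the stock Python behaviour.
        return super().translate(val)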

bnaul commented 5 years ago

@MSeal I started to take a stab at this, but I hadn't realized that we also inject all of the parameter values into the notebook metadata: https://github.com/nteract/papermill/blob/master/papermill/parameterize.py#L104 So at present, no matter what you do in the translator, any non-trivial parameter will fail with:

TypeError: Object of type DataFrame is not JSON serializable

Why exactly is there a need to store this extra copy of the parameters if they also appear in the injected-parameters cell? I assume I'm just missing something about the execution flow that makes this necessary, but unfortunately it seems to inherently limit parameters to simple JSON-compatible types.

MSeal commented 5 years ago

It was intended to give programmatic access to what parameters were set. You can't read what the user's input was from the cell itself if the cell has been manipulated. To make this work, we'd likely want either to add a placeholder object for non-JSON fields, or to catch the invalid JSON input and skip saving the parameters to the metadata.
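
A minimal sketch of the second option (catch the invalid JSON and skip the metadata copy), assuming parameterize.py records the raw values under nb.metadata.papermill; record_parameters is a hypothetical helper, not papermill code:

import json

def record_parameters(nb, parameters):
    # Probe whether the values would survive nbformat's JSON write.
    try:
        json.dumps(parameters)
    except TypeError:
        # Non-JSON-serializable values (e.g. a DataFrame) would break saving
        # the notebook, so skip the metadata copy instead of failing outright.
        return
    nb.metadata.papermill["parameters"] = parameters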