nteract / papermill

📚 Parameterize, execute, and analyze notebooks
http://papermill.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Running papermill on Jupytext's `.py` notebook #365

Open · hoangthienan95 opened this issue 5 years ago

hoangthienan95 commented 5 years ago

I'm running papermill on LSF, and for some reason I can run it locally fine, but when submitting the job it just gets stuck at 60% for 20 minutes. I used Jupytext to convert my notebook into a .py notebook and ran it like a script with default parameters, and it finished in less than 1 minute. I think this has to do with starting/running the kernel. What can I do to diagnose this?

Also, is it possible to use papermill to run Jupytext's converted .py file as a script, using the current env's Python interpreter instead of going through a Jupyter kernel? I usually develop scripts iteratively in Jupyter Notebook and then have to run them as Python scripts (partly because of the error seen above; our cluster is finicky). I really love papermill's ability to parametrize scripts without writing verbose argparse/CLI parser code, and it comes with YAML parsing. I can just document my arguments using markdown cells in the Jupyter notebook, which get turned into comments in the .py file by Jupytext.

It would be amazing if I could use Jupytext to convert my notebook, papermill to inject parameters (by parsing the tags from the .py file, where they appear as a comment, e.g. # + {"tags": ["parameters"]}), and then run it like a normal script. The parametrized script could be a temp file that is deleted after execution. Is it possible to do such a thing? Where should I look first to try to implement such a feature?
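
For reference, a parameters cell in a Jupytext light-format .py notebook looks roughly like this (the variable names below are illustrative):

# + {"tags": ["parameters"]}
alpha = 0.1              # default value; papermill overrides this at injection time
data_path = "input.csv"
# -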

Basically, it would be running papermill with --prepare-only, then converting to .py with Jupytext, then running the script with no parameters. Right now, if I use .py as the output file extension, papermill returns a file containing the notebook's JSON. How do I make it pipe directly into Jupytext and possibly execute it?

MSeal commented 5 years ago

Hmm, getting stuck for 20 minutes sounds odd. I've not encountered such a slowdown caused by papermill. Would you mind posting a notebook (maybe trimmed down?) that reproduces the issue? Does it happen if you just run papermill against the notebook locally? And finally, what's the environment where you're executing (i.e., the output of pip freeze)?

On the executing .py files with cell-like commenting, we don't have any support for that today. There was some talk at PyCon that this might be handy for hydrogen / vscode python cells being runnable by papermill, but no concrete plans to tackle the idea.

Today, if you want to parameterize a notebook but execute it as a pure Python module, I would do something like:

papermill --prepare-only nb.ipynb param_code.ipynb ...
jupyter nbconvert --to=python param_code.ipynb
python param_code.py

The nbconvert library is made for these conversions, and I work in that repo, so I'm more familiar with it than with Jupytext.
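
For anyone who prefers to script it, a rough Python-API equivalent of that sequence (a sketch; the file names and the alpha parameter are illustrative):

import papermill as pm
import nbformat
from nbconvert import PythonExporter

# inject parameters without executing any cells
pm.execute_notebook("nb.ipynb", "param_code.ipynb",
                    parameters={"alpha": 0.1}, prepare_only=True)

# convert the parameterized notebook to a plain Python script
nb = nbformat.read("param_code.ipynb", as_version=4)
source, _ = PythonExporter().from_notebook_node(nb)
with open("param_code.py", "w") as f:
    f.write(source)
# then run it with the current interpreter: python param_code.py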

mwouts commented 5 years ago

Hello @MSeal , I am Marc, the main author of Jupytext, and I landed here following @hoangthienan95 's links. I like the idea of using papermill for executing and parametrizing scripts, and I am planning to write a short documentation on that, cf. https://github.com/mwouts/jupytext/issues/231.

> On the executing .py files with cell-like commenting, we don't have any support for that today. There was some talk at PyCon that this might be handy for hydrogen / vscode python cells being runnable by papermill, but no concrete plans to tackle the idea.

May I mention that Jupytext, like the nbformat library, has a `reads` function? Compared with `nbformat.reads`, it requires an additional argument, `fmt`, which to a first approximation is the extension of the file being read.

So if you were interested in offering support for the hydrogen / vscode scripts, one way would be to remove the requirement that the extension be ipynb/json in PapermillIO.read, and to replace this line with

import os
import jupytext

fmt = os.path.splitext(notebook_path)[1]
nb = jupytext.reads(papermill_io.read(notebook_path), fmt=fmt, as_version=4)

Of course, I'd understand that you may not want to take on an additional dependency in papermill. So another way could be to allow input notebooks to be piped into papermill; I do not think this is currently possible. Is that something you would be interested in?

betatim commented 5 years ago

Maybe the existing plugin mechanism (https://papermill.readthedocs.io/en/latest/extending-entry-points.html#developing-new-i-o-handlers) is something to try out in order to support different input formats (right now they are mostly used to read from different kinds of storage). So there could be a jupytext IO engine.
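
For illustration, wiring a custom handler into papermill's I/O registry can be done directly in code (a sketch; TextFileHandler is hypothetical here, sketched later in this thread, and the register call mirrors how papermill registers its built-in s3:// and gs:// handlers — a plugin package would instead expose the same class through the "papermill.io" entry-point group described in the linked docs):

from papermill.iorw import papermill_io

# any path starting with txt:// is now routed to the custom handler
papermill_io.register("txt://", TextFileHandler)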

mwouts commented 5 years ago

Hi @betatim , indeed, that seems to be the right way to do it... No extra dependency on either side... I like this, and will give it a try when time permits. One more question: should the plugin be a separate Python package, or do you expect it to become part of Jupytext itself?

betatim commented 5 years ago

> One more question: should the plugin be a separate Python package, or do you expect it to become part of Jupytext itself?

No idea. I think having it be part of jupytext would make sense because then people who use jupytext "just have it".

MSeal commented 5 years ago

I agree with @betatim that the plugin system was made to support this. If you get that plugin working and it has good general value, I'd be open to adding the capability as an optional entrypoint dependency in papermill (like our azure / gcs / etc dependencies). We use this pattern to avoid requiring large external dependencies by default, while still making common usage patterns readily available via pip install papermill[my_extension_name].

If you want to try adding it to jupytext or a different python package I'd be glad to take a look and help with debugging issues you hit.

mwouts commented 5 years ago

Excellent! Thanks @MSeal .

We can test the idea of the extension at https://github.com/mwouts/papermill_jupytext. There, I have prepared a minimal package following the documentation.

I think a good plugin should not do much more than call jupytext.readf and jupytext.writef. But I am afraid that what I have prepared is not going to work... Is it correct that the read method is expected to return not the notebook itself, but a string with the JSON representation of the notebook? Similarly, can I assume that file_content in the write method is the notebook, or should I assume that it is only a JSON string?
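
For reference, converting between a notebook node and its JSON string is a one-liner with nbformat, so a handler could satisfy either contract; a minimal sketch (using jupytext.read, which replaced readf in later Jupytext versions):

import jupytext
import nbformat

nb = jupytext.read("script.py")     # NotebookNode read from a .py notebook
nb_str = nbformat.writes(nb)        # serialize to the JSON string papermill expects today
nb_back = nbformat.reads(nb_str, as_version=4)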

Also, is it possible to mix multiple extensions? I.e., if we do write a jupytext extension for papermill, will it be possible to load a .py notebook from s3? (I owe this question to @rgbkrk.)

mwouts commented 5 years ago

@MSeal , I have good news here: I have been able to code an extension that kind of works. You can give it a try on binder (see the README of https://github.com/mwouts/papermill_jupytext). You will be interested in the third section, the one that uses the experimental extension.

What I'd like to improve (each point is quoted and answered in MSeal's reply below):

MSeal commented 5 years ago
> Jupytext should set a kernel when doing the conversion of .py to .ipynb. Currently I have to specify it manually. I will work on that at mwouts/jupytext#230.

:+1:

> Papermill issues a warning because the file extension is neither .ipynb nor .json. I would like to deactivate that warning. Can I change that in the extension itself?

Not today. I'd have to think about the best way to control custom extension warnings. Make a separate issue in papermill and we can figure out what to do about registering new extensions with the iorw handlers.

> I do not like very much having to convert the notebook back to a JSON string in the extension. That was my question above: could I hand the notebook object to papermill in a more direct way?

We'd need to make a change in PapermillIO and in functions like load_notebook_node to handle non-text responses. It looks pretty easy to make those conditional on non-string return values.
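
A sketch of what that conditional might look like inside papermill's load_notebook_node (hypothetical; papermill_io is the module-level registry in papermill.iorw):

import nbformat
from papermill.iorw import papermill_io

def load_notebook_node(notebook_path):
    contents = papermill_io.read(notebook_path)
    if isinstance(contents, str):
        # today's contract: handlers return the notebook as a JSON string
        nb = nbformat.reads(contents, as_version=4)
    else:
        # proposed: a handler may return an already-parsed NotebookNode
        nb = contents
    return nb  # the real function also attaches papermill metadata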

> Also, there is the case of Jupytext notebooks on s3. Would s3://txt://script.py be meaningful?

Hmm, so txt:// would normally denote a scheme indicating where a file can be found, rather than a format. I would follow the URI docs about hierarchy assignment and suffix the path with ?fmt=txt rather than prefixing with txt://. This would make the path something like s3://bucket/script.py?fmt=txt. Technically s3 and other stores can have ? in a key, but specializing on the suffix pattern should be fine and not confusing to experienced users.

This relates to

> Also, is it possible to mix multiple extensions? I.e., if we do write a jupytext extension for papermill, will it be possible to load a .py notebook from s3? (I owe this question to @rgbkrk.)

Yes. You can manipulate the path, then call the root handler again with the new path. You have to protect against infinite recursion if it's implemented incorrectly, but we do this internally to handle short links that redirect to other s3 and http paths.

It'd follow something like this, after which any other registered iorw handler can be used in conjunction with your handler.

class TextFileHandler:
    @classmethod
    def read(cls, path):
        resolved_path, fmt = cls.extract_format(path)
        # delegate to whichever handler matches the resolved path (local, s3, ...)
        contents = papermill_io.get_handler(resolved_path).read(resolved_path)
        # ... manipulate contents into the nb object
        return nb_str  # or nb in the future

    @classmethod
    def write(cls, buf, path):
        resolved_path, fmt = cls.extract_format(path)
        # ... convert buf to the desired format
        text_buf = cls.convert_to_format(buf, fmt)
        return papermill_io.get_handler(resolved_path).write(text_buf, resolved_path)
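
For completeness, extract_format is left abstract above; one possible implementation, assuming the ?fmt= suffix convention suggested earlier (the default format is an assumption):

@classmethod
def extract_format(cls, path, default_fmt="ipynb"):
    # split a trailing ?fmt=<ext> query off the path, if present
    if "?fmt=" in path:
        resolved_path, fmt = path.rsplit("?fmt=", 1)
        return resolved_path, fmt
    return path, default_fmt
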
mwouts commented 5 years ago

@hoangthienan95, @MSeal , do you think that piping the jupytext notebook into papermill is a good answer to this question? I mean, with papermill from master, one could do something like

cat script.py | jupytext --to ipynb --set-kernel - | papermill -p param1 value1 > executed_notebook.ipynb

Note that the --set-kernel option is not required if the kernel is already encoded in the script's YAML header. Also, you could pass an explicit kernel name after the --set-kernel option (- means the kernel matching the current Python environment, or the script language).

Also, @MSeal , we initially discussed the possibility of implementing a papermill extension for this, but then we saw that the extension framework was focused on file stores, not on file extensions. Do you think it is still worth investigating extensions for this use case? Maybe it would be simpler to replace https://github.com/nteract/papermill/blob/4a86cf40318a529158e471b558d63fff032bf1b5/papermill/iorw.py#L376 with something like

_, ext = os.path.splitext(notebook_path)
if ext in ['.ipynb', '.json']:
    nb = nbformat.reads(papermill_io.read(notebook_path), as_version=4)
else:
    import jupytext
    nb = jupytext.reads(papermill_io.read(notebook_path), fmt=ext)

or any other variation that you would recommend?

betatim commented 5 years ago

Why not use the TextFileHandler class that was suggested in the previous comment for loading and saving Jupyter notebooks stored in a different format?

mwouts commented 5 years ago

> Why not use the TextFileHandler class that was suggested in the previous comment for loading and saving Jupyter notebooks stored in a different format?

Oh sure, indeed I could give that path another try - sorry, I had not noticed the difference from our first attempt at writing an extension for papermill.

mwouts commented 5 years ago

@betatim , I have updated my papermill_jupytext plugin, and even published a version 0.0.1.

It uses Jupytext to open URLs of the form txt://path/to/script.py. Nested handlers like s3://txt://script.py or txt://s3://script.py should work. And it takes care of injecting a kernel when none is found in the file (the kernel that matches the current interpreter for Python files if available, or the first kernel that matches the script language otherwise), so that executing the document really works (see the demo on Binder).
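
Usage through papermill's Python API would then look something like this (a sketch; the paths and the parameter are illustrative):

import papermill as pm

# the txt:// prefix routes the input through the papermill_jupytext handler
pm.execute_notebook("txt://path/to/script.py", "executed.ipynb",
                    parameters={"param1": "value1"})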

I did give implementing URLs of the form path/to/script.py?fmt=py, as @MSeal suggested above, a try. Unfortunately that requires a change in papermill.iorw.get_handler, as the handler is currently chosen based on the path prefix. Since we're discussing this, maybe my preference would be to simply register the TextFileHandler for all the non-ipynb notebook extensions supported by Jupytext (.R, .py, .md...). @MSeal, what are your thoughts on the subject? What would you think of registering a papermill extension named ext:py to be used for all .py files?

MSeal commented 5 years ago

Sorry, I missed this thread update earlier, @mwouts .

I think expanding the io handler to capture more situations natively sounds reasonable. The linked code is a bit hacky and could use a bit more formal structure. I think maybe the handlers should have optional functions for pattern matching io requests; this would allow for more sophisticated matches, like extension matching.

If you want to start another issue and/or PR I'm happy to help move that along.

nehalecky commented 4 years ago

Hey all, thanks for this thread. Has there been any more progress on this effort? @mwouts , do you think it's still worthwhile to try out your https://github.com/mwouts/papermill_jupytext/blob/master/README.md plugin? I know there have been a few releases of both jupytext and papermill since then, and I wanted some guidance before I dip in! Thanks again, appreciate it!

mwouts commented 4 years ago

Hi @nehalecky , would you like to try using pipes instead? See for instance this article for a (tested) example. Let us know if that is convenient for you. Otherwise I think the plugin should still work (let me know if not).