ploomber / soorgeon

Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://ploomber.io
Apache License 2.0

A utility to ensure the notebook runs #6

Closed edublancas closed 2 years ago

edublancas commented 2 years ago

Before refactoring a notebook, the user must ensure that the original notebook runs. We should have a command to check whether it works and suggest actions if there are errors.

e.g.,

soorgeon run nb.ipynb

If the notebook fails because of a ModuleNotFoundError: suggest creating a virtualenv and adding a requirements.txt with the package name

If it's another error, show the guide for debugging notebooks

If it's a function signature error, recommend downgrading some libraries
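The three cases above could be sketched as a small dispatch on the exception type. This is only an illustration: `suggest_fix` is a hypothetical name, the hint wording is made up, and treating `AttributeError`/`TypeError` as "function signature" problems is an assumption.

```python
def suggest_fix(exc: BaseException) -> str:
    """Return a human-readable hint for an exception raised by a notebook."""
    if isinstance(exc, ModuleNotFoundError):
        # exc.name holds the name of the module that could not be imported
        return (f"Create a virtualenv and add {exc.name!r} "
                "to requirements.txt")
    if isinstance(exc, (AttributeError, TypeError)):
        # Assumption: attribute/signature mismatches usually mean version drift
        return "A library version may be incompatible; try downgrading it"
    # Anything else: point the user at the guide for debugging notebooks
    return "See the guide for debugging notebooks"
```

The `ModuleNotFoundError` check comes first because it is the most specific case; everything unrecognized falls through to the generic debugging hint.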

An alternative approach would be to let the user convert anyway and then help them fix the issues in the Ploomber-generated pipeline; this way they can leverage incremental builds for rapid iterations. We can also suggest that they add a sample parameter.

idomic commented 2 years ago

@edublancas currently it just throws an error while refactoring right? I'm thinking we can put it as an independent command + coupled by default as part of the refactor, thoughts?

edublancas commented 2 years ago

No, it doesn't throw an error. The refactoring command would pass but the code won't run.

I think this could be semi-automated: run the notebook and try to fix as many problems as we can without user intervention. The problem is that it isn't simple to implement an efficient solution. For example, imagine that the notebook doesn't have a requirements.txt, and it breaks on each import package; we could capture the import errors and add the package names to requirements.txt but then we'd have to restart the execution of the notebook over and over again.
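As a rough sketch of the "capture the import error and add the package name" step, assuming the error is available as a Python exception object (`record_missing_package` is a hypothetical helper, not part of soorgeon). Note that the module name is not always the installable package name (for example, `sklearn` is installed as `scikit-learn`), and each captured error would still require restarting the notebook.

```python
from pathlib import Path


def record_missing_package(exc, requirements="requirements.txt"):
    """Append the module named in exc to requirements, skipping duplicates."""
    path = Path(requirements)
    text = path.read_text() if path.exists() else ""
    # exc.name may be dotted (e.g. 'pkg.sub'); keep only the top-level name
    name = (exc.name or "").split(".")[0]
    if name and name not in text.split():
        # Make sure the new entry starts on its own line
        prefix = "" if not text or text.endswith("\n") else "\n"
        path.write_text(text + prefix + name + "\n")
    return name
```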

I think we should leave this off right now as it isn't simple to fix.

idomic commented 2 years ago

As a first step, we need a command called test that runs the notebook prior to refactoring and makes sure it can run without errors. Then we can have multiple options there, like comparing results before and after refactoring, etc.

94rain commented 2 years ago

For notebooks, we can use either papermill {test_file_name.ipynb} {test_file_output.ipynb} (or the library function papermill.execute_notebook) or jupyter nbconvert --to notebook --execute mynotebook.ipynb and then check if the output contains any exception.

For Python files, we can use exec(open("filename.py").read()).
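A minimal sketch of the `exec` idea for plain Python files, returning the exception instead of letting it propagate. `run_script` is a hypothetical name, and this only covers scripts that are valid Python.

```python
def run_script(path):
    """Execute a .py file; return None on success or the exception raised."""
    with open(path, encoding="utf-8") as f:
        source = f.read()
    try:
        # compile() with the real filename gives readable tracebacks
        code = compile(source, str(path), "exec")
        # Run in a fresh namespace so the script cannot touch our globals
        exec(code, {"__name__": "__main__"})
    except Exception as exc:
        return exc
    return None
```

The caller can then inspect the type of the returned exception, e.g. `isinstance(result, ModuleNotFoundError)`.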


If it's another error, show the guide for debugging notebooks

Should we just put a link that points to https://docs.ploomber.io/en/latest/user-guide/debugging.html?

if it's a function signature, recommend downgrading some libraries

Do you mean AttributeError? For example when I run

import math
math.func()

It gives the error AttributeError: module 'math' has no attribute 'func'.

idomic commented 2 years ago

The way we currently do it in the ploomber package is to use nbconvert to go from .ipynb to .py and then execute the result. I think we should keep the same convention here (at least for now, for simplicity). @edublancas thoughts?

Should we just put a link that points to https://docs.ploomber.io/en/latest/user-guide/debugging.html?

Yes + probably show the error code.

Do you mean AttributeError? For example when I run

I think that's what he means. When dependency versions differ, there are usually missing functions or arguments, or a call expects an extra member; AttributeError is probably part of this set. @edublancas is that right?

edublancas commented 2 years ago

A few things:

On second thought, I think we should name this soorgeon test. The objective of this command is to help anyone fix their notebooks until they work properly. We'll start adding a bunch of features to this command, but for the time being, let's start with the basics of running the notebook.

To execute the notebook, let's use Ploomber. It already wraps papermill and supports .py files. Essentially, we'd wrap the notebook into a pipeline of a single task. The documentation has an example of how to do it with the Python API. This would cause soorgeon to have ploomber as a dependency but I think that's fine.

I think that's what he means. When dependency versions differ, there are usually missing functions or arguments, or a call expects an extra member; AttributeError is probably part of this set. @edublancas is that right?

Yes, attribute error is only one type of error. Another can be passing an argument that no longer exists. I'd say let's start with the executing logic and then we think of this error handling thing.

94rain commented 2 years ago

Can we run ploomber.tasks.NotebookRunner before refactoring the notebook?

from pathlib import Path
from ploomber import DAG
from ploomber.tasks import NotebookRunner
from ploomber.products import File
dag = DAG()
NotebookRunner(Path('nb.ipynb'), product=File('report.html'), dag=dag)

When I tried to run this on soorgeon/examples/exploratory/nb.ipynb, it reported that the notebook requires a cell tagged tags=["parameters"].

Looks like it only works for refactored notebook. Not sure if my understanding is correct.

idomic commented 2 years ago

Maybe try using papermill to avoid circular dependency? Unless we can utilize some of the other code in NotebookRunner (it doesn't seem we need anything besides executing it at the moment).

edublancas commented 2 years ago

When I tried to run this on soorgeon/examples/exploratory/nb.ipynb, it reported that the notebook requires a cell tagged tags=["parameters"].

good point. i forgot that ploomber requires the parameters cell.

so yeah, let's go ahead and use papermill. you'd need to add some extra logic if the input is a .py file, you can use jupytext to convert it into ipynb https://github.com/mwouts/jupytext
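A sketch of that extra logic for `.py` inputs, using `jupytext.read` and `jupytext.write`. The helper name `ensure_ipynb` and the choice to write the converted notebook next to the script are assumptions; as written, it would overwrite an existing notebook with the same name.

```python
from pathlib import Path


def ensure_ipynb(path):
    """Return a path to an .ipynb version of path, converting .py files."""
    path = Path(path)
    if path.suffix == ".ipynb":
        return path
    # Imported here so that .ipynb inputs work even without jupytext
    import jupytext

    notebook = jupytext.read(str(path))  # parse the script as a notebook
    target = path.with_suffix(".ipynb")  # assumption: same folder, same stem
    jupytext.write(notebook, str(target))  # format inferred from the extension
    return target
```

The returned path can then be handed to papermill regardless of the original input type.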

please also add both papermill and jupytext to the dependencies in setup.py

94rain commented 2 years ago

I think it would be easier to go the other way around and convert all .ipynb files to .py instead, which is more straightforward than converting .py to .ipynb, like what I did for the clean command. I can then catch exceptions with exec(). I created a draft PR #74.

idomic commented 2 years ago

Yeah, this route works as well! Please add a few tests.

94rain commented 2 years ago

The PR is now ready for review.

edublancas commented 2 years ago

I'm re-opening this since we should not use exec. Exec won't handle IPython magics, which are pretty common in notebooks. We need to use papermill for this.

I know that it'll make detecting errors more difficult but it's the only way to support magics.
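To illustrate the problem: IPython magics are not valid Python syntax, so a cell containing one cannot even be compiled, let alone passed to `exec`. The helper name `is_plain_python` is hypothetical.

```python
def is_plain_python(source):
    """Return True if source compiles as regular Python (no IPython syntax)."""
    try:
        compile(source, "<notebook cell>", "exec")
    except SyntaxError:
        return False
    return True


# A line magic and a shell escape are both rejected by the Python compiler
print(is_plain_python("%matplotlib inline\nx = 1"))  # False
print(is_plain_python("!pip install pandas"))        # False
print(is_plain_python("x = 1"))                      # True
```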

94rain commented 2 years ago

Then we can run papermill nb.ipynb {output_notebook_path} --log-output and check if the exception name can be found in the output instead.

How should we handle {output_notebook_path}? Should we just discard it (/dev/null on Linux/macOS or NUL on Windows)?

idomic commented 2 years ago

Then we can run papermill nb.ipynb {output_notebook_path} --log-output and check if the exception name can be found in the output instead.

That should work! We can use a local location or the ~/.ploomber dir and clean this once we're done running.

Should we just discard it (/dev/null on Linux/macOS or NUL on Windows)?

Can you clarify?

edublancas commented 2 years ago

I think it'd be best to write the executed notebook to the same folder, because the user might be interested in looking at the traceback to debug it (and this is better than overwriting the original notebook).

Example:

if there's an error we can tell the user: errors happened, check path/to/notebook-soorgeon-test.ipynb
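A sketch of this flow using `papermill.execute_notebook` and `PapermillExecutionError`. The function names are hypothetical, and deriving the output name from the input stem (`nb.ipynb` becomes `nb-soorgeon-test.ipynb`) is an assumption about the intended naming.

```python
from pathlib import Path


def output_path_for(notebook):
    """Return the sibling path where the executed copy is written."""
    notebook = Path(notebook)
    return notebook.with_name(f"{notebook.stem}-soorgeon-test.ipynb")


def run_notebook(notebook):
    """Execute notebook with papermill; return True if it ran cleanly."""
    # Imported here so output_path_for() can be used on its own
    import papermill as pm
    from papermill.exceptions import PapermillExecutionError

    output = output_path_for(notebook)
    try:
        pm.execute_notebook(str(notebook), str(output))
    except PapermillExecutionError as exc:
        # ename/evalue carry the exception type and message from the kernel
        print(f"Errors happened ({exc.ename}: {exc.evalue}), check {output}")
        return False
    return True
```

papermill still writes the output notebook when a cell fails, which is what lets the user open it and read the traceback.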

check if the exception name can be found in the output instead.

yeah, I think something like:

if 'ValueError' in error_string:
    # do stuff

will work. I'm unsure if there's a better way since we'll get the traceback as a string
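One possible refinement of the substring check, sketched under the assumption that the traceback string ends with a line of the form `ExceptionName: message`: reading only the final line avoids a false match when, say, the text 'ValueError' appears inside another exception's message. The helper name is hypothetical, and multi-line exception messages would defeat it.

```python
import re

# IPython tracebacks are colorized; strip ANSI color codes before parsing
ANSI = re.compile(r"\x1b\[[0-9;]*m")


def exception_name(traceback_text):
    """Return the exception class name from the last line of a traceback."""
    clean = ANSI.sub("", traceback_text)
    lines = [line for line in clean.splitlines() if line.strip()]
    if not lines:
        return None
    # The final line looks like "ValueError: message" or "KeyboardInterrupt"
    name = lines[-1].split(":", 1)[0].strip()
    # Drop a module prefix such as "json.decoder.JSONDecodeError"
    return name.rsplit(".", 1)[-1]
```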

94rain commented 2 years ago

For .py files, after we convert them into temp notebook files, do we also want to save the notebook execution output notebook-soorgeon-test.ipynb to the current working directory?

Also we need to set a kernel_name for papermill.execute_notebook() in case the metadata is missing, shall we just use kernel_name='python3'?
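A sketch of the fallback, reading the kernel name from the notebook's `metadata.kernelspec.name` field and defaulting to `python3` when it is absent. `kernel_name_for` is a hypothetical helper; the result could be passed as `kernel_name=` to `papermill.execute_notebook()`.

```python
import json


def kernel_name_for(notebook_path, default="python3"):
    """Return the kernel name stored in the notebook, or default if missing."""
    with open(notebook_path, encoding="utf-8") as f:
        nb = json.load(f)
    # Notebooks converted from .py files may lack the kernelspec metadata
    kernelspec = nb.get("metadata", {}).get("kernelspec", {})
    return kernelspec.get("name", default)
```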

idomic commented 2 years ago

For .py files, after we convert them into temp notebook files, do we also want to save the notebook execution output notebook-soorgeon-test.ipynb to the current working directory?

Good point, I think the user should control it via a parameter. I can see how it'd be useful in case the notebook doesn't run.

kernel_name

I think that's fine, this is how it's done in Binder: Python 3 (ipykernel)

idomic commented 2 years ago

@94rain @edublancas I don't see any docstring?

edublancas commented 2 years ago

what docstring?

94rain commented 2 years ago

Do the following lines count?

https://github.com/ploomber/soorgeon/blob/b5012df42725f8e7238dc3146215cc5fd6889a79/src/soorgeon/cli.py#L128-L140

idomic commented 2 years ago

Sorry, docstring can mean tons of things. I meant a changelog entry with a docstring for the release notes.

edublancas commented 2 years ago

ah. I'll edit the changelog, but yeah for other contributions let's add it