edublancas closed this issue 2 years ago.
@edublancas currently it just throws an error while refactoring, right? I'm thinking we can make it an independent command, coupled by default as part of the refactor. Thoughts?
No, it doesn't throw an error. The refactoring command would pass but the code won't run.
I think this could be semi-automated: run the notebook and try to fix as many problems as we can without user intervention. The problem is that it isn't simple to implement an efficient solution. For example, imagine that the notebook doesn't have a requirements.txt and breaks on each package import; we could capture the import errors and add the package names to requirements.txt, but then we'd have to restart the execution of the notebook over and over again.
I think we should leave this off right now as it isn't simple to fix.
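For the record, the retry loop described above could be sketched roughly like this. This is a hedged toy sketch, not soorgeon's implementation: the real flow would add each name to requirements.txt and install it before re-running, whereas this version only stubs the missing module so the next round reaches the next failure (`collect_missing_packages` is a hypothetical name):

```python
import sys
import types


def collect_missing_packages(source, max_rounds=20):
    """Run ``source``; each time an import fails, record the missing module
    name, stub it out, and restart from the top. A real implementation
    would pip-install the package instead of stubbing it."""
    missing = []
    for _ in range(max_rounds):
        try:
            exec(source, {})
            break  # ran to completion
        except ModuleNotFoundError as e:
            missing.append(e.name)
            # stub the module so the next round gets past this import
            sys.modules[e.name] = types.ModuleType(e.name)
    return missing
```

This makes the cost visible: each missing package forces a full re-execution from the top, which is exactly why the approach is expensive.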
As a first step, we need a command called test to run the notebook prior to refactoring and make sure it runs without errors. Then we can add multiple options there, like matching before/after refactoring, etc.
For notebooks, we can use either papermill {test_file_name.ipynb} {test_file_output.ipynb} (or the library function papermill.execute_notebook) or jupyter nbconvert --to notebook --execute mynotebook.ipynb and then check if the output contains any exception.
For Python files, we can use exec(open("filename.py").read()).
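A minimal sketch of that exec-based check (the helper name is hypothetical; it catches the exception and reports its class name instead of letting it propagate):

```python
def run_python_file(path):
    """Execute a .py file via exec, as described above, and return the name
    of the exception it raised, or None if it ran cleanly."""
    with open(path) as f:
        source = f.read()
    try:
        # compile with the real filename so tracebacks point at the file
        exec(compile(source, path, "exec"), {"__name__": "__main__"})
    except Exception as e:
        return type(e).__name__  # e.g. 'ValueError'
    return None
```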
> If it's another error, show the guide for debugging notebooks
Should we just put a link that points to https://docs.ploomber.io/en/latest/user-guide/debugging.html?
> if it's a function signature, recommend downgrading some libraries
Do you mean AttributeError? For example, when I run

```python
import math
math.func()
```

it gives the error AttributeError: module 'math' has no attribute 'func'.
The way we do it currently with the ploomber package is using nbconvert to go from ipynb to .py and then executing it. I think we should keep the same convention here (at least for now, for simplicity). @edublancas thoughts?
> Should we just put a link that points to https://docs.ploomber.io/en/latest/user-guide/debugging.html?
Yes + probably show the error code.
> Do you mean AttributeError? For example when I run
I think that's what he means: with dependency discrepancies there are usually missing functions or arguments, or an extra member is expected; AttributeError is probably part of that set. @edublancas is that right?
A few things:
On second thought, I think we should name this soorgeon test. The objective of this command is to help anyone fix their notebooks until they run properly. We'll start adding a bunch of features to this command, but for the time being, let's start with the basics of running the notebook.
To execute the notebook, let's use Ploomber. It already wraps papermill and supports .py files. Essentially, we'd wrap the notebook into a pipeline of a single task. The documentation has an example of how to do it with the Python API. This would cause soorgeon to have ploomber as a dependency but I think that's fine.
> I think that what he means, usually on dependency discrepancies, there are missing functions, arguments, or it expects an extra member, AttributeError is probably a part of this set. @edublancas is that right??
Yes, AttributeError is only one type of error. Another can be passing an argument that no longer exists. I'd say let's start with the execution logic and then think about the error handling.
Can we run ploomber.tasks.NotebookRunner before refactoring the notebook?
```python
from pathlib import Path

from ploomber import DAG
from ploomber.tasks import NotebookRunner
from ploomber.products import File

dag = DAG()
NotebookRunner(Path('nb.ipynb'), product=File('report.html'), dag=dag)
dag.build()  # execute the notebook
```
When I tried to run this on soorgeon/examples/exploratory/nb.ipynb, it output an error saying the notebook requires a cell tagged with tags=["parameters"]. Looks like it only works for refactored notebooks; not sure if my understanding is correct.
Maybe try using papermill to avoid a circular dependency? Unless we can utilize some of the other code in NotebookRunner (it doesn't seem we need anything besides executing the notebook at the moment).
> When I tried to run this on soorgeon/examples/exploratory/nb.ipynb, it will output that it requires a tags=["parameters"] section.
Good point. I forgot that Ploomber requires the parameters cell.
So yeah, let's go ahead and use papermill. You'd need to add some extra logic if the input is a .py file; you can use jupytext to convert it into ipynb: https://github.com/mwouts/jupytext
please also add both papermill and jupytext to the dependencies in setup.py
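That conversion step could look something like this. A hedged sketch under the assumption that jupytext is installed; `to_ipynb` is a hypothetical helper name, not soorgeon's actual code:

```python
from pathlib import Path


def to_ipynb(path):
    """If ``path`` is a .py file, write an .ipynb copy next to it using
    jupytext and return the new path; otherwise return the path as-is."""
    path = Path(path)
    if path.suffix != ".py":
        return path  # already a notebook
    import jupytext  # lazy import; assumes jupytext is installed
    nb = jupytext.read(path)           # parse the .py file as a notebook
    out = path.with_suffix(".ipynb")
    jupytext.write(nb, out)            # save a regular .ipynb copy
    return out
```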
I think it would be easier to go the other way around and convert all .ipynb files to .py instead, which is more straightforward than converting .py to .ipynb like I did for the clean command, and I can catch exceptions with exec(). I created a draft PR #74.
Yeah, this route works as well! Please add a few tests.
The PR is now ready for review.
I'm re-opening this since we should not use exec. exec won't handle IPython magics, which are pretty common in notebooks. We need to use papermill for this. I know that it'll make detecting errors more difficult, but it's the only way to support magics.
Then we can run papermill nb.ipynb {output_notebook_path} --log-output and check if the exception name can be found in the output instead. How should we handle {output_notebook_path}? Should we just discard it (/dev/null for Linux/Mac or NUL for Windows)?
> Then we can run papermill nb.ipynb {output_notebook_path} --log-output and check if the exception name can be found in the output instead.
That should work! We can use a local location or the ~/.ploomber dir and clean it up once we're done running.
> Should we just abandon it (/dev/null for Linux/mac or NUL for win)
Can you clarify?
I think it'd be best to run the notebook in the same folder, because the user might be interested in looking at the traceback to debug it (and this is better than overwriting the original notebook).
Example:
path/to/notebook.ipynb
path/to/notebook-soorgeon-test.ipynb
If there's an error, we can tell the user: errors happened, check path/to/notebook-soorgeon-test.ipynb
> check if the exception name can be found in the output instead.
Yeah, I think something like:

```python
if 'ValueError' in error_string:
    # do stuff
```

will work. I'm unsure if there's a better way since we'll get the traceback as a string.
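A slightly more general version of that string check could scan for the final line of the traceback instead of hard-coding one class. This is only a sketch and assumes the exception class name ends in Error or Exception and starts a line in the captured output, which is how CPython formats the last traceback line (but names like KeyboardInterrupt would slip through):

```python
import re


def exception_name(log_text):
    """Return the last 'SomethingError'/'SomethingException' that starts a
    line in the captured papermill output, or None if none is found."""
    last = None
    for m in re.finditer(r"^(\w+(?:Error|Exception))\b", log_text, re.MULTILINE):
        last = m.group(1)
    return last
```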
For .py files, after we convert them into temp notebook files, do we also want to save the notebook execution output notebook-soorgeon-test.ipynb to the current working directory? Also, we need to set a kernel_name for papermill.execute_notebook() in case the metadata is missing; shall we just use kernel_name='python3'?
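Regarding the kernel_name fallback, it could be sketched like this (helper names are hypothetical; papermill.execute_notebook does accept a kernel_name argument, and papermill must be installed for the second function to run):

```python
def execute_kwargs(nb_in, nb_out, default_kernel="python3"):
    """Build the arguments for papermill.execute_notebook, falling back to
    the 'python3' kernel when the notebook metadata has no kernelspec."""
    return dict(input_path=nb_in, output_path=nb_out,
                kernel_name=default_kernel)


def run_notebook(nb_in, nb_out):
    import papermill as pm  # lazy import; assumes papermill is installed
    return pm.execute_notebook(**execute_kwargs(nb_in, nb_out))
```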
> For .py files, after we convert them into temp notebook files, do we also want to save the notebook execution output notebook-soorgeon-test.ipynb to the current working directory?
Good point, I think it should be controlled by the user via a parameter. I can see how it'll be useful in case the notebook doesn't run.
> kernel_name
I think that's fine; this is how it's done in Binder: Python 3 (ipykernel)
@94rain @edublancas I don't see any docstring?
what docstring?
Do the following lines count?
Sorry, docstring can mean tons of things. I meant a changelog entry with a docstring for the release notes.
ah. I'll edit the changelog, but yeah for other contributions let's add it
Before refactoring a notebook, the user must ensure that the original notebook runs. We should have a command to check whether it works and suggest actions if there are errors.
e.g.,
- If the notebook fails because of a ModuleNotFoundError: suggest creating a virtualenv and adding a requirements.txt with the package name
- If it's another error, show the guide for debugging notebooks
- If it's a function signature mismatch, recommend downgrading some libraries
An alternative approach would be to let the user convert anyway and then help them fix the issues in the Ploomber-generated pipeline; this way they can leverage incremental builds for rapid iteration. We can also suggest they add a sample parameter.