ploomber / soorgeon

Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://ploomber.io
Apache License 2.0
78 stars 20 forks source link

Add support for `soorgeon clean` command #52

Closed 94rain closed 2 years ago

94rain commented 2 years ago
94rain commented 2 years ago

Do we still want to keep the following, or just move forward with only the filepath?

soorgeon clean {task-name}
edublancas commented 2 years ago

just filepath

click has some utilities to validate them: https://click.palletsprojects.com/en/8.1.x/arguments/#file-path-arguments

94rain commented 2 years ago

I pushed the changes but the integration tests are failing (Kaggle Unauthorized error). I am looking into the issue.

edublancas commented 2 years ago

that's ok, those are failing because you're missing an API key, no need to do anything. I'll review your code later

edublancas commented 2 years ago

hi i found an issue: when cleaning a ipynb file, the notebook's metadata gets lost.

to reproduce:

ploomber examples -n templates/ml-basic -o ml
cd ml
pip install requirements.txt

# reformat fit.py -> fit.ipynb
ploomber nb -f ipynb

# check that the pipeline runs
ploomber build -f

# clean
soorgeon clean fit.ipynb

# now it's broken
ploomber build -f
Error: Notebook does not contain kernelspec metadata and kernelspec_name was not specified, either add kernelspec info to your source file or specify a kernelspec by name. To see list of installed kernels run "jupyter kernelspec list" in the terminal (first column indicates the name). Python is usually named "python3", R usually "ir"

ipynb files store metadata that tells jupyter which kernel to use, looks like that data is not preserved. I think jupytext has some settings for this, so we need to ensure that when reading and writing the file we're keeping the metadata

please also add a test case: compare the original file's metadata with the metadata in the clean version and ensure they match

94rain commented 2 years ago

Based on the documentation,

notebook_metadata_filter: By default, Jupytext only exports the kernelspec and jupytext metadata to the text files.

Not sure why it is not preserved. Looking into it.

94rain commented 2 years ago

I was not able to run the given example. (my ploomber version is 0.19.5.dev0).

Error message

```shell Loading pipeline... fatal: bad revision 'HEAD' Executing: 100%|███████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:04<00:00, 2.23cell/s] Building task 'fit': 100%|█████████████████████████████████████████████████████████████████████████████████| 4/4 [00:19<00:00, 4.78s/it] =========================================================== DAG build failed ============================================================ -------------- NotebookRunner: fit -> MetaProduct({'model': File('output\\model.pickle'), 'nb': File('output\\nb.html')}) --------------- ------------------------------------------------------- C:\su22\test\ml\fit.ipynb ------------------------------------------------------- Traceback (most recent call last): File "C:\softwares\miniconda3\envs\ploomber\lib\site-packages\nbformat\reader.py", line 18, in parse_json nb_dict = json.loads(s, **kwargs) File "C:\softwares\miniconda3\envs\ploomber\lib\json\__init__.py", line 346, in loads return _default_decoder.decode(s) File "C:\softwares\miniconda3\envs\ploomber\lib\json\decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "C:\softwares\miniconda3\envs\ploomber\lib\json\decoder.py", line 353, in raw_decode obj, end = self.scan_once(s, idx) json.decoder.JSONDecodeError: Expecting ',' delimiter: line 190 column 411 (char 5533) The above exception was the direct cause of the following exception: Traceback (most recent call last): File "c:\su22\ploomber\src\ploomber\tasks\abc.py", line 554, in _build res = self._run() File "c:\su22\ploomber\src\ploomber\tasks\abc.py", line 663, in _run self.run() File "c:\su22\ploomber\src\ploomber\tasks\notebook.py", line 833, in run self._converter.convert() File "c:\su22\ploomber\src\ploomber\tasks\notebook.py", line 199, in convert self._from_ipynb(self._path_to_output, self._exporter, File "c:\su22\ploomber\src\ploomber\tasks\notebook.py", line 254, in _from_ipynb nb = nbformat.reads(path.read_text(), as_version=nbformat.NO_CONVERT) File "C:\softwares\miniconda3\envs\ploomber\lib\site-packages\nbformat\__init__.py", line 88, in reads nb = reader.reads(s, **kwargs) File "C:\softwares\miniconda3\envs\ploomber\lib\site-packages\nbformat\reader.py", line 72, in reads nb_dict = parse_json(s, **kwargs) File "C:\softwares\miniconda3\envs\ploomber\lib\site-packages\nbformat\reader.py", line 21, in parse_json raise NotJSONError(("Notebook does not appear to be JSON: %r" % s)[:77] + "...") from e nbformat.reader.NotJSONError: Notebook does not appear to be JSON: '{\n "cells": [\n {\n "cell_type": "m... ploomber.exceptions.TaskBuildError: Error building task "fit" =========================================================== Summary (1 task) ============================================================ NotebookRunner: fit -> MetaProduct({'model': File('output\\model.pickle'), 'nb': File('output\\nb.html')}) =========================================================== DAG build failed ============================================================ ```

Instead, I ran the example in examples/machine-learning. The kernelspec info is indeed still there. For clean.ipynb, when I convert it to py file, looks like isort is sorting the files with changes to the position of tags: image

We need to find a way to prevent isort from doing this or just get rid of isort.

edublancas commented 2 years ago

I was not able to run the given example. (my ploomber version is 0.19.5.dev0).

interesting, can you share the steps you followed? I'll try to reproduce it

We need to find a way to prevent isort from doing this or just get rid of isort.

good catch. moving that tag isn't really an issue but others are important, can you run a quick test and tell me what happens with the parameters tag. in any example, you'll see something like this:

# %% tags=["parameters"]
upstream = None
product = None

add an import statement below the tag:

import math

# %% tags=["parameters"]
import pandas as pd
upstream = None
product = None

then apply the clean, does the import move with the tag? I'm guessing it'll yield something like:

import math

# %% tags=["parameters"]
import pandas as pd

# ...other imports?

upstream = None
product = None
94rain commented 2 years ago

The steps I followed (on Windows):

conda activate soorgeon
ploomber examples -n templates/ml-basic -o ml
cd ml
pip install -r requirements.txt
ploomber nb -f ipynb
ploomber build -f

Then it gives me the error message:

Error message

```shell Loading pipeline... fatal: bad revision 'HEAD' Executing: 100%|███████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:04<00:00, 2.23cell/s] Building task 'fit': 100%|█████████████████████████████████████████████████████████████████████████████████| 4/4 [00:19<00:00, 4.78s/it] =========================================================== DAG build failed ============================================================ -------------- NotebookRunner: fit -> MetaProduct({'model': File('output\\model.pickle'), 'nb': File('output\\nb.html')}) --------------- ------------------------------------------------------- C:\su22\test\ml\fit.ipynb ------------------------------------------------------- Traceback (most recent call last): File "C:\softwares\miniconda3\envs\ploomber\lib\site-packages\nbformat\reader.py", line 18, in parse_json nb_dict = json.loads(s, **kwargs) File "C:\softwares\miniconda3\envs\ploomber\lib\json\__init__.py", line 346, in loads return _default_decoder.decode(s) File "C:\softwares\miniconda3\envs\ploomber\lib\json\decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "C:\softwares\miniconda3\envs\ploomber\lib\json\decoder.py", line 353, in raw_decode obj, end = self.scan_once(s, idx) json.decoder.JSONDecodeError: Expecting ',' delimiter: line 190 column 411 (char 5533) The above exception was the direct cause of the following exception: Traceback (most recent call last): File "c:\su22\ploomber\src\ploomber\tasks\abc.py", line 554, in _build res = self._run() File "c:\su22\ploomber\src\ploomber\tasks\abc.py", line 663, in _run self.run() File "c:\su22\ploomber\src\ploomber\tasks\notebook.py", line 833, in run self._converter.convert() File "c:\su22\ploomber\src\ploomber\tasks\notebook.py", line 199, in convert self._from_ipynb(self._path_to_output, self._exporter, File "c:\su22\ploomber\src\ploomber\tasks\notebook.py", line 254, in _from_ipynb nb = nbformat.reads(path.read_text(), as_version=nbformat.NO_CONVERT) File "C:\softwares\miniconda3\envs\ploomber\lib\site-packages\nbformat\__init__.py", line 88, in reads nb = reader.reads(s, **kwargs) File "C:\softwares\miniconda3\envs\ploomber\lib\site-packages\nbformat\reader.py", line 72, in reads nb_dict = parse_json(s, **kwargs) File "C:\softwares\miniconda3\envs\ploomber\lib\site-packages\nbformat\reader.py", line 21, in parse_json raise NotJSONError(("Notebook does not appear to be JSON: %r" % s)[:77] + "...") from e nbformat.reader.NotJSONError: Notebook does not appear to be JSON: '{\n "cells": [\n {\n "cell_type": "m... ploomber.exceptions.TaskBuildError: Error building task "fit" =========================================================== Summary (1 task) ============================================================ NotebookRunner: fit -> MetaProduct({'model': File('output\\model.pickle'), 'nb': File('output\\nb.html')}) =========================================================== DAG build failed ============================================================ ```

For the quick test, yes your guess is correct. image

edublancas commented 2 years ago

great work!