ploomber / soorgeon

Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://ploomber.io
Apache License 2.0

`soorgeon clean` command deletes existing output #81

Closed · edublancas closed this 2 years ago

edublancas commented 2 years ago

when running `soorgeon clean` we convert the notebook to .py and then back to .ipynb; the problem is that cell outputs are lost. What we want instead is to keep the outputs as they were. I think there might be some challenges, since running isort may cause the cleaned notebook to have extra cells (and potentially running black, too), so we need to find a solution. I think the new Jupyter notebook format adds unique cell ids, so we can use those (if the notebook does not have cell ids, maybe loading it with `nbformat.read` will add them). Worst case: we create our own cell ids.
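As a sketch of that worst-case fallback (creating our own cell ids), with the notebook represented here as a plain dict mirroring the .ipynb JSON rather than an `nbformat` object:

```python
import uuid

# a notebook as a plain dict, mirroring the .ipynb JSON structure
# (stand-in for an nbformat.NotebookNode)
nb = {
    "cells": [
        {"cell_type": "code", "source": "import b\nimport a\n"},
        {"cell_type": "markdown", "source": "# title"},
    ]
}

# assign our own unique id to any cell that lacks one, so cells can be
# matched up again after the .ipynb -> .py -> .ipynb round trip
for cell in nb["cells"]:
    cell.setdefault("id", uuid.uuid4().hex[:8])

ids = [cell["id"] for cell in nb["cells"]]
```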

94rain commented 2 years ago

Why do we need unique cell ids?

I am thinking about sorting imports cell by cell using `isort.code("import b\nimport a\n")`:

```python
import nbformat
import isort

nb = nbformat.read('nb.ipynb', as_version=4)

# sort imports in each code cell individually, so the cell count is unchanged
for cell in nb.cells:
    if cell.cell_type == 'code':
        cell.source = isort.code(cell.source)
```

edublancas commented 2 years ago

applying isort cell-by-cell will solve the issue, since it guarantees that the number of cells stays the same, but it might be too slow; try benchmarking with a notebook of, say, 100 cells, and compare the runtime of applying it all at once vs cell-by-cell.
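A minimal sketch of that comparison; `fake_sort` is a stand-in for `isort.code` (swap in `isort.code` if isort is installed, to measure the real thing):

```python
import time

def fake_sort(source):
    # stand-in for isort.code: sorts import lines alphabetically
    return "\n".join(sorted(source.splitlines())) + "\n"

cells = ["import b\nimport a"] * 100  # stand-in for a 100-cell notebook

start = time.perf_counter()
all_at_once = fake_sort("\n".join(cells))
t_once = time.perf_counter() - start

start = time.perf_counter()
cell_by_cell = [fake_sort(cell) for cell in cells]
t_each = time.perf_counter() - start

print(f"all at once: {t_once:.4f}s, cell by cell: {t_each:.4f}s")
```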

94rain commented 2 years ago

```python
import ploomber
from ploomber.util.default import extract_name
from ploomber.telemetry import telemetry
from math import log
from ploomber.cli.parsers import CustomParser
from ploomber.dag.plot import choose_backend
import numpy as np
import click
from ploomber.cli.io import cli_endpoint
from pathlib import Path
```

I tried duplicating the above cell 1000 times; cell-by-cell isort completes in only 1.284s.

https://colab.research.google.com/drive/1LAmzGmIDVmxmrD2Zvs-HEcPLQIWgValR?usp=sharing (download the notebook and upload it as test.ipynb)

idomic commented 2 years ago

Try also running it at scale over cells that are misaligned and require lots of isort work. If it stays around the same time, that should work. Regardless, you can start a PR.

94rain commented 2 years ago

For 250 lines of imports * 1000 cells, it takes 6.38s.

edublancas commented 2 years ago

Ok, I thought through this.

isort is giving us a lot of trouble because it can move cells around; running it cell-by-cell is also troublesome because of the runtime. It's possible to fix this by adding unique cell identifiers, but I thought of a better way:

let's turn isort off for now and just make it work with black. black should not generate new cells, so keeping the output intact should be fine. Once we have that working, we can tackle the isort problem.

I think what we can do is parse the notebook's content and move all the imports to the top cell, then apply isort there. This is also better practice: keeping imports at the top beats having them scattered around the notebook.
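A rough sketch of that idea, assuming imports are plain top-level `import`/`from` lines (real code would need proper parsing, e.g. with `ast`):

```python
# two toy cells, with imports mixed into the code
cells = [
    "import b\nimport a\nx = 1",
    "from math import log\ny = log(2)",
]

imports, body_cells = [], []
for src in cells:
    kept = []
    for line in src.splitlines():
        # naive detection of import lines; good enough for a sketch
        if line.startswith(("import ", "from ")):
            imports.append(line)
        else:
            kept.append(line)
    body_cells.append("\n".join(kept))

# one top cell with all (deduplicated, sorted) imports, then the rest;
# isort would then only need to run on new_cells[0]
new_cells = ["\n".join(sorted(set(imports)))] + body_cells
```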

thoughts @idomic ?

idomic commented 2 years ago

I like this optimization, and we already have black integrated. Let's try this approach out.

94rain commented 2 years ago

Could moving all imports to the first cell cause any unintended effects?

For example, imports inside `with` statements or functions:

```python
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore', FutureWarning)
    from papermill.parameterize import parameterize_notebook
```

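One way to see the distinction: walking only the module's top level with `ast` finds the imports that are safe to hoist, while the import nested inside the `with` block would be missed (or moved incorrectly) by a naive hoisting pass. A small sketch:

```python
import ast

# the example from the comment above, as a source string (never executed,
# so papermill does not need to be installed)
src = """
import warnings
with warnings.catch_warnings():
    warnings.simplefilter('ignore', FutureWarning)
    from papermill.parameterize import parameterize_notebook
"""

tree = ast.parse(src)

# imports anywhere in the cell, including inside the `with` block
all_imports = [n for n in ast.walk(tree)
               if isinstance(n, (ast.Import, ast.ImportFrom))]

# imports that would be safe to move: only those at the top level
top_level = [n for n in tree.body
             if isinstance(n, (ast.Import, ast.ImportFrom))]
```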
edublancas commented 2 years ago

@94rain: true, let's not move imports then.

I realized black added support for jupyter notebooks already 🤯 https://github.com/psf/black#installation

So I think let's just call whatever they implemented and drop isort, since implementing it properly would take more time.

Right now, then, `soorgeon clean` won't be that useful, since it'll only call black, but we can work on other improvements later.

94rain commented 2 years ago

I removed isort.

Also, I updated black to `black[jupyter]` and removed the temp-file creation step, but I found that markdown files are still not supported by black, which causes the following test to fail.

https://github.com/ploomber/soorgeon/blob/9c5d86febc1925ae2f3bdd6d59ec2ff1a543ec78/tests/test_cli.py#L457-L476

I updated the temp-file creation condition to .md only (`create_temp = task_file.suffix == ".md"`).

Initial PR opened at #83.

edublancas commented 2 years ago

> but I found that markdown files are still not supported by black, causing the following test to fail

here it's better to convert everything to ipynb using jupytext: it'll handle md, ipynb, and py in percent format, plus any other format supported by jupytext.

I'm guessing black's integration with notebooks takes an ipynb as input, right?

Also, please pin black's version: notebook support was added recently and we want to ensure users have a recent version, so check black's changelog and edit setup.py.

94rain commented 2 years ago

> I'm guessing black's integration with notebooks takes an ipynb as input, right?

Yes. With `black[jupyter]`, we can run `black nb.ipynb` directly.

> here it's better to convert everything to ipynb using jupytext

The current workflow is to convert ipynb to py. Do we want to do py->ipynb instead?

Sure I will pin a recent version for black.

edublancas commented 2 years ago

ok, so first some clarifications: jupytext allows representing Jupyter notebooks in several formats (.py, .md, etc.), and we want to support all of them (we can do that easily with jupytext). So when I say "a notebook", I mean any file that jupytext can read as a notebook.

> Do we want to do py->ipynb instead?

yes, because the .py file might contain magics, and if we just apply regular black, it won't work. If we convert the .py to .ipynb and then apply black (the version that takes a notebook as input), magics will be handled appropriately.

the only edge case we need to handle is when the input file is not an actual notebook, for example a .py that does not have the `# %%` markers that represent notebook cell splits. I think we can use this:

https://github.com/mwouts/jupytext/blob/12964ef42783722815cd5d12df47d5ce67c1b323/jupytext/formats.py#L302

to determine whether the input file is valid: if so, convert it into a notebook and apply black for notebooks; if it's invalid but is a .py file, apply regular black; if it's invalid and not .py, throw an error.
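A hedged sketch of that dispatch; `looks_like_notebook` is a hypothetical stand-in for the jupytext format check linked above, and the returned strings are placeholders for the real actions:

```python
from pathlib import Path

def looks_like_notebook(path: Path) -> bool:
    # hypothetical stand-in for jupytext's format detection: .ipynb always
    # counts; a .py counts only when it has percent cell markers
    if path.suffix == ".ipynb":
        return True
    return path.suffix == ".py" and "# %%" in path.read_text()

def choose_action(path: Path) -> str:
    if looks_like_notebook(path):
        return "black-for-notebooks"  # convert to ipynb first, then black
    if path.suffix == ".py":
        return "plain-black"          # regular black on the script
    raise ValueError(f"cannot clean {path}: not a notebook")
```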

> Yes. With `black[jupyter]`, we can run `black nb.ipynb` directly.

check which function is called internally, and let's call the same function (instead of using subprocess)

94rain commented 2 years ago

> [1] to determine whether the input file is valid: if so, convert it into a notebook and apply black for notebooks; if it's invalid but is a .py file, apply regular black; if it's invalid and not .py, throw an error

What about lint? flake8 only applies to .py files and still needs the ipynb->py logic. If I update the `get_file` method to do [1], lint will break. How about two methods: `get_py_file` for lint and `get_ipynb_file_when_applicable` for clean?

edublancas commented 2 years ago

good point. yeah, we need two methods: for flake8 we always need to convert to .py

94rain commented 2 years ago

> the only edge we need to handle is when the input file is not an actual notebook. for example a .py that does not have the `# %%` markers to represent notebook cell splits

Looks like `guess_format` will still treat a plain .py file as the notebook "light" format, so it will still be converted to a notebook. Can we just convert all .py files to ipynb for cleaning? If so, we can have a single method, `get_file(task_file, write=False, output_ext=".ipynb")`, with a new `output_ext` argument covering both lint and clean.
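A sketch of that single-method idea; `convert` is a hypothetical stand-in for the jupytext conversion (the real code would write the converted file, not just compute its path):

```python
from pathlib import Path

def convert(path: Path, output_ext: str) -> Path:
    # hypothetical stand-in for a jupytext round trip; only the
    # resulting path matters for this sketch
    return path.with_suffix(output_ext)

def get_file(task_file: Path, output_ext: str = ".ipynb") -> Path:
    # clean uses the ".ipynb" default (for black[jupyter]); lint passes
    # output_ext=".py" so flake8 gets a plain script
    if task_file.suffix == output_ext:
        return task_file
    return convert(task_file, output_ext)
```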

94rain commented 2 years ago

@edublancas I updated PR #83 with the above change for now, and all tests are passing. Do we still want to keep plain .py files from being converted to ipynb?

edublancas commented 2 years ago

closed by #83