kynan / nbstripout

strip output from Jupyter and IPython notebooks
Other
1.19k stars 94 forks source link

Can I use nbstripout in Github action as a test ? #158

Open 12rambau opened 2 years ago

12rambau commented 2 years ago

I'm using the pre-commit hook of nbstripout and black to improve code quality of my python applications. As some of the maintainer are not installing pre-commit they are still pushing code with outputs.

Would it be possible to use nbstripout in Github action to check the presence of output ?

joaonc commented 2 years ago

This question is probably better suited for StackOverflow. That said, check out the --dry-run option. However, this currently always return 0, which may make it not usable as a GH action (not super familiar w/ these though, so may be wrong). https://github.com/kynan/nbstripout/issues/159

kynan commented 2 years ago

Interesting use case. Have you had a chance to explore this further?

12rambau commented 2 years ago

not yet, but that's definitely on my to-do list.

kynan commented 1 year ago

@12rambau Are you still interested in this? Feel like working on it?

12rambau commented 1 year ago

I'm still interested but I was swamped with many things all year, I'll do some open source work starting this week and this repository is on my todo list

kynan commented 1 year ago

Perfect! Keep me posted and let me know if you have questions, get stuck etc. :)

arobrien commented 1 year ago

I created a script to do this as a stop-gap measure until we can get this into nbstripout. I'm putting in a few Pull requests now to try and make that happen, starting with #176 to get tests working on windows, and then aiming to do a re-organisation to reduce code duplication and move functionality out of main() after that.

This allows me to call python nbstripout_wrapper.py --check notebooks in my CI, and users can run python nbstripout_wrapper.py notebooks to strip all of their notebooks manually if they don't want to implement a pre-commit hook for whatever reason.

It's a bit quick-and-dirty, and it ends up re-implementing (copy-paste) a lot of nbstripout code, but key improvements were to:

""""
Re-packages nbstripout functionality to be more useful for CI/CD pipelines.

Supports most of the same options as nbstripout, but does not support installation as a git hook.

Based on nbstripout version 0.6.1
This is currently very fragile to changes in nbstripout internal structure, so it is recommended to fix the version in
requirements.txt
"""

from argparse import ArgumentParser
from copy import deepcopy
import io
from pathlib import Path
from typing import Any, List
import warnings

from nbformat import NotebookNode, read, write, NO_CONVERT
from nbstripout import strip_output
from nbstripout._nbstripout import _parse_size

def process_args():
    parser = ArgumentParser()
    parser.add_argument(
        "--check",
        action="store_true",
        help="Run check without modifying any files. Returns 1 if files would have been modified.",
    )
    parser.add_argument("--keep-count", action="store_true", help="Do not strip the execution count/prompt number")
    parser.add_argument("--keep-output", action="store_true", help="Do not strip output", default=None)
    parser.add_argument(
        "--extra-keys",
        default="",
        help="Space separated list of extra keys to strip " "from metadata, e.g. metadata.foo cell.metadata.bar",
    )
    parser.add_argument(
        "--drop-empty-cells",
        action="store_true",
        help="Remove cells where `source` is empty or contains only whitepace",
    )
    parser.add_argument(
        "--drop-tagged-cells", default="", help="Space separated list of cell-tags that remove an entire cell"
    )
    parser.add_argument(
        "--strip-init-cells", action="store_true", help="Remove cells with `init_cell: true` metadata (default: False)"
    )
    parser.add_argument("--max-size", metavar="SIZE", help="Keep outputs smaller than SIZE", default="0")
    parser.add_argument("paths", nargs="*", help="Files or folders to check with nbstripout")
    return parser.parse_args()

ALLOWED_EXTENSIONS = [".ipynb"]

extra_keys = [
    "metadata.signature",
    "metadata.widgets",
    "cell.metadata.collapsed",
    "cell.metadata.ExecuteTime",
    "cell.metadata.execution",
    "cell.metadata.heading_collapsed",
    "cell.metadata.hidden",
    "cell.metadata.scrolled",
]

def read_notebook_file(p: Path):
    with io.open(p, "r", encoding="utf8") as f:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=UserWarning)
            return read(f, as_version=NO_CONVERT)

def write_notebook_file(nb: NotebookNode, p: Path):
    with io.open(p, "w", encoding="utf8", newline="") as f:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=UserWarning)
            write(nb, f)

def process_file(p: Path, is_check: bool, stripout_args: List[Any]):
    if p.suffix not in ALLOWED_EXTENSIONS:
        return []

    nb_input = read_notebook_file(p)

    nb_input_copy = deepcopy(nb_input)  # stip_output modifies input, so save a copy for comparison later
    nb_output = strip_output(nb_input, *stripout_args)

    if not is_check:
        write_notebook_file(nb_output, p)

    if not (nb_input_copy == nb_output):
        print(f"File: {p} was stripped by nbstripout.")
        return [p]
    return []

def process_folder(p: Path, is_check: bool, stripout_args: List[Any]):
    if p.is_dir():
        return [f for e in p.iterdir() for f in process_folder(e, is_check, stripout_args)]
    return process_file(p, is_check, stripout_args)

def main():
    args = process_args()
    is_check: bool = args.check
    if is_check:
        print("Checking Notebooks for outputs using nbstripout:")
    else:
        print("Cleaning Notebooks outputs using nbstripout:")

    stripout_args = [
        args.keep_output,
        args.keep_count,
        extra_keys,
        args.drop_empty_cells,
        args.drop_tagged_cells.split(),
        args.strip_init_cells,
        _parse_size(args.max_size),
    ]

    modified_files = []

    for p_str in args.paths:
        p = Path(p_str)
        if not p.exists():
            raise FileNotFoundError(p)
        modified_files.append(process_folder(p, is_check, stripout_args))

    modified_files = [f for arr in modified_files for f in arr]

    if is_check:
        if modified_files:
            print(f"{len(modified_files)} files would have been modified by nbstripout.")
            exit(1)
        else:
            print("Success: all Notebooks are already clean.")
            exit(0)

    print(f"{len(modified_files)} files were modified by nbstripout.")
    exit(0)

if __name__ == "__main__":
    main()
kynan commented 1 year ago

@arobrien Thanks for that contribution! I'm certainly open to extending / refactoring nbstripout to support that use case. I'll have to rely on contributors to do this work though as it's not a use case I have so I'm not really in a position to provide the requirements. It seems to me that you're motivated enough to get this implemented, which is great :)

More than happy to discuss how you'd like to approach this before you start sending PRs.

kynan commented 1 year ago

FWIW, I was reluctant to implement any functionality to recursively walk a directory tree or anything like that as you can simply use e.g. find to achieve the same (see #127 for additional thoughts), but I might be convinced otherwise to e.g. support Windows users.

arobrien commented 1 year ago

@kynan yes, I've got a little bit of capacity at the moment to help out.

As for approach, I think maintining current functionality is most important, but if we can make nbstripout work the same as black, at least as far as the Black The Basics help page then that would be great. Your licence seems to be pretty much exactly the same as the MIT licence used by black, so copying code shouldn't be a problem (I'm curious why you don't just use the standard MIT licence). Key features of black mentioned in that document:

I can see that in the strictest sense none of these features are necessary if you combine them with tools like find, or more complex command line scripts, but I would suggest that Black serves as a model of how this sort of destructive linting tool can be easily worked into both IDEs and Continuous Integration pipelines. Black also serves as a model to how to decide design questions such as option lookup hierarchy.

I see a potential roadmap for achieving this as being:

  1. Improve testing for current use cases
  2. Re-organise the code to separate out main areas of concern:
    • Entry point/command line parsing
    • File list iteration/processing
    • Modification of the file/JSON structure
    • Git integration/installation/status
  3. Implement features one at a time, in small steps:
    1. Complete configuration file processing started in #169
    2. Implement directory walk the same way as in black (process arguments into a file list, and then iterate over that list) https://github.com/psf/black/blob/d97b7898b34b67eb3c6839998920e17ac8c77908/src/black/__init__.py#L614
    3. Implement --check
    4. etc...
kynan commented 1 year ago

I've got a little bit of capacity at the moment to help out.

Great to hear!

I think maintining current functionality is most important

Fully agree!

If we can make nbstripout work the same as black, [...] then that would be great

Also agree :)

copying code shouldn't be a problem

I recommend against this, regardless of the license.

I'm curious why you don't just use the standard MIT licence

It is exactly the MIT License :)

kynan commented 1 year ago

@arobrien your roadmap looks reasonable, thanks!

12rambau commented 1 year ago

I see that folks have clearly taken over this matter and thanks for all progresses you made so far. I just want to make a small update on my initial use case.

The problem was the following: How to perform checks equivalent to the one done in pre-commits in the CI so that users that forgot to install it get a failed test. The solution was already existing in GitHub actions: https://github.com/pre-commit/action

I now use it in all my repositories and I get the test I was expecting for nbstripout: https://github.com/12rambau/sepal_ui/blob/452bc1df1277308c789bc324251464928e7e49fa/.github/workflows/unit.yml#L13

I leave it up to you to continue working on it or close it.

kynan commented 1 year ago

Thanks for the update @12rambau! It's not entirely clear to me where nbstripout is invoked in the example you linked to?

12rambau commented 1 year ago

the github action process is calling the .pre-commit-config.yaml file of the repository. I just wanted to show how easy it is to reproduce the same behavior as the local pre-commit checks.

nbstripout is called in my config file here: https://github.com/12rambau/sepal_ui/blob/main/.pre-commit-config.yaml

kynan commented 1 year ago

Gotcha, that's the connection I had missed! Do I understand correctly that you're happy with this workflow and don't require any (further) changes in nbstripout to support your use case?

12rambau commented 1 year ago

Yes, I'm more than happy with this workflow and yes from my side everything I need is already covered by the current implementation of nbstripout.