Open 12rambau opened 3 years ago
This question is probably better suited for StackOverflow.
That said, check out the --dry-run
option. However, this currently always return 0
, which may make it not usable as a GH action (not super familiar w/ these though, so may be wrong). https://github.com/kynan/nbstripout/issues/159
Interesting use case. Have you had a chance to explore this further?
not yet, but that's definitely on my to-do list.
@12rambau Are you still interested in this? Feel like working on it?
I'm still interested but I was swamped with many things all year, I'll do some open source work starting this week and this repository is on my todo list
Perfect! Keep me posted and let me know if you have questions, get stuck etc. :)
I created a script to do this as a stop-gap measure until we can get this into nbstripout. I'm putting in a few Pull requests now to try and make that happen, starting with #176 to get tests working on windows, and then aiming to do a re-organisation to reduce code duplication and move functionality out of main()
after that.
This allows me to call python nbstripout_wrapper.py --check notebooks
in my CI, and users can run python nbstripout_wrapper.py notebooks
to strip all of their notebooks manually if they don't want to implement a pre-commit hook for whatever reason.
It's a bit quick-and-dirty, and it ends up re-implementing (copy-paste) a lot of nbstripout code, but key improvements were to:
--check
, return 1 if any files would have been modified.""""
Re-packages nbstripout functionality to be more useful for CI/CD pipelines.
Supports most of the same options as nbstripout, but does not support installation as a git hook.
Based on nbstripout version 0.6.1
This is currently very fragile to changes in nbstripout internal structure, so it is recommended to fix the version in
requirements.txt
"""
from argparse import ArgumentParser
from copy import deepcopy
import io
from pathlib import Path
from typing import Any, List
import warnings
from nbformat import NotebookNode, read, write, NO_CONVERT
from nbstripout import strip_output
from nbstripout._nbstripout import _parse_size
def process_args():
parser = ArgumentParser()
parser.add_argument(
"--check",
action="store_true",
help="Run check without modifying any files. Returns 1 if files would have been modified.",
)
parser.add_argument("--keep-count", action="store_true", help="Do not strip the execution count/prompt number")
parser.add_argument("--keep-output", action="store_true", help="Do not strip output", default=None)
parser.add_argument(
"--extra-keys",
default="",
help="Space separated list of extra keys to strip " "from metadata, e.g. metadata.foo cell.metadata.bar",
)
parser.add_argument(
"--drop-empty-cells",
action="store_true",
help="Remove cells where `source` is empty or contains only whitepace",
)
parser.add_argument(
"--drop-tagged-cells", default="", help="Space separated list of cell-tags that remove an entire cell"
)
parser.add_argument(
"--strip-init-cells", action="store_true", help="Remove cells with `init_cell: true` metadata (default: False)"
)
parser.add_argument("--max-size", metavar="SIZE", help="Keep outputs smaller than SIZE", default="0")
parser.add_argument("paths", nargs="*", help="Files or folders to check with nbstripout")
return parser.parse_args()
ALLOWED_EXTENSIONS = [".ipynb"]
extra_keys = [
"metadata.signature",
"metadata.widgets",
"cell.metadata.collapsed",
"cell.metadata.ExecuteTime",
"cell.metadata.execution",
"cell.metadata.heading_collapsed",
"cell.metadata.hidden",
"cell.metadata.scrolled",
]
def read_notebook_file(p: Path):
with io.open(p, "r", encoding="utf8") as f:
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=UserWarning)
return read(f, as_version=NO_CONVERT)
def write_notebook_file(nb: NotebookNode, p: Path):
with io.open(p, "w", encoding="utf8", newline="") as f:
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=UserWarning)
write(nb, f)
def process_file(p: Path, is_check: bool, stripout_args: List[Any]):
if p.suffix not in ALLOWED_EXTENSIONS:
return []
nb_input = read_notebook_file(p)
nb_input_copy = deepcopy(nb_input) # stip_output modifies input, so save a copy for comparison later
nb_output = strip_output(nb_input, *stripout_args)
if not is_check:
write_notebook_file(nb_output, p)
if not (nb_input_copy == nb_output):
print(f"File: {p} was stripped by nbstripout.")
return [p]
return []
def process_folder(p: Path, is_check: bool, stripout_args: List[Any]):
if p.is_dir():
return [f for e in p.iterdir() for f in process_folder(e, is_check, stripout_args)]
return process_file(p, is_check, stripout_args)
def main():
args = process_args()
is_check: bool = args.check
if is_check:
print("Checking Notebooks for outputs using nbstripout:")
else:
print("Cleaning Notebooks outputs using nbstripout:")
stripout_args = [
args.keep_output,
args.keep_count,
extra_keys,
args.drop_empty_cells,
args.drop_tagged_cells.split(),
args.strip_init_cells,
_parse_size(args.max_size),
]
modified_files = []
for p_str in args.paths:
p = Path(p_str)
if not p.exists():
raise FileNotFoundError(p)
modified_files.append(process_folder(p, is_check, stripout_args))
modified_files = [f for arr in modified_files for f in arr]
if is_check:
if modified_files:
print(f"{len(modified_files)} files would have been modified by nbstripout.")
exit(1)
else:
print("Success: all Notebooks are already clean.")
exit(0)
print(f"{len(modified_files)} files were modified by nbstripout.")
exit(0)
if __name__ == "__main__":
main()
@arobrien Thanks for that contribution! I'm certainly open to extending / refactoring nbstripout
to support that use case. I'll have to rely on contributors to do this work though as it's not a use case I have so I'm not really in a position to provide the requirements. It seems to me that you're motivated enough to get this implemented, which is great :)
More than happy to discuss how you'd like to approach this before you start sending PRs.
FWIW, I was reluctant to implement any functionality to recursively walk a directory tree or anything like that as you can simply use e.g. find
to achieve the same (see #127 for additional thoughts), but I might be convinced otherwise to e.g. support Windows users.
@kynan yes, I've got a little bit of capacity at the moment to help out.
As for approach, I think maintining current functionality is most important, but if we can make nbstripout
work the same as black
, at least as far as the Black The Basics help page then that would be great. Your licence seems to be pretty much exactly the same as the MIT licence used by black, so copying code shouldn't be a problem (I'm curious why you don't just use the standard MIT licence). Key features of black mentioned in that document:
source_file_or_directory
on the command line, as well as -
to indicate stdin
, with options for include/exclude rules when walking a directory--check
pyproject.toml
I can see that in the strictest sense none of these features are necessary if you combine them with tools like find
, or more complex command line scripts, but I would suggest that Black serves as a model of how this sort of destructive linting tool can be easily worked into both IDEs and Continuous Integration pipelines. Black also serves as a model to how to decide design questions such as option lookup hierarchy.
I see a potential roadmap for achieving this as being:
--check
I've got a little bit of capacity at the moment to help out.
Great to hear!
I think maintining current functionality is most important
Fully agree!
If we can make nbstripout work the same as black, [...] then that would be great
Also agree :)
copying code shouldn't be a problem
I recommend against this, regardless of the license.
I'm curious why you don't just use the standard MIT licence
It is exactly the MIT License :)
@arobrien your roadmap looks reasonable, thanks!
I see that folks have clearly taken over this matter and thanks for all progresses you made so far. I just want to make a small update on my initial use case.
The problem was the following: How to perform checks equivalent to the one done in pre-commits in the CI so that users that forgot to install it get a failed test. The solution was already existing in GitHub actions: https://github.com/pre-commit/action
I now use it in all my repositories and I get the test I was expecting for nbstripout: https://github.com/12rambau/sepal_ui/blob/452bc1df1277308c789bc324251464928e7e49fa/.github/workflows/unit.yml#L13
I leave it up to you to continue working on it or close it.
Thanks for the update @12rambau! It's not entirely clear to me where nbstripout
is invoked in the example you linked to?
the github action process is calling the .pre-commit-config.yaml
file of the repository. I just wanted to show how easy it is to reproduce the same behavior as the local pre-commit checks.
nbstripout is called in my config file here: https://github.com/12rambau/sepal_ui/blob/main/.pre-commit-config.yaml
Gotcha, that's the connection I had missed! Do I understand correctly that you're happy with this workflow and don't require any (further) changes in nbstripout to support your use case?
Yes, I'm more than happy with this workflow and yes from my side everything I need is already covered by the current implementation of nbstripout.
Another update: as of release 0.8.0, there is a --verify
flag (thanks to @jspaezp, see #195), an enhanced dry run mode which exits with return code 1 if nbstripout
would have stripped anything and 0 otherwise.
Also, as of 6d181dc, --dry-run
only prints output if the notebook would have actually been modified.
@12rambau @arobrien @joaonc Does this satisfy your use cases?
it absolutely does ! thanks for working on it.
I'm using the pre-commit hook of nbstripout and black to improve code quality of my python applications. As some of the maintainer are not installing pre-commit they are still pushing code with outputs.
Would it be possible to use nbstripout in Github action to check the presence of output ?