kynan / nbstripout

strip output from Jupyter and IPython notebooks

Support git-filter-repo #193

Closed (LunarLanding closed 5 months ago)

LunarLanding commented 6 months ago

git-filter-branch now points to https://github.com/newren/git-filter-repo via a scary warning:

WARNING: git-filter-branch has a glut of gotchas generating mangled history
     rewrites.  Hit Ctrl-C before proceeding to abort, then use an
     alternative filtering tool such as 'git filter-repo'
     (https://github.com/newren/git-filter-repo/) instead.  See the
     filter-branch manual page for more details; to squelch this warning,
     set FILTER_BRANCH_SQUELCH_WARNING=1.

Apparently git-filter-repo is a Python tool that can run Python code directly on the blob contents, like so (found here, edited so it works for my case):

# Caution: --path-glob also limits the rewritten history to the matching paths.
git filter-repo --path-glob '**/*.ipynb' --blob-callback '
import json
try:
    notebook = json.loads(blob.data)
    cleaned = False
    if isinstance(notebook, dict) and isinstance(notebook.get("cells"), list):
        for cell in notebook["cells"]:
            # Empty the outputs of every cell that has any.
            if isinstance(cell, dict) and cell.get("outputs"):
                cell["outputs"] = []
                cleaned = True
        if cleaned:
            print("cleaned")
            # Only rewrite blobs that actually changed.
            blob.data = (json.dumps(notebook, ensure_ascii=False, indent=1,
                                    sort_keys=True) + "\n").encode("utf-8")
except (json.JSONDecodeError, UnicodeDecodeError):
    # Not a valid notebook: leave the blob untouched.
    pass
'

It would be nice to have something like this but doing the rewriting with nbstripout.
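
Before rewriting history, the callback body above can be tried on a single notebook by pulling the same logic into a plain function. This is only a sketch: `strip_outputs` and the sample notebook are hypothetical names made up for illustration, and it mirrors the snippet above rather than anything nbstripout itself does (for example, it does not reset execution counts).

```python
import json


def strip_outputs(data: bytes) -> bytes:
    """Return notebook bytes with all non-empty cell outputs emptied.

    Hypothetical helper mirroring the blob callback above. Input that is not
    valid JSON, is not notebook-shaped, or has no outputs is returned unchanged.
    """
    try:
        notebook = json.loads(data)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return data  # not a notebook: leave the bytes untouched

    if not isinstance(notebook, dict) or not isinstance(notebook.get("cells"), list):
        return data

    cleaned = False
    for cell in notebook["cells"]:
        if isinstance(cell, dict) and cell.get("outputs"):
            cell["outputs"] = []
            cleaned = True

    if not cleaned:
        return data  # nothing to strip, so avoid re-serializing

    text = json.dumps(notebook, ensure_ascii=False, indent=1, sort_keys=True)
    return (text + "\n").encode("utf-8")


if __name__ == "__main__":
    # Made-up one-cell notebook, just enough structure to exercise the function.
    sample = json.dumps(
        {"cells": [{"cell_type": "code", "source": "1 + 1",
                    "outputs": [{"output_type": "execute_result"}]}]}
    ).encode("utf-8")
    print(strip_outputs(sample).decode("utf-8"))
```

Inside the callback, the equivalent would be assigning the function's result back to `blob.data`; returning the input unchanged for non-notebook data matches the `pass` branches in the original snippet.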

kynan commented 6 months ago

That is a great suggestion and one that has also crossed my mind before, in particular since we currently still mention git filter-branch in the README.

@LunarLanding Interested in sending a PR to document your approach in the README, potentially replacing the current recipe for git filter-branch?

LunarLanding commented 6 months ago

@kynan Just did. Because I wanted to keep everything in the same interpreter and minimize disk reads and writes for performance, it is slightly involved, but I think it is still useful information.