Improve git-tracking for Jupyter notebooks

aufdenkampe commented 4 years ago

As we were working on the testing system (#31), @rheaphy, @steveskrip, @ptomasula and I discussed on a call the challenges of git-tracking Jupyter notebooks, because output cells are updated every time they are run even if the Python code or markdown doesn't change.

Let's work out a good way to save and track Jupyter notebooks.

Here's @ptomasula's email, "HSP2 Potential Solutions for Python Notebooks in GitHub", with background and options:

I did a bit of digging into potential solutions for dealing with merging Python notebooks in git. I found three potential solutions, which are outlined below, but I’m personally leaning towards option 2. Option 2 is extremely easy to implement, solves the immediate merging challenges Bob was describing, and while there is a slight potential for issues around resolving merge conflicts between binary files, those can be largely avoided by coordinating our efforts (which we’ve done very well thus far). I’d be curious to hear thoughts from the rest of the group. I’d also note that whichever option we choose isn’t necessarily set in stone. If we find the approach isn’t working for us, we can change it down the road. We can also deviate from these options if anyone has a great suggestion for a solution.

Background/Problem

Python notebooks are stored as JSON, which provides for source tracking. However, when a notebook cell is run, certain cell attributes (output, execution count, etc.) are updated. This causes a number of issues, including:

Added difficulty managing merge conflicts (manual line-by-line process)

Larger and somewhat unruly commits

More difficult to review: important changes in code and less critical changes resulting from running a cell (trivial for the purposes of source tracking?) both appear in a diff.

Option 1 - Strip output block out prior to commit

Python notebooks are stored as JSON. This makes it fairly easy to read and programmatically strip out the pieces of the document that are causing the issues described above. We could either write a script or use an existing one to accomplish this.

Pros

Fairly easy to implement. There already appear to be a number of tools that do this (https://pypi.org/project/nbstripout/ or https://github.com/toobaz/ipynb_output_filter). It would also not be a big lift to develop something to do this if we need to.

Still allows us to source track code changes in the cells' contents (‘source’ attribute).

Cons

We lose the ability to share the output directly in the notebook, which may be of some value to users.

Extra step to committing code. Need to run a conversion tool prior to commit. (We might be able to get around this using a gitattributes filter, but I haven’t any experience using that.)
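As a rough illustration of Option 1, the stripping step could look something like the sketch below. It assumes the nbformat v4 layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"); the function names strip_outputs and strip_file are hypothetical, and existing tools such as nbstripout handle more cases than this does.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with outputs and execution counts cleared.

    Assumes the nbformat v4 layout; other notebook versions are not handled.
    """
    stripped = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def strip_file(path: str) -> None:
    """Rewrite the notebook at `path` in place with its outputs stripped."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Calling strip_file on each notebook before committing would leave only the 'source' content and metadata in the tracked file, at the cost of the shared output noted above.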

Option 2 - Enable notebooks to be handled as binary

Utilize git attributes to flag all or selected notebook files as binary. This would overwrite the entire file upon commit, and reduce conflict resolution to choosing which version to use (instead of manually resolving lines).

Pros

Easy to implement

Allows for the output from the notebooks to be shared

Cons

Slight potential to overwrite code changes because of how merge conflicts are handled between binary files (i.e. must use either my file or their file)

Less explicit tracking of code changes, but could still tease them out by comparing versions
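For Option 2, the change amounts to one rule in the repository's .gitattributes file. The sketch below adds such a rule if it is not already present. The pattern "*.ipynb binary" (every notebook, using git's built-in binary attribute macro) is an assumption for illustration, as is the helper name ensure_binary_rule; the rule actually committed to the repository may differ.

```python
from pathlib import Path

# Assumed rule: mark every notebook as binary so git skips text diffs and merges.
BINARY_RULE = "*.ipynb binary"


def ensure_binary_rule(repo_root: str, rule: str = BINARY_RULE) -> bool:
    """Append `rule` to .gitattributes if missing; return True if the file changed."""
    attributes = Path(repo_root) / ".gitattributes"
    existing = attributes.read_text(encoding="utf-8") if attributes.exists() else ""
    if rule in (line.strip() for line in existing.splitlines()):
        return False
    # Keep the new rule on its own line even if the file lacks a trailing newline.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    attributes.write_text(existing + separator + rule + "\n", encoding="utf-8")
    return True
```

With a rule like this in place, a conflicting notebook is resolved by picking one whole file, for example with git checkout --ours or git checkout --theirs on that path.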

Option 3 – Use a merge management tool

Use a tool to ease the merge process. The most promising one I've come across is nbdime (https://github.com/jupyter/nbdime), which was developed by the Jupyter team.

Pros

Potential for a best-of-both-worlds approach: easier to manage conflicts while still retaining change tracking on a line-by-line basis

Cons

Appears to be console only (at least nbdime does, but there may be other tools out there)

Doesn't solve unruly commits when looking at them on GitHub

Still requires some level of manual conflict resolution (it’s just drastically reduced)
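To make the idea behind Option 3 concrete, here is a minimal sketch of a content-aware comparison that looks only at code cell sources and ignores outputs and execution counts. This is not nbdime and does not use its API; it is a standard-library illustration, assuming the nbformat v4 layout, with hypothetical function names code_lines and source_diff.

```python
import difflib


def code_lines(notebook: dict) -> list:
    """Flatten the source of every code cell into a list of lines, ignoring outputs."""
    lines = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of line strings.
        text = "".join(source) if isinstance(source, list) else source
        lines.append(f"# cell {index}")
        lines.extend(text.splitlines())
    return lines


def source_diff(old: dict, new: dict) -> list:
    """Return a unified diff of code cell sources; empty if only outputs changed."""
    return list(
        difflib.unified_diff(code_lines(old), code_lines(new), "old", "new", lineterm="")
    )
```

Two notebooks that differ only because cells were re-run produce an empty diff here, which is the kind of noise reduction a dedicated tool provides along with real merge support.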

aufdenkampe commented 4 years ago

Email response from @rheaphy:

Thank you for this research! I too lean toward option 2.

Even if we save notebooks as binary data, perhaps we can still use Jupyter's nbdime to perform the diffs. It was designed to handle the JSON, embedded HTML, and other junk. I haven't used it since it assumes you are using Git and, until the last couple of weeks, I was using Mercurial.

aufdenkampe commented 4 years ago

@rheaphy, my commit https://github.com/LimnoTech/HSPsquared/commit/32c93eff0dd3671a5c7c57e420f3909cf71e3dfa should now enable @ptomasula's Option 2 and therefore allow us to use this repo to exchange the Jupyter notebooks that are an essential component of #31.

aufdenkampe commented 3 years ago

With merging #43 into Master, we can close this!

aufdenkampe commented 3 years ago

Our implementation of Option 2 - Enable notebooks to be handled as binary, described above, is no longer working sufficiently well, as it completely obscures advances in our Jupyter notebooks.

Recent advances in how GitHub and GitHub Desktop visualize commit changes have made it easier to navigate the diff of large JSON-formatted .ipynb file content. For this reason, I think we should revert to our original approach. I'll do this shortly.

Meanwhile, I'm reopening this issue to remind us to explore Option 3 – Use a merge management tool. A few articles on this topic are worth reviewing:

One new option is particularly interesting: https://www.reviewnb.com (free for public repositories)

respec / HSPsquared

Improve git-tracking for Jupyter notebooks #35