microsoft / gather

Spit shine for Jupyter notebooks 🧽✨
https://microsoft.github.io/gather
MIT License
533 stars, 38 forks

Add a "Clear History" action #7

Open andrewhead opened 5 years ago

andrewhead commented 5 years ago

Is your feature request related to a problem? Please describe.

A user might want to clear the history of a notebook, e.g., if they executed a cell with some sensitive data that they don't want to have stored to the notebook file.

Describe the solution you'd like

Add an action to the interface that lets someone "Clear History". This would then reset the execution history log, and make sure that any history metadata saved with the notebook is emptied.

Describe alternatives you've considered

Currently, an analyst could open up the ipynb file on their own, and delete the metadata that includes the execution history.
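For readers who want to try the manual route, here is a minimal sketch of it in Python. An .ipynb file is JSON with a top-level `metadata` object, so the edit amounts to deleting one entry and writing the file back. The function name `drop_notebook_metadata_key` is made up for illustration, and the key that holds the execution history is not named in this thread, so it is passed in as a parameter rather than guessed.

```python
import json
from pathlib import Path


def drop_notebook_metadata_key(path, key):
    """Remove one top-level metadata entry from an .ipynb file, if present.

    Returns True when the key existed and was removed, False otherwise.
    `key` is whatever name the history is stored under (an assumption
    the caller has to supply; it is not taken from this thread).
    """
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    metadata = notebook.get("metadata", {})
    if key not in metadata:
        return False
    del metadata[key]
    # Notebooks are conventionally written with one-space indentation.
    nb_path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
    return True
```

Inspecting `notebook["metadata"].keys()` first is a reasonable way to find the right key before deleting anything, and working on a copy of the file is safer than editing in place.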

Additional context

One additional benefit of this is reducing storage space for notebooks with very long histories.

mathematicalmichael commented 4 years ago

This is BEYOND NECESSARY and really should be priority number 1.

I had to adapt a custom linter to notebooks to get rid of dozens of MBs of data being written (and un-removable by all the standard cleaning tools), where this package was tracking every single output of matplotlib that the notebook ever created. It took way too long to discover this extension as the culprit (it was installed by default on a jupyterhub I'm working from).

here's the nuclear option to wipe all metadata that I came to: https://gist.github.com/mathematicalmichael/a206b2a21de0bf88a5703e8700403019
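The gist is not reproduced here. As a rough idea of what wiping all metadata can look like, the following is a minimal sketch that assumes the standard nbformat v4 layout (a notebook-level `metadata` object plus a `metadata` object on each cell). The function names are illustrative and are not taken from the gist.

```python
import json
from pathlib import Path


def wipe_all_metadata(notebook):
    """Empty notebook-level and per-cell metadata in place; return the notebook."""
    notebook["metadata"] = {}
    for cell in notebook.get("cells", []):
        cell["metadata"] = {}
    return notebook


def wipe_file(path):
    """Apply wipe_all_metadata to an .ipynb file on disk."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    wipe_all_metadata(notebook)
    nb_path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
```

Cell sources and outputs are left untouched. Because this also drops entries such as `kernelspec`, Jupyter may ask which kernel to use the next time the notebook is opened.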

Suggestions welcome. A built-in way to clear the history would be preferable to the approach outlined in the README, which is not a sustainable solution for ~100MB worth of png image data.

andrewhead commented 4 years ago

@mathematicalmichael Thanks for weighing in. I'm sorry to hear you got bitten by this :/

Hearing your scenario makes me realize it's important to get sensible defaults for recording, and for allowing deletion of, history. It hadn't occurred to me that this plugin may be installed without the user electing to install it, which changes the equation.

I'll lay out a few potential designs I'm considering:

A. The "nuclear" option: Add a Clear History button.
B. When a kernel is restarted, delete prior history.
C. Alert a user when the history is getting quite large, assuming that users might not be aware of the presence of the plugin and the archive it's building. Provide a button for them to clear the history and, if they wish, to disable the plugin.
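To make design C concrete, here is a minimal sketch of the size check it would need, written in Python for illustration only. It measures the serialized size of the notebook-level metadata, which is where the history is said to be saved earlier in this thread. The function names and the 10 MB default are arbitrary assumptions, not values proposed by anyone here.

```python
import json


def metadata_size_bytes(notebook):
    """Size of the notebook-level metadata when serialized as UTF-8 JSON."""
    return len(json.dumps(notebook.get("metadata", {})).encode("utf-8"))


def should_warn(notebook, limit_bytes=10 * 1024 * 1024):
    """True when the serialized metadata exceeds the (arbitrary) limit."""
    return metadata_size_bytes(notebook) > limit_bytes
```

A real implementation would probably measure only the history entry rather than all metadata, and would run the check on save rather than on demand.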

Your perspective and others' on a workable design are welcome.

mathematicalmichael commented 4 years ago

Hey, sorry I'm just catching up to my notifications.

> It hadn't occurred to me that this plugin may be installed without the user electing to install it, which changes the equation.

Wow, that is really well said. I also managed to eventually get in touch with the right IT person who swapped some path permissions for JLab/Extensions, and was able to allow me to disable it altogether. I ended up only using that nuclear script of mine a couple of times.


I think A is absolutely necessary for those of us who need to use archaic tools to review git diffs of notebooks on occasion. B... hmmm. I can see the utility of wanting to "go back in time" for the cell specifically. C sounds like an excellent idea. Disabling the plugin should trigger A to minimize file size.

Riffing off of B: getting rid of output is the really important part. Cell history can be preserved. So, maybe yeah... clear all figures and dataframe previews and such when the kernel is restarted. They shouldn't be trusted anyway. But reverting to a previous state of the notebook text shouldn't be as taxing on the file size.

So that implies a version of A that wipes output history but preserves input history. That could be a nice middle ground, alongside B. C would largely be avoided if B were in place, I think.
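A minimal sketch of that middle ground, in Python for illustration. The format gather uses for its history is not described in this thread, so this assumes, purely hypothetically, that each history record is shaped like an nbformat v4 code cell with a `source` field and an `outputs` list. The function name is also made up.

```python
import copy


def strip_outputs(records):
    """Return a copy of the records with outputs emptied and source kept.

    Assumes (hypothetically) that each record is a dict with an optional
    "outputs" list, like an nbformat v4 code cell. The input is not modified.
    """
    stripped = copy.deepcopy(records)
    for record in stripped:
        if "outputs" in record:
            record["outputs"] = []
    return stripped
```

Since figures and dataframe previews live in the outputs, this would remove most of the bulk while keeping the text needed to revert a cell to an earlier version.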