jupyter / nbdime

Tools for diffing and merging of Jupyter notebooks.
http://nbdime.readthedocs.io

use nbdime as a clever git filter #478

Open qmarcou opened 5 years ago

qmarcou commented 5 years ago

Hi, first of all, thank you so much for the work on nbdime; it really makes integrating Jupyter notebooks into a version control workflow much easier!

Still, I'm struggling to get some kind of "optimal" git tracking of my notebooks, one that prevents metadata and output from changing at every commit. I have checked (hopefully thoroughly) the different issues (e.g. #423 and #410) and the pieces of documentation related to this.

From what I gathered, here is what I got (please correct me if I'm wrong):

Basically this only leaves 2 solutions: either track every single change in metadata and output or never have them in the git history.

I think it would be good to have an intermediate option that tracks (chosen) metadata and output, and adds changes in metadata/output to commits only when desired. When a notebook is full of plots, some of them potentially slow to generate, it would be quite helpful to be able to keep a PNG of each plot inside the notebook (though I agree that if the plots take time to generate, one should probably find a workaround by saving the processed data and/or the figure in a convenient format).

I was thinking along these lines while looking for a solution, and I think I found a lead: the idea would be to use nbdime as a smarter filter than nbstripout. Since nbdime is able to compute diffs nicely, one could exploit this ability to revert all changes in input/metadata/output so that they match the last commit (the idea would be to have something similar to git checkout myfile.ipynb that would only revert pieces of the file). For example, if one only wants to commit changes in input cells, nbdime could compute a diff on everything but input cells (usually we would have used nbdime the other way around), and then revert all differences found in that diff to the last commit (which should be doable, since the diff gives a line-by-line mapping, such that line-by-line substitutions/insertions/deletions can be performed). This would be executed as a special git input filter, for instance (people would be able to create git aliases for different git add strategies). I think this approach would be a good compromise for the problem described above.

Maybe I'm missing some details that make this approach intractable, but given how nicely nbdime works, I feel it could be implemented. What do you think?
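To make the "revert everything but input cells" idea concrete, here is a minimal sketch. It is not nbdime's algorithm: it uses only the Python standard library and aligns cells with `difflib.SequenceMatcher` as a simplified stand-in for nbdime's notebook-aware diff. The function name `keep_source_changes_only` is hypothetical, and notebooks are assumed to be plain dicts in nbformat 4 layout.

```python
import copy
from difflib import SequenceMatcher


def _source(cell):
    """Return a cell's source as one string (nbformat allows str or list of str)."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def keep_source_changes_only(committed, working):
    """Return a copy of `working` whose outputs and metadata match `committed`.

    Hypothetical helper: cell sources come from `working`, while notebook
    metadata, cell metadata, outputs and execution counts are restored from
    `committed` wherever a cell can be paired with a committed one.
    """
    result = copy.deepcopy(working)
    result["metadata"] = copy.deepcopy(committed.get("metadata", {}))

    old_cells = committed.get("cells", [])
    new_cells = result.get("cells", [])

    # Start from a stripped state, so cells with no committed counterpart
    # (newly inserted cells) carry no output or metadata into the commit.
    for cell in new_cells:
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None

    # Align the two cell lists on (cell type, source text).
    matcher = SequenceMatcher(
        a=[(c.get("cell_type"), _source(c)) for c in old_cells],
        b=[(c.get("cell_type"), _source(c)) for c in new_cells],
        autojunk=False,
    )
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal": unchanged cells. Same-length "replace": cells edited in
        # place, which keep the committed (possibly stale) outputs.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for old, new in zip(old_cells[i1:i2], new_cells[j1:j2]):
                if old.get("cell_type") != new.get("cell_type"):
                    continue  # type changed: leave the cell stripped
                new["metadata"] = copy.deepcopy(old.get("metadata", {}))
                if new.get("cell_type") == "code":
                    new["outputs"] = copy.deepcopy(old.get("outputs", []))
                    new["execution_count"] = old.get("execution_count")
    return result
```

The pairing here is deliberately crude: a run of edited cells is only matched when the committed and working runs have the same length, and everything else is treated as new. A real implementation would presumably rely on nbdime's own diff to pair cells more robustly.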

Sorry for the very long message; I've been trying to make myself as clear as possible. Again, thanks for the good work!

Best

vidartf commented 5 years ago

Hi!

If you wanted to make a git filter based on nbdime, I think the simplest logic would be to:

This could possibly be added to nbdime as another CLI entry point, but it might be better to play around with the idea as a separate script first (simply importing the methods from nbdime). If you get something working, we can look at helping to get it integrated with nbdime via a PR.
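As an illustration of what the plumbing of such a standalone script might look like, here is a rough sketch of a git clean filter wrapper. It only assumes what git itself provides: a clean filter receives the working-tree file on stdin and writes the content to be staged on stdout, and `git show <rev>:<path>` prints a committed file. The names `load_committed` and `run_clean_filter` are hypothetical, the `transform` callable stands for whatever nbdime-based logic ends up doing the actual work, and the sketch assumes the filter runs from the top of the working tree so that the path resolves.

```python
import json
import subprocess
import sys


def load_committed(path, rev="HEAD"):
    """Return the notebook stored at `rev:path` as a dict, or None if absent."""
    proc = subprocess.run(
        ["git", "show", f"{rev}:{path}"],
        capture_output=True,
        encoding="utf-8",
    )
    if proc.returncode != 0:
        return None  # e.g. the file is not in the last commit yet
    return json.loads(proc.stdout)


def run_clean_filter(path, transform, stdin=sys.stdin, stdout=sys.stdout,
                     loader=load_committed):
    """Read the working notebook from stdin, write the filtered one to stdout.

    `transform(committed, working)` must return the notebook dict to stage.
    """
    working = json.load(stdin)
    committed = loader(path)
    # Without a committed version there is nothing to revert to, so the
    # working notebook passes through unchanged.
    result = working if committed is None else transform(committed, working)
    json.dump(result, stdout, indent=1, ensure_ascii=False)
    stdout.write("\n")
```

A script built around this could be registered with something like `git config filter.<name>.clean "<command> %f"`, where git replaces `%f` with the path of the file being filtered, plus a matching `filter=<name>` entry for `*.ipynb` in `.gitattributes`. Comparing against `HEAD` rather than the index is a design choice made here for simplicity.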

qmarcou commented 5 years ago

Hi! Thanks a lot for the precise pointers! I'll try playing around with a script to see whether this actually works and whether it's a useful feature. I'll keep you updated. Thanks!

kynan commented 4 years ago

@qmarcou did you get anywhere with this?

FYI, nbstripout has some options to control what to strip and what to keep.

qmarcou commented 4 years ago

Hi @kynan, no, sadly I've been quite busy and have not had time to look into this... Thanks for the pointer, in case I or somebody else finds some time.

stephanecollot commented 4 years ago

I'm really interested in this feature, especially because, if I understood correctly, it would also keep your local cell output when you git pull/checkout, right? nbstripout removes all local cell output after pull/checkout operations.

qmarcou commented 4 months ago

Yes, that's the idea; sadly I never had the time to dig deeper into it...