Create git commits by cell

philosofool commented 2 years ago

I would like to be able to commit individual cells with Git. My company uses notebooks for development. Committing from an interactive environment tends to create messy histories where parts of the developer's interactive state get committed alongside important changes; this makes it hard to see what's "really" being added to the codebase. Alternatively, the developer needs to work around this, which creates a lot of overhead and interrupts interactive development.

A really nice solution would be to have a feature that allowed individual cells to be committed, so that completed work could be committed with a suitable commit message while the rest of the developer's work remained in its current state.

rchiodo commented 2 years ago

Thanks for the suggestion. Can you answer some clarifying questions?

When you say commit a cell, would the commit in git be for the entire notebook, or do you want it to generate say a python file for each commit?
You said committing now creates really messy histories where part of the interactive state gets committed. Did you mean outputs for cells? Could this problem be solved by having outputs cleaned prior to commit for notebooks? Meaning this might solve your entire issue and then no special 'commit for a cell' would be necessary?
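
For readers following the thread, "cleaning outputs prior to commit" can be sketched roughly as below. This is a minimal illustration, not how VS Code or Jupyter implements it: it assumes the standard nbformat 4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`), and the function name `strip_outputs` is hypothetical.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs cleared.

    Assumes the nbformat 4 layout; `strip_outputs` is a hypothetical helper.
    """
    # Deep copy through a JSON round trip so the caller's notebook is untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop the In[n] counter
    return cleaned
```

A script like this could be run on the `.ipynb` file before `git add`, so that only source changes show up in the diff.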

philosofool commented 2 years ago

Good questions.

For committing a cell, what I would like is to commit changes to a single notebook file one (or more) cell(s) at a time. We have a process for exporting notebook work to Python files. The notebook is where the development happens, and so we would like to record commits to the notebook files (which ultimately get exported to Python files).

As far as the messy histories, a lot of what we are doing is data science. One usually needs to complete an end-to-end model before one knows what should get committed. At that point, you have one monolithic notebook to commit, and it's hard for a reviewer to go through commit by commit to see what's being added. It's a little like being told to read a novel and then being asked to talk about the first four pages of chapter 4.

When changes need to be made and multiple changes interact, it's again often necessary to have the end-to-end model before you know it's finished. It would be much nicer if each change got its own commit. The notebook never lives in those intermediate states, however, and incremental commits of a JSON file will invariably result in broken JSON. One solution to the monolithic issue is to finish the notebook but then commit sections (individual cells) to the history. To some degree, this is probably wrapped up in a development procedure that's imperfect, but data science is different from traditional code writing in that you're always exploring whether a solution works, and you never really know until you look at the model's performance.
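
One way to picture a per-cell commit without producing broken JSON is to build the notebook to be committed from the last committed version, swapping in only the selected cells from the working copy. The sketch below is an illustration under stated assumptions, not a proposed implementation: it assumes every cell carries an `id` field (present in nbformat 4.5 and later), the name `stage_cells` is hypothetical, and writing the result into the Git index is left out.

```python
import copy


def stage_cells(committed: dict, working: dict, cell_ids: set) -> dict:
    """Return the committed notebook with only the chosen cells updated.

    Hypothetical helper; assumes each cell has an "id" (nbformat >= 4.5).
    """
    working_by_id = {c["id"]: c for c in working["cells"] if "id" in c}
    staged = copy.deepcopy(committed)
    seen = set()
    new_cells = []
    for cell in staged["cells"]:
        cid = cell.get("id")
        seen.add(cid)
        if cid in cell_ids and cid in working_by_id:
            # Selected cell: take the in-progress version.
            new_cells.append(copy.deepcopy(working_by_id[cid]))
        else:
            # Unselected cell: keep what was last committed.
            new_cells.append(cell)
    # Selected cells that exist only in the working copy are appended at the end.
    for cell in working["cells"]:
        cid = cell.get("id")
        if cid in cell_ids and cid not in seen:
            new_cells.append(copy.deepcopy(cell))
    staged["cells"] = new_cells
    return staged
```

Because the result is always a complete notebook dict, each intermediate commit stays valid JSON even though the working notebook never existed in that state.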

rchiodo commented 2 years ago

Thanks, that helps.

If I'm understanding correctly, the main reason for committing each cell is that you might still be 'working' on the other cells and you don't want to commit them yet.

Sounds like a great idea.

philosofool commented 2 years ago

That's a solid summary which probably covers a lot of use cases that are not mine!

DennisL68 commented 2 years ago

That would require Jupyter to use one file per cell, or else you would just end up with a lot of file conflicts when merging the different notebook files, right?

If working by the agile principle "Stop starting, Start finishing", you could try to work with only one cell at a time?

microsoft / vscode-jupyter

Create git commits by cell #8471