
vkbo / novelWriter

novelWriter is an open source plain text editor designed for writing novels. It supports a minimal markdown-like syntax for formatting text. It is written with Python 3 (3.9+) and Qt 5 (5.15) for cross-platform support.

https://novelwriter.io

GNU General Public License v3.0


Version Control #383

Open vkbo opened 4 years ago

vkbo commented 4 years ago

It may be worthwhile to set up git integration for novel projects via pygit2. Only basic support is needed for a version history, possibly just a single branch with a linear history.

johnblommers commented 4 years ago

YES to that.

What I find to be missing from (all?) novel writing software is an easy way to compare the current version of the manuscript with a prior version, side-by-side. With an easy way to navigate through the versions until you can see the visual diff you're looking for.

I keep my textual documents in a local git repository mostly because I hate the feeling of suddenly realizing that a chapter has gone missing and not being able to find it again. But it's not easy to find the change given the commit comments are the main guidance.

vkbo commented 4 years ago

It has always been my intention to add this at some point. Personally, I keep all my projects in a local git repo, but it would be nice to have a way to quickly scroll through and look at various versions.

Still, implementing this in a clean way may be tricky due to the main project file also containing settings.

vkbo commented 4 years ago

Just making some notes of how to potentially implement this.

Technical

The technical implementation is fairly simple as most of the heavy lifting is done with git via pygit2 and the Python standard library difflib.

The feature is enabled as an option in a "Project Maintenance" (or whatever) dialog that also lists existing backups.
The feature uses git under the hood, and will create a repository containing only the contents folder. This effectively means it will only track changes to files, and not care about whether they have been deleted from the project or not. This will keep the file meta data static, but the content will have a change history.
Deleted files that the user wants to restore will lose the meta data, but the data can be set again by the user. I don't think this is a major issue. Restored files will behave a bit like current "Orphaned Files", but both features could be merged into a "Restored Files" root folder and feature that serves both purposes.
The user should be able to manually checkpoint (commit) via the dialog GUI, selecting which files to checkpoint. Under the hood this will generate a commit of the selected files onto a specially named branch. The commit message should be entered in a user-controlled edit box, and the history can then easily be listed with a single call to git log.
A first iteration of the feature should not support branches. Everything is committed to a single branch.
A simple diff tool should be able to show the changes of a single file. This is easy to add since the Python difflib package already has all the features needed.
Restoring a file to an older version can be done fairly easily with git reset.
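
As a rough illustration of the diff part of these notes, here is a minimal sketch using only difflib from the standard library. The function name diff_versions and the labels "(checkpoint)" and "(current)" are made up for this example and are not taken from novelWriter.

```python
import difflib


def diff_versions(old_text: str, new_text: str, name: str = "document") -> str:
    """Return a unified diff between two versions of one document."""
    # keepends=True keeps the newlines so the diff can be joined back as-is
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"{name} (checkpoint)",  # hypothetical label
        tofile=f"{name} (current)",  # hypothetical label
    )
    return "".join(diff)


old = "The valley was green.\nRain fell all night.\n"
new = "The valley was green.\nSnow fell all night.\n"
print(diff_versions(old, new, "scene"))
```

Two identical versions produce an empty string, which a dialog could use to hide checkpoints that did not touch the file.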

Ease of Use

This can be implemented in two fairly simple and intuitive dialogs. One for browsing file history of present and deleted files, and one for making checkpoints of the project and reviewing the record of past files that have been removed (two tabs).

Checkpointing

The user will only see the option to checkpoint files that have been changed since last checkpoint. The files will be displayed in a list with the file path and a checkbox (essentially git status -s). An optional "select all" option should be available too.
The checkpoint takes a single line edit box for a commit message.
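
The notes above rely on git to find changed files. Purely to illustrate the idea of "changed since last checkpoint" without a git dependency, the sketch below swaps in a comparison of content hashes. The function names and the idea of keeping a dictionary of hashes per checkpoint are assumptions for this example.

```python
import hashlib
from pathlib import Path


def content_hashes(folder: Path) -> dict[str, str]:
    """Map each .nwd file name in the folder to a SHA-256 of its content."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(folder.glob("*.nwd"))
    }


def changed_since(checkpoint: dict[str, str], current: dict[str, str]) -> list[str]:
    """List files that are new or modified compared to the checkpoint."""
    # A file missing from the checkpoint yields None and so counts as changed
    return [name for name, digest in current.items() if checkpoint.get(name) != digest]
```

The list returned by changed_since is what the dialog would show with checkboxes. Files that were deleted since the checkpoint are deliberately not reported here.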

Restoring Tracked Files

A "view file history" should pop open a dialog with a list of checkpoints for a file, and a diff window that will show the difference between the checkpoint and current file for each selection.
The user can then choose to copy/paste text from the dialog back into the editor if they just want to restore a discarded paragraph, or choose to restore the entire file to an earlier state.

Deleted Files

Deleted files should appear in a list in the same dialog (different tab?) as the checkpoint feature.
Selecting the file there should pop the Restore dialog so the user can select which version to restore.
The restored file should appear in a "Restored Files" root folder.

References

Git resetting and diff: https://stackoverflow.com/questions/215718/how-can-i-reset-or-revert-a-file-to-a-specific-revision
Pygit2: https://www.pygit2.org/

johnblommers commented 4 years ago

All good. Let me throw the following at the wall:

Consider the following use case: I've just realized that I've lost a scene that describes a mountain valley. I know which file used to contain it. I have no idea when I deleted that text. I know it was about four paragraphs long. I would recognize it at a glance. So I'd like to be able to zip through all of the commits (perhaps using ↑ and ↓) until my eye catches the four missing paragraphs.

novelWriter can reduce friction by:

Shifting the focus in the commits to the first delta and letting the user skip easily from change to change until "aha!"
Providing a pixel map of the document highlighting the deltas as in the screenshot below
Letting the user search for the missing text so novelWriter limits the number of commits the user needs to pore through

[Image: Screenshot from 2020-08-07 12-51-35]

Does this stick to the wall?

vkbo commented 4 years ago

As long as you know which file it was, that feature would be covered by the diff view. The diff view would be for a single file, showing the entire history of checkpoints (commits).

Searching, I don't know. I suppose an automated walk through the history could achieve that. Something to consider.
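
An automated walk through the history could look something like the sketch below. It assumes the versions of one file have already been loaded as (label, text) pairs. How they are loaded (git, copies on disk) is left open, and the function name is made up for this example.

```python
def find_versions_containing(history: list[tuple[str, str]], needle: str) -> list[str]:
    """Return the labels of the versions whose text contains the needle.

    history is a list of (label, text) pairs; matching ignores case.
    """
    needle = needle.casefold()
    return [label for label, text in history if needle in text.casefold()]


history = [
    ("checkpoint 3", "The road went on."),
    ("checkpoint 2", "The mountain valley lay below. The road went on."),
    ("checkpoint 1", "The Mountain Valley lay below."),
]
print(find_versions_containing(history, "mountain valley"))
```

This only narrows down which checkpoints to look at. The user would still open the diff view for the matching ones.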

timotheos-firestone commented 4 years ago

I am new to novelWriter and starting to use it because I also really need to version control my novel :)

Personally, I just want to see the data format to be more portable and human-readable so that I can more easily version control it manually, even if there is no Git integration into the application (to start with).

For example, I would like to see:

the .nwd files named something more semantic (e.g. 1.4 scenetitle.md, so that it orders correctly by default in a filesystem)
the JSON format including newlines, so that each option is on a new line, so it doesn't register as an entire file change when one small part is updated (I'm looking at the tagsIndex.json)

There might be more than this, but these are the immediate things I'm seeing as roadblocks for my use case.

vkbo commented 4 years ago

Hi, and thanks for the comment and feedback.

The project file structure isn't meant to be human readable. The file structure itself is version control friendly (and file sync friendly), but the filenames aren't. That is why integration is being considered. I'm also considering other ways to achieve file versioning.

As for your two points.

The .nwd extension is a leftover from a previous iteration of novelWriter where the documents were saved in a mixed format of meta data and html. The files aren't pure markdown now either, so I am reluctant to rename them to .md.

However, I expect the main issue here is the filename itself, which is just the internal hash key of the file. The reason the file is saved this way is that it is much simpler to keep track of the files and where they belong in the project tree when the file names are static and unique. There is no need to consider file name collisions, and no need to rename the actual files when the user changes the document name.

There are no plans to change the file structure to mirror the project structure at this time.

There are two ways to identify files though:

The first line of the file contains the title and type of file.
The ToC files in the main folder contain an overview of all files and what they are named inside novelWriter.

Your second point is a bit more straightforward to answer. The project structure is stored in the nwProject.nwx file, which is XML with line breaks exactly for the purpose of version control. This is the important file. The one-line JSON files are cache files. You shouldn't version control them at all. Just add *.json to your .gitignore file or equivalent.

The ToC.json file contains exactly the same as ToC.txt.
The guiOptions.json remembers dialog window sizes, columns widths, and last button states for various GUI options that are not regular preference settings.
The tagsIndex.json file is a cache of the project index. It is rebuilt automatically if it's lost, and it contains no unique information.

If you run novelWriter from command line with the --debug flag, line breaks are inserted into all the JSON files to make them human readable for debugging.

vkbo commented 4 years ago

I've spent some time thinking about this. Also, see discussion in Issue #259 which covers the folder structure and format.

There is a possible compromise here. When novelWriter opens a project, all files in the contents folder are listed and processed. As long as the first 13 characters remain static and unique like now, I can in principle add the document title after this. The scan while opening will be able to recognise the file based on the first 13 characters, and just store the file system name in a map for the duration of the writing session.
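
A minimal sketch of such a scan is shown below. It assumes the handle is 13 lowercase hexadecimal characters and that anything between the handle and the .nwd extension is a free-form title. The names HANDLE_PREFIX and scan_contents are hypothetical and this is not how novelWriter currently works.

```python
import re
from pathlib import Path

# Assumption: 13 lowercase hex characters, optionally followed by a title
HANDLE_PREFIX = re.compile(r"([0-9a-f]{13})(.*)\.nwd")


def scan_contents(contents: Path) -> dict[str, Path]:
    """Map each document handle to the file that currently carries it."""
    found: dict[str, Path] = {}
    for path in sorted(contents.iterdir()):
        match = HANDLE_PREFIX.fullmatch(path.name)
        if match:
            # Only the first 13 characters identify the document
            found[match.group(1)] = path
    return found
```

The returned map would only need to live for the duration of the writing session, as described above.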

I don't particularly like this idea, but it is a possible solution. It may cause issues with git versioning too since git considers a file rename to actually be a new file (although the diff may present it otherwise in most cases), which may or may not break version history if the file has been changed a lot while also being renamed.

Edit:

I will also switch on indentation permanently on the JSON files (also when not in debug mode). The cost is a larger file size, and slightly slower load time for the index. But it isn't really a big deal as it all loads in milliseconds anyway.
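
To show why the indentation matters for version control, here is a small sketch that diffs two versions of a dictionary serialised with and without indentation. The data and the function name changed_lines are invented for the example and have nothing to do with the real content of tagsIndex.json.

```python
import difflib
import json


def changed_lines(old: dict, new: dict, indent=None) -> list[str]:
    """Return the added and removed lines of a line-based diff."""
    old_lines = json.dumps(old, indent=indent, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=indent, sort_keys=True).splitlines()
    diff = difflib.ndiff(old_lines, new_lines)
    # Keep additions and removals, drop unchanged lines and "? " hint lines
    return [line for line in diff if line.startswith(("+ ", "- "))]


old = {"a": 1, "b": 2, "c": 3}
new = {"a": 1, "b": 5, "c": 3}
print(changed_lines(old, new))            # the whole one-line file is replaced
print(changed_lines(old, new, indent=2))  # only the line holding "b" changes
```

With the one-line format every edit replaces the single line that holds the entire file. With indentation the diff is limited to the entries that actually changed.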

timotheos-firestone commented 4 years ago

Ah, okay, I was going to mark the cache directory as ignored, because I thought all the cache files would be in there. Should some of the JSON files be in that directory, too? I would suggest potentially renaming nwProject.nwx to nwProject.xml but that's an aside and certainly not blocking anything :)

I'm not sure how the software works, but I was just looking at the data format in the filesystem. I tried manually changing the filenames so that they were alphanumerically ordered in the filesystem, and updated their first line contents and the XML, and it appears I could do this successfully:

000000ba97b1a (title page)
001000334d1cc (chapter 1)
0010019935f3e (chapter 1, scene 1)
001002796495e (chapter 1, scene 2)

but I couldn't use strings.

Could the handle="000000ba97b1a" become an alphanumeric string type to allow hacking so I can rename the files manually to something more meaningful to me?

Alternatively (more ideally for me), I noticed all the files in content duplicate their handle in the first %%~ line of the file. Could the program ignore the filenames and instead (on start-up) just scan all the nwd files in the content directory and look for the unique handle in the first line of the file? It could then make a map in its cache of the actual filenames and handles (or maybe save in ToC.json). This would presumably allow me (and the software) to rename the files to something more semantic :)

I actually use Mercurial as a version control system because it tracks file renames in the commits. I'd prefer to be able to keep this, which is why I'm not too fussed about the integration of Git :)

vkbo commented 4 years ago

> Ah, okay, I was going to mark the cache directory as ignored, because I thought all the cache files would be in there. Should some of the JSON files be in that directory, too?

Possibly. The cache folder is used for slightly different things though, and was used for duplicates of files for redundancy purposes in earlier versions. Currently, it's only used by the build tool.

> I would suggest potentially renaming nwProject.nwx to nwProject.xml but that's an aside and certainly not blocking anything :)

The .nwx extension is associated with novelWriter. The setup script can install a mimetype for it that launches novelWriter when you double-click the file in the host OS. Using the .xml extension would significantly complicate this, and gain nothing.

> I'm not sure how the software works, but I was just looking at the data format in the filesystem. I tried manually changing the filenames so that they were alphanumerically ordered in the filesystem, and updated their first line contents and the XML, and it appears I could do this successfully:
>
> 000000ba97b1a (title page)
> 001000334d1cc (chapter 1)
> 0010019935f3e (chapter 1, scene 1)
> 001002796495e (chapter 1, scene 2)
>
> but I couldn't use strings.

Any 13 character hexadecimal string is acceptable to novelWriter, so it is technically possible to do this, although it bypasses all the internal functionality in novelWriter that creates and handles new documents and makes sure they are created in the correct place with the correct meta data set. These renamed files will be unknown to novelWriter and considered "orphaned". But when you launch the app, you will be able to move them around into your project tree and the app will try to set the correct meta data.

Any file that isn't 13 hexadecimals plus the extension .nwd will be ignored.
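
The three outcomes described here (known document, orphaned, ignored) can be sketched as follows. This is an illustration only and not the code in project.py. It assumes lowercase hexadecimal handles, and the function name classify_files is made up.

```python
import re

# Assumption: handles are lowercase hexadecimal; the comment above only says
# "13 hexadecimals plus the extension .nwd"
DOC_FILE = re.compile(r"([0-9a-f]{13})\.nwd")


def classify_files(file_names, known_handles):
    """Sort file names into known documents, orphans, and ignored files."""
    result = {"known": [], "orphaned": [], "ignored": []}
    for name in file_names:
        match = DOC_FILE.fullmatch(name)
        if match is None:
            # Not 13 hex characters plus .nwd, so it is skipped entirely
            result["ignored"].append(name)
        elif match.group(1) in known_handles:
            result["known"].append(name)
        else:
            # Valid name, but the project file has no entry for this handle
            result["orphaned"].append(name)
    return result
```

Here known_handles stands for the set of handles listed in the project file.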

(You can have a look at the project.py and document.py files in https://github.com/vkbo/novelWriter/tree/main/nw/core if you're interested in what's happening under the hood.)

> Alternatively (more ideally for me), I noticed all the files in content duplicate their handle in the first %%~ line of the file. Could the program ignore the filenames and instead (on start-up) just scan all the nwd files in the content directory and look for the unique handle in the first line of the file? It could then make a map in its cache of the actual filenames and handles (or maybe save in ToC.json). This would presumably allow me (and the software) to rename the files to something more semantic :)

The contents of the files are not loaded when the project is opened. All the information about a project, sans raw text, is stored in the nwProject.nwx file. That is where the file handles are associated with specific positions in the project tree structure, plus all the other settings associated with a document. When a document is loaded in the app, the file with the corresponding handle is opened into the editor window. That is all. The first line in the file is there to assist restoring a lost file, or to add back in a previously deleted file. It is a redundancy and the line is otherwise just ignored.

Anyway, it seems to me you're trying to use novelWriter as an editor of a collection of markdown files. That is not at all what it is intended for. What you're wanting to do here basically defeats the entire purpose of novelWriter and why I wrote it in the first place. I used to write large documents and projects (like draft novels, my theses, etc) by doing exactly what you want here: create folders of numbered files. Either .md, .tex or otherwise. It's a clunky way to manage documents because manually numbering them means you have to renumber them when you change the order. I wrote novelWriter specifically to abstract away that by organising the files in a database that you could ignore and instead manage the structure in a separate data file (the .nwx XML file) in a GUI that kept all of that in sync.

Various applications for writing text manage data files in mainly two ways. Either all the text is packed in a zip file, like LibreOffice and Microsoft Office (they use multiple XML files in a zip archive), or they do what novelWriter does: keep the data in a data hierarchy in a folder. Scrivener does this, and so do several other similar writing tools. It is a fairly robust approach, which is why I made the same choice.

vkbo commented 4 years ago

The tagsIndex.json file now has newlines and indents also when not running in debug mode. This makes it possible to version control all files in the project folder without creating large diffs. I would still avoid tracking the cache folder though.

johnblommers commented 3 years ago

The other day I stumbled across the Diffuse GitHub repository. Diffuse can display the differences among N files at one time. It has only one missing feature, word wrap. There is also a Git entanglement.

My question is, is there anything about this project - written in Python :-) that might benefit novelWriter? For example could the user launch Diffuse from within novelWriter and have its N files comparison pop up?

My other question is: But for the missing word wrap feature I think I really like Diffuse. Do you suppose this is an easy addition?

No pressure but for the lack of word wrap I think Diffuse is the cat's meow for comparing many versions of a chapter in one display.

vkbo commented 3 years ago

I'm leaning towards implementing versioning by simply keeping multiple copies of each novelWriter document in a versions folder. If so, sending them to an external diff tool is probably fairly straightforward. I was also planning on making my own built-in diff tool based on Python's difflib.

I still haven't gotten around to starting this as I had a huge burst of effort on nW at the beginning of the year, and needed a break. But I also want the versioning for my own projects, so I want to start on it soon.

I can even implement a stage 1 solution that will just be able to preview the older version of the document in the viewer next to the current version in the editor. Then the user can copy/paste over text if needed. It's a starting point anyway.

johnblommers commented 3 years ago

Taking a break after a huge burst is smart.

My big takeaway from Diffuse is the ability to display N versions along with their deltas at once. Current tools max out at three versions.

The use case I envision is choosing a particular scene and asking novelWriter to display the last N versions with diffs highlighted. Possibly with an option to then scroll left and right when more than N versions are in the project. You'd scroll up and down when the scene won't fit vertically and all N versions would scroll in unison.

As far as I know from researching this, Diffuse comes close but lacks the word wrap feature. It's a tool meant for developers with short lines of code, I guess.

vkbo commented 3 years ago

I've been working on the version tool on and off for a while now, and I'm starting to converge on a suitable solution that I hope is useful without being overly complicated.

The features implemented so far are:

A new tab in the tree part of the main window where the saved versions for the currently open document are visible.
The first time a document is opened in a session, a temporary version copy is made of the unaltered version. This means that any changes made in a given session can always be undone, and compared to the current version.
A permanent version copy can be made manually at any time, with a version note.
Any version document can be opened in the document viewer alongside the editor.

Further features to be added:

The ability to restore any versioned document and have it replace the current document, optionally with the automatic saving of the current version before the restored version replaces it. This ensures that the restoration can be undone easily.
A diff view, either in the document viewer if it turns out to be practically possible, or alternatively in a separate diff window. I need to test this to find out what works best. I know that Python's difflib is capable of generating an html diff, but I'm not sure the viewer can show it properly because of the limitations of the QTextBrowser widget.
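
For reference, generating the HTML diff with difflib is a short call, sketched below. Whether QTextBrowser renders the resulting table and its CSS classes properly is exactly the open question and is not answered by this example. The function name and the two column descriptions are made up.

```python
import difflib


def html_diff_table(old_text: str, new_text: str) -> str:
    """Return an HTML table showing the two versions side by side."""
    # wrapcolumn breaks long prose lines, which code-oriented diffs rarely need
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_table(
        old_text.splitlines(),
        new_text.splitlines(),
        fromdesc="Saved version",  # hypothetical column title
        todesc="Current document",  # hypothetical column title
    )
```

make_table returns only the table element, while make_file on the same class returns a complete HTML page including the styles.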

Implementation:

The session version of a document is saved in a versions folder with the same file name as the main document file, but with the .bak extension instead.
Each permanent version is saved in a subfolder for each document under versions with the same filename, but with a session ID added as well as an incrementing number in case multiple versions are generated in one session.
All of the practical handling of the versions of a document is done by the main document class, which is already always used for reading and writing these documents and parsing the meta data contained in them. Thus everything is implemented in a single place (aside from the GUI changes).
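
The folder layout described above could be sketched roughly like this. The exact file name pattern for permanent versions (handle, session ID and counter joined by underscores) is an assumption made for this example, as are the function names. The real implementation lives in the document class and may differ.

```python
import shutil
from pathlib import Path


def session_copy(doc: Path, versions: Path) -> Path:
    """Copy the document to versions/<name>.bak, replacing any older copy."""
    versions.mkdir(parents=True, exist_ok=True)
    target = versions / (doc.stem + ".bak")
    shutil.copy2(doc, target)
    return target


def permanent_copy(doc: Path, versions: Path, session_id: str) -> Path:
    """Copy the document into versions/<name>/ with session ID and counter."""
    folder = versions / doc.stem
    folder.mkdir(parents=True, exist_ok=True)
    number = 1
    # Assumed naming pattern: <name>_<session>_<number>.nwd
    while (folder / f"{doc.stem}_{session_id}_{number}{doc.suffix}").exists():
        number += 1
    target = folder / f"{doc.stem}_{session_id}_{number}{doc.suffix}"
    shutil.copy2(doc, target)
    return target
```

The caller is expected to call session_copy only the first time a document is opened in a session, so the .bak file always holds the unaltered text.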

xahodo commented 1 year ago

Scrivener (yes, the eternal scrivener) has a really simple solution for this. Instead of storing a full diff of the entire project, you can (at your leisure) make a snapshot of each individual document and attach a comment to each snapshot.

This may mean that one scene has no snapshots of a previous state (because you didn't need any), while the next might have five. The advantage of this is that you can work on each document independently. Reverting a document to a previous state needn't affect the entire project. Another option would be to store the snapshot information (comment, timestamp/file version) in a separate file from the main text.

How are the snapshots stored? I don't know, but I guess the full document simply gets stored, possibly with compression (zip?) applied.

This eliminates the need for a full blown revision control system, and makes things easier on the writer.

A snapshot could, for example, simply be a zipped JSON file with the text, the comment for the snapshot, and a timestamp/version number. No need to make things more complicated than they need to be. The choice to zip the snapshots could be a configuration option (some people have short scenes, others have extremely long scenes). Another option is to implement a threshold (KB? word count? user configurable?) above which compression happens.
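
A quick sketch of what I mean, to make it concrete. It uses gzip rather than a zip archive since there is only one file per snapshot, and the field names, the 4096 byte threshold and the function names are just placeholders I picked. I don't know how Scrivener or novelWriter would actually store this.

```python
import gzip
import json
from datetime import datetime, timezone


def pack_snapshot(text: str, comment: str, threshold: int = 4096) -> bytes:
    """Serialise a snapshot as JSON, gzip-compressed above the threshold."""
    record = {
        "text": text,
        "comment": comment,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    raw = json.dumps(record, ensure_ascii=False).encode("utf-8")
    # Short scenes stay as plain JSON, long ones are compressed
    return gzip.compress(raw) if len(raw) > threshold else raw


def unpack_snapshot(data: bytes) -> dict:
    """Load a snapshot, detecting gzip by its two-byte magic number."""
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return json.loads(data.decode("utf-8"))
```

Because plain JSON always starts with a brace, the reader can tell the two cases apart without any extra flag in the file name.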

vkbo commented 1 year ago

> Scrivener (yes, the eternal scrivener) has a really simple solution for this. Instead of storing a full diff of the entire project, you can (at your leisure) make a snapshot of each individual document and attach a comment to each snapshot.

That is exactly what my attempted implementation did. I plan to give this a try again at some point. I bet Scrivener just creates a copy of the file, assigns it a new UUID, and links it via meta data.

For full snapshots, the backup feature already does it. Of course, keeping track of them is a bit trickier as there may be many versions.

In any case, I plan to do something here, but I may wait until I've sorted out the project storage feature #977. Zipping isn't really a big point as these are just text files that take up next to no space.