stuckyb / ontopilot

15 stars 2 forks source link

better version control support for spreadsheet files #78

Open stuckyb opened 7 years ago

stuckyb commented 7 years ago

Spreadsheet files are typically binary blobs, which aren't well supported by the standard version control workflow. The most important loss is the ability to see what changed from one version of a file to the next.

I recently became aware of the SYmbolic LinK (SYLK) file format, a plain-text format for spreadsheet files that supports cell formatting and formulas, unlike CSV. In theory then, SYLK might be preferable to binary formats like XLSX and ODS since it supports a similar feature set but would work better with version control software. Excel and Libre/OpenOffice both support it natively.

Another option is to build filters that can be plugged into git that would support better diffing of spreadsheets. This would not be a perfect solution, because files would still be stored as binary blobs, but it would address the most important usability problems.

stuckyb commented 6 years ago

I am experimenting with option 2, described in the original comment. The results are promising. I've written a simple text filter, called "tables2txt", based on the existing OntoPilot tablereader architecture, that converts spreadsheet files to plain CSV text, which then allows git to use its usual diff routines. Informal testing suggests that this works fairly well. Here are a few shortcomings:

  1. When changes to a file involve only formatting changes, git diff will report no differences. Fortunately, git still detects that the file has changed, so the mechanics of version control work fine. This also confirms that the diff filter is not used to detect file changes.
  2. The CSV output only includes cell values, not formulas. Thus, if formulas change but the resulting values do not, git diff will report that the files are the same.
stuckyb commented 6 years ago

Another option for a custom git filter is to use ExcelCompare, which generates diffs of spreadsheet documents and supports both Excel and OpenDocument formats. I did some tests with ExcelCompare, and although it seemed promising, there were at least two major problems:

  1. Tests with Excel spreadsheets were very fast, but tests with OpenDocument spreadsheets were very slow (e.g., >40 minutes before I terminated the program). I suspect that the problem has to do with interpreting cells that only have formatting specifications as non-empty (see the comments in the OntoPilot TableReader code for more details). I did not inspect the source code to confirm this, though.
  2. The diffing algorithm appears to be a simple cell-by-cell comparison; that is, it is unable to detect changes such deleting a single row, e.g. This is fine for many use cases, but in instances where one or more rows are added or deleted, it results in an explosion of output that is almost impossible to interpret.

For the above reasons, I concluded that ExcelCompare was not a viable solution.

stuckyb commented 6 years ago

One relatively minor annoyance with the "tables2txt" git diff filter solution is that it is relatively slow due to the long startup time for jython. I recently discovered that git can cache the results of "textconv" filters, and this has the potential to improve diff performance.

Using the spreadsheet described in the "Using OntoPilot with version control" wiki page, I did three timing runs for git diff example_classes.xlsx without caching. The best of these completed in 18.118 seconds.

I repeated the timing exercise after turning on textconv caching (git config diff.tables2txt.cachetextconv true) and running the command once to seed the cache. This cut the runtime by half, down to 8.958 seconds. The run time was only cut by half because git was still launching "tables2txt" every time for the current (non-commited) version of the file.

I then commited the changes to the spreadsheet file and diffed the versions from the last two commits (git diff HEAD~1 HEAD example_classes.xlsx). After the initial caching run, this cut the runtime down to 0.007 seconds. So caching can provide a huge performance boost and, until we figure out how to improve script launch time or rewrite tables2txt as pure Java, I should add caching to the OntoPilot documentation for working with git.