sctweedie / csvdiff3

3-way diff/merge tools for CSV files
MIT License
9 stars 4 forks source link

csvdiff3 : 3-way diff/merge tools for CSV files.

The csvdiff3 package offers the csvmerge3 command, which performs intelligent 3-way merging between CSV files. 2- and 3-way diff are planned but are not currently part of the package.

csvmerge3 can:

The package also includes a csvhooks CLI tool to provide basic CSV validation/formatting for use in git hooks and filters.

Getting started

Use

python3 setup.py install [--user]

to install the package.

How to use csvmerge3

A 3-way merge involves merging together two separate sets of changes to the same original file. csvmerge3 identifies the changes versus the original file and attempts to apply both sets of changes to the output; where it detects a conflict, it will emit a conflict marker similar to those output by normal "diff".

The original file is termed the "LCA": Latest Common Ancestor, and we refer to the two sets of changes to be merged as the A and B sides of the merge. The LCA, A and B files are supplied to csvmerge3 as filenames on the command line. Output may be to stdout, or may be sent to a given file with the -o filename option.

csvmerge3 operates on CSV files where the first row of the file is a header, providing names for each column of the file. One of these columns must be identified as the primary key. The merge will attempt to match together lines from each file based on the primary key, even if those lines appear in a different order in different source files (although it will try to preserve the order in the input files as much as possible.)

csvmerge3 will identify the following changes and attempt to merge them together:

In all cases,

NB csvmerge3 does not create a conflict if a column is deleted entirely in one side of the merge, and that column's contents are modified on the other.

A conflict block could look like this:

name,count
apple,2
>>>>>> input @3 (banana)
>>>>>> count = 4
banana,4
====== input @3 (banana)
====== count = 5
banana,5
<<<<<<

The conflict block is the content between the first >>>>>> marker and the final <<<<<<. The line above the ====== markers represents the line as it appears in the A file; below that marker is the B version of the line.

The content indicates both the line number (@linenr) and key (banana) for the source lines in A and B; followed by a line for each field in which there is a conflict. In the example above, the A and B files have assigned two different changes to the "count" column for the line with name "banana", and the value of that field in A and B is listed before the contents of the full line itself.

If a line is deleted on one side of the merge, but modified on the other, then the missing line will be identified in its line-number marker, and the contents of any fields with conflict will appear as "None". For example

>>>>>> input Deleted @5
>>>>>> count = None

A field marker line in a conflict block will always consist of just one line. If a CSV row covers multiple lines in the input (because a newline has been included in a field by quoting/escaping), then that newline will appear as a "\n" in the conflict marker line. But in this case, the full output row will still appear exactly as it appears in the input, potentially taking more than one line of the output within the conflict block.

As with most merges, in order to complete the merge manually, the user is expected to search for all conflict markers and decide the correct content within them. This can often be done by selecting one of the original lines above or below the ====== markers, and deleting the rest of the conflict block; but sometimes the user may need to manually edit the line contents to select the desired field contents. csvmerge3 does not automate selection of which fields are wanted in the output.

Using csvmerge3 automatically within git

csvmerge3 can help with 3-way merge operations in git. Typically, these merges are performed by git when you request a git pull to merge two different branches together.

To configure git to use csvmerge3, you will need to define a merge driver in a .git/config file: eg.

[merge "csv_pkg"]
    name = 3-way CSV merge
    driver = csvmerge3 --key=\"Package Name\" --quote=all --reformat-all %O %A %B -o %A
    recursive = text

NB. this is not just a generic merge driver for all CSV files; csvmerge3 needs to know what the primary key is for a CSV file, so that it can identify which lines from different input files are supposed to match each other. You can have CSV files with different primary keys within a single git repository, but you will need to define a specific merge driver with the right primary key on the csvmerge3 command line for each such key that you want to use.

You must have a --key option defined for the csvmerge3 command in a merge driver. Optionally, you can also add --quote , --lineterminator or --reformat-all options to coerce output into a specific format.

Once the merge driver is defined, you can set up a line in a .gitattributes file to enable it for a given class of CSV files:

fruit.csv    merge=csv_fruit
*.csv        merge=csv_pkg

git rebase does not normally use 3-way merging, but it can be told to do so by using the -m option. In this case, it will use csvmerge3 if that is configured as the merge driver for a given file.

Utility commands to help manage CSV files in a git repo

The csvdiff3 package also comes with a csvhooks command that provides features usable within git hooks or filters.

One use case is to set up a "filter driver" to make sure that CSV files in the repository are always stored and compared in a consistent format (eg. with a specific line terminator, or with particular quoting rules).

A filter driver can also make git diff more readable by ignoring line-termination or quotation formating changes and showing only actual changed data.

The filter driver is set up with a block in .git/config similar to:

[filter "csv"]
    name = CSV normalisation filter
    smudge = cat
    clean = csvhooks --quote=all --lineterminator=unix reformat

Once the filter driver is defined, you can apply it to files of your choice with lines in .gitattributes files, eg.

*.csv   filter=csv

If you add a filter driver that changes the stored format of any files already present in the repository, you should perform a

git add --renormalize .

command to apply the formatting to the repo (those changes can then be committed to the repository in the usual manner.)