simonw / csv-diff

Python CLI tool and library for diffing CSV and JSON files
Apache License 2.0

Add support for multi-column --key values #17

Open jsvine opened 3 years ago

jsvine commented 3 years ago

These modifications allow users to pass multiple comma-separated columns as the --key, for cases where rows are uniquely identified by a combination of columns, such as the county and the state. For example:

csv-diff --key=state,county a.csv b.csv

Any number of columns can be used. In my experience, datasets keyed on a combination of columns are fairly common.
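To illustrate the idea, here is a minimal sketch of indexing rows by a composite key. It is not the code from this PR: `load_rows_by_key` is a hypothetical helper, and the sample data is made up for the example.

```python
import csv
import io


def load_rows_by_key(fp, key):
    """Index CSV rows by one or more comma-separated key columns.

    Hypothetical helper for illustration; not part of csv-diff's API.
    """
    key_columns = key.split(",")
    rows = {}
    for row in csv.DictReader(fp):
        # A tuple of the key column values identifies the row.
        composite = tuple(row[column] for column in key_columns)
        rows[composite] = row
    return rows


# Two counties share a name, so "county" alone would not be unique.
sample = io.StringIO(
    "state,county,population\n"
    "TX,Orange,100\n"
    "CA,Orange,200\n"
)
indexed = load_rows_by_key(sample, "state,county")
```

With a single-column key of `county`, the second row would overwrite the first; with `state,county` both rows are kept under distinct keys.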

I aimed to make this implementation as simple as possible. As such, it doesn't handle one particular edge case: columns whose names contain a comma. My instinct is that this could be handled by adding a --key-sep option, through which the user could pass an arbitrary string to serve as the separator. E.g.:

csv-diff --key="Column Name, With A Comma::Column 2" --key-sep="::" a.csv b.csv

... and then passing that argument to load_csv/load_json. But I figured I'd raise the possibility here first before mucking around too much in the code.
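As a rough sketch of how the proposed separator could work (an assumption about one possible shape, not existing csv-diff behavior), the key string would be split on the user-supplied separator instead of a hard-coded comma. `split_key` and its `key_sep` parameter are hypothetical names:

```python
def split_key(key, key_sep=","):
    """Split a --key value into column names using a configurable separator.

    Hypothetical helper; the default of "," mirrors the comma-separated
    behavior described above.
    """
    return key.split(key_sep)


default_columns = split_key("state,county")
custom_columns = split_key("Column Name, With A Comma::Column 2", key_sep="::")
```

With the default separator the first call yields the two columns `state` and `county`; with `key_sep="::"` the comma inside the first column name is left intact.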

jsvine commented 3 years ago

And I meant to say: Thanks for such an elegant and useful repo/tool! The code was a pleasure to read.

jsvine commented 3 years ago

Ah, and while tinkering to scratch my own itch, I failed to recognize that something similar was proposed in #1!

This PR takes a slightly different approach (sticking with a single --key option, rather than multiple), based on my own personal preferences. No offense taken if you opt for the other one.