larsyencken / csvdiff

Generate a diff between two tabular datasets expressed in CSV files.
BSD 3-Clause "New" or "Revised" License
Separator choice #14

ariedeleen closed 8 years ago

ariedeleen commented 8 years ago

Is it possible to change the default separator from a comma to, for example, a semicolon?

larsyencken commented 8 years ago

The current release doesn't allow different delimiters, but I just put together a quick patch to allow it with --sep, for example csvdiff --sep=';' id a.csv b.csv. You'd have to check out master to try it out though.

Let me know if you need any help with it.

ariedeleen commented 8 years ago

Wow, thanks Lars for the speedy implementation ;-) I'll look into it and let you know. Arie

Quick question: can I do a pip install git+https://github.com/larsyencken/tests/csvdiff.git to 'pull' the master? So --sep will be optional.

larsyencken commented 8 years ago

You can install it with:

pip install -e git+https://github.com/larsyencken/csvdiff.git#egg=csvdiff

Then you can check you have the right version:

$ csvdiff --help
Usage: csvdiff [OPTIONS] INDEX_COLUMNS FROM_CSV TO_CSV

  Compare two csv files to see what rows differ between them. The files are
  each expected to have a header row, and for each row to be uniquely
  identified by one or more indexing columns.

Options:
  --style [compact|pretty|summary]
                                  Instead of the default compact output,
                                  pretty-print or give a summary instead
  -o, --output PATH               Output to a file instead of stdout
  -q, --quiet                     Don't output anything, just use exit codes
  --sep TEXT                      Separator to use between fields [default:
                                  comma]
  --help                          Show this message and exit.

Notice the --sep option is now in amongst the documented options.

ariedeleen commented 8 years ago

Tried csvdiff --sep=';' 'Manufacturer partnumber' ykoonold.csv ykoonnew.csv -o difference.csv

But got this typical python error: TypeError: "delimiter" must be string, not unicode

Console output:

(pyenv)[arie@dev temp]$ csvdiff --sep=';' 'Manufacturer partnumber' ykoonold.csv ykoonnew.csv -o difference.csv
Traceback (most recent call last):
  File "/home/arie/pyenv/bin/csvdiff", line 9, in <module>
    load_entry_point('csvdiff', 'console_scripts', 'csvdiff')()
  File "/home/arie/pyenv/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/arie/pyenv/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/arie/pyenv/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/arie/pyenv/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/arie/pyenv/src/csvdiff/csvdiff/__init__.py", line 147, in csvdiff_cmd
    compact=compact, sep=sep)
  File "/home/arie/pyenv/src/csvdiff/csvdiff/__init__.py", line 158, in _diff_files_to_stream
    diff = diff_files(from_csv, to_csv, index_columns, sep=sep)
  File "/home/arie/pyenv/src/csvdiff/csvdiff/__init__.py", line 41, in diff_files
    from_records = records.load(from_stream, sep=sep)
  File "/home/arie/pyenv/src/csvdiff/csvdiff/records.py", line 21, in load
    return _safe_iterator(csv.DictReader(istream, delimiter=sep))
  File "/usr/local/lib/python2.7/csv.py", line 79, in __init__
    self.reader = reader(f, dialect, *args, **kwds)
TypeError: "delimiter" must be string, not unicode
(pyenv)[arie@odoo8dev temp]$

larsyencken commented 8 years ago

Ah, some tests were failing for python2.7 which tox didn't bring up. I found and patched the problem.

Want to try again?
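For reference, this class of error on Python 2.7 is usually avoided by coercing the separator to the native str type before passing it to the csv module. The snippet below is a minimal sketch of that idea, not necessarily the patch that was applied; the helper names native_delimiter and load are hypothetical.

```python
import csv
import io
import sys


def native_delimiter(sep):
    """Return sep as the native str type that the csv module expects.

    Hypothetical helper: on Python 2 a unicode separator is encoded to a
    byte string; on Python 3 str is already the right type.
    """
    if sys.version_info[0] == 2 and not isinstance(sep, str):
        return sep.encode("ascii")
    return sep


def load(istream, sep=","):
    # DictReader takes the field names from the header row.
    return list(csv.DictReader(istream, delimiter=native_delimiter(sep)))


# A semicolon-separated sample, passed in with a unicode separator.
rows = load(io.StringIO(u"id;name\n1;apple\n2;pear\n"), sep=u";")
```

On Python 3 the helper is a no-op, so the same code path works on both interpreters.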

ariedeleen commented 8 years ago

Thanks, I will look into it asap. I'm busy with another project right now ;-) And in the Netherlands it is spring break vacation time, and today we celebrate our king's birthday. Everything is dressed up in orange. Funny Dutchmen.

Question: can it diff large CSV files? Let's say 750,000 lines and 12 columns, is that possible? Or could it take days to get a result?

larsyencken commented 8 years ago

Haha, sounds nice :) In Stockholm it snowed a few days in a row, then rained a few days. Not yet happy spring weather.

With large CSV files, it just has to fit into memory. If you were diff'ing two files of the size you mentioned, and there were no changes, it might take 30s. If there are lots of changes, maybe a few minutes? It should still work.

ariedeleen commented 8 years ago

Works like a charm ;) It csvdiff-ed the two large files mentioned before pretty fast, in about 52s. Is it also possible to get CSV output back, with an extra column at the end marking each row as removed, added, or changed (r, a, c for short)?

larsyencken commented 8 years ago

Glad it works! Unfortunately, I can't look at the extra column idea right now. But, if you have a programming background, you could try using the csvdiff API and generating it yourself. Otherwise, you might have to rely on the statistics from --style=summary, or just reading the JSON output. Best of luck!
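As a rough starting point, the extra-column output can also be sketched with only the standard library csv module. This is not csvdiff's implementation and does not use its API. The function names are hypothetical, and the sketch assumes a single index column, identical headers in both files, and that both files fit into memory.

```python
import csv
import io


def load_indexed(stream, index_column, sep=","):
    """Read a CSV stream into {index value: row dict}, plus the header."""
    reader = csv.DictReader(stream, delimiter=sep)
    rows = {row[index_column]: row for row in reader}
    return reader.fieldnames, rows


def write_status_csv(from_stream, to_stream, out_stream, index_column, sep=","):
    """Write added/removed/changed rows with a trailing status column.

    Status codes: 'a' = added, 'r' = removed, 'c' = changed.
    Unchanged rows are skipped.
    """
    header, old = load_indexed(from_stream, index_column, sep)
    _, new = load_indexed(to_stream, index_column, sep)

    writer = csv.writer(out_stream, delimiter=sep, lineterminator="\n")
    writer.writerow(header + ["status"])

    for key in sorted(set(old) | set(new)):
        if key not in old:
            row, status = new[key], "a"
        elif key not in new:
            row, status = old[key], "r"
        elif old[key] != new[key]:
            row, status = new[key], "c"  # emit the new version of the row
        else:
            continue
        writer.writerow([row[name] for name in header] + [status])


# Made-up sample data, using ';' as the separator.
old = io.StringIO("id;name;price\n1;apple;3\n2;pear;4\n3;plum;5\n")
new = io.StringIO("id;name;price\n1;apple;3\n2;pear;6\n4;kiwi;2\n")
out = io.StringIO()
write_status_csv(old, new, out, "id", sep=";")
result = out.getvalue()
```

For changed rows the sketch writes the new version of the row; writing both the old and new versions would be an equally reasonable choice.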