larsyencken / csvdiff

Generate a diff between two tabular datasets expressed in CSV files.
BSD 3-Clause "New" or "Revised" License

accounting for columns #9

Closed danielecook closed 7 years ago

danielecook commented 8 years ago

It doesn't look like you can execute diffs when you've added or removed a column from a csv unless I am missing something. Perhaps this would be useful to implement? (Happy to help!)

larsyencken commented 8 years ago

Hi Daniel, well picked up! It's assumed both files have the same columns.

If they didn't, then I wonder how we'd want to represent it in the diff. Perhaps we could say that values changed from None to something, for every row (or vice versa).

Do you have a use case yourself for this?

danielecook commented 8 years ago

More just kind of playing around. The biggest case that I can think of would be where someone adds or removes a column in a csv file (or renames one). I wound up including the module and made a few minor changes (I hope you don't mind!) in a tool I am developing. It's a little utility for viewing csv diffs in git repos:

[Screenshot, 2015-11-02 at 3:37:45 pm]

Right now it's not very sophisticated. In the screenshot there, it's treating a column rename as the addition and removal of a bunch of cells.

larsyencken commented 8 years ago

Nice! I think treating column renames as addition and removal wouldn't be too difficult. Detecting renames automatically might be more expensive though.
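To illustrate why automatic rename detection costs more, here is a rough sketch. It is not part of csvdiff; `detect_renames` and its arguments are hypothetical, and it assumes rows are dicts that appear in the same order in both files. Every removed column has to be compared against every added column across all rows.

```python
def detect_renames(lhs_rows, rhs_rows, removed, added):
    """Pair each removed column with an added column holding identical values.

    Hypothetical helper: assumes both row lists are in the same order.
    Cost grows with len(removed) * len(added) * number of rows.
    """
    renames = {}
    for old in removed:
        old_values = [row[old] for row in lhs_rows]
        for new in added:
            if new in renames.values():
                continue  # already matched to another removed column
            if old_values == [row[new] for row in rhs_rows]:
                renames[old] = new
                break
    return renames


lhs_rows = [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]
rhs_rows = [{"id": "1", "full_name": "alice"}, {"id": "2", "full_name": "bob"}]
print(detect_renames(lhs_rows, rhs_rows, ["name"], ["full_name"]))
# {'name': 'full_name'}
```

A rename that also changes some values would not be caught by this exact-match check, so it would still show up as an addition plus a removal.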

I don't have time to put together a patch this week, but contributions are more than welcome.

danielecook commented 8 years ago

Great - I'm a little flooded as well, but I'll see if I can find some time in the next few weeks. In terms of the patch data structure, how would you modify it to account for added / removed columns? Setting the "to" to None?

larsyencken commented 8 years ago

Yeah, I reckon added and deleted columns go to and from None for every row, and let's say you can't change the index columns for now. That keeps the patch format basically the same.

Thinking about it, I've realised that the whole project is doing a row-based diff, which is why operations on whole columns aren't so natural. But row-level is pretty useful for tons of applications.
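A minimal sketch of the "to and from None" idea, assuming each row is a dict keyed by column name. The function name `row_changes` and the `from`/`to` key names are assumptions for illustration, not csvdiff's actual code.

```python
def row_changes(lhs, rhs):
    """Field-level diff of two row dicts; a missing column reads as None."""
    changes = {}
    for field in sorted(set(lhs) | set(rhs)):
        old = lhs.get(field)  # None if the column is absent on the left
        new = rhs.get(field)  # None if the column is absent on the right
        if old != new:
            changes[field] = {"from": old, "to": new}
    return changes


old_row = {"id": "1", "name": "alice"}
new_row = {"id": "1", "name": "alice", "email": "a@example.com"}
print(row_changes(old_row, new_row))
# {'email': {'from': None, 'to': 'a@example.com'}}
```

With this approach an added column produces one such change in every row, which is why whole-column operations look noisy in a row-based diff.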

larsyencken commented 7 years ago

Gonna close this, and accept that we're row-based instead of column-based.

friederschueler commented 6 years ago

Hi, I would like to reopen this ticket as currently csvdiff will raise an exception (KeyError in patch.py, record_diff, line 264) when your rhs (new file) has a column that does not exist in the lhs (base file) or vice versa.

I am using csvdiff to analyze the output of some database tests and there are some rare occasions where columns will be added, removed or renamed. As I am only comparing files I don't need a patch file to convert my files.

I was thinking about checking the header line of the csv for added and removed columns (renamed columns are removed and added under a new name) and, if there are any changes, just skipping the csvdiff analysis completely. But then I discovered that with only a little rewriting, you can fix the missing key error and the output is exactly what I was looking for.
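The header check described above could look roughly like this. It is a sketch using only the standard library; `column_changes` is a hypothetical name, and it assumes the first line of each file is the header.

```python
import csv
import io


def column_changes(lhs_file, rhs_file):
    """Return (added, removed) column names between two open CSV files."""
    lhs_cols = next(csv.reader(lhs_file), [])  # header row, or [] if empty
    rhs_cols = next(csv.reader(rhs_file), [])
    added = [c for c in rhs_cols if c not in lhs_cols]
    removed = [c for c in lhs_cols if c not in rhs_cols]
    return added, removed


lhs = io.StringIO("id,name,age\n1,alice,30\n")
rhs = io.StringIO("id,full_name,age\n1,alice,30\n")
added, removed = column_changes(lhs, rhs)
print(added, removed)
# ['full_name'] ['name']
```

A renamed column shows up here as one removal plus one addition, so a caller could skip the row-level diff whenever either list is non-empty.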

I accept that csvdiff is row-based, but there still shouldn't be a Python error when you compare files with different columns. What do you think?

I opened pull request #34 and so far all the tests on the CI server still pass 😀

halsafar commented 5 years ago

Just ran into the same KeyError as @friederschueler explains. I understand the solution he proposes might not apply but a Python traceback is hardly a graceful way to die.