Closed danielecook closed 7 years ago
Hi Daniel, well picked up! It's assumed both files have the same columns.
If they didn't, then I wonder how we'd want to represent it in the diff. Perhaps we could say that values changed from None to something, for every row (or vice versa).
Do you have a use case yourself for this?
More just kind of playing around. The biggest case that I can think of would be where someone adds or removes a column in a csv file (or renames one). I wound up including the module and made a few minor changes (I hope you don't mind!) in a tool I am developing. It's a little utility for viewing csv diffs in git repos:
Right now it's not very sophisticated. In the screenshot there, it's treating a column rename as the addition and removal of a bunch of cells.
Nice! I think treating column renames as addition and removal wouldn't be too difficult. Detecting renames automatically might be more expensive though.
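To illustrate why automatic rename detection costs more than treating a rename as a removal plus an addition, here is a rough sketch. It is not part of csvdiff: `guess_renames` is a hypothetical helper, and it assumes both files list their rows in the same order and that rows are dicts keyed by column name. Every removed column is compared against every added column across all rows, which is where the extra expense comes from.

```python
def guess_renames(lhs_rows, rhs_rows, removed, added):
    """Pair each removed column with an added column that holds identical values.

    Hypothetical helper: assumes lhs_rows and rhs_rows are lists of dicts
    in the same row order.
    """
    renames = {}
    unmatched = set(added)
    for old in removed:
        old_values = [row.get(old) for row in lhs_rows]
        # sorted() copies the set, so discarding from it below is safe
        for new in sorted(unmatched):
            if old_values == [row.get(new) for row in rhs_rows]:
                renames[old] = new
                unmatched.discard(new)
                break
    return renames
```

A column that was renamed and also had some of its values edited would not be matched by this exact comparison, so it would still show up as one removed and one added column.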
I don't have time to put together a patch this week, but contributions are more than welcome.
Great - I'm a little flooded as well, but I'll see if I can find some time in the next few weeks. In terms of the patch data structure, how would you modify it to account for added / removed columns? Setting the "to" to None?
Yeah, I reckon added and deleted columns go to and from None for every row, and let's say you can't change the index columns for now. That keeps the patch format basically the same.
Thinking about it, I've realised that the whole project is doing a row-based diff, which is why operations on whole columns aren't so natural. But row-level is pretty useful for tons of applications.
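As a minimal sketch of the "to and from None" idea, a per-row comparison could read missing columns as None instead of indexing them directly. The function name `record_diff_with_none` and the "from"/"to" key names are assumptions for illustration (only "to" is mentioned in this thread), and this is not csvdiff's actual implementation.

```python
def record_diff_with_none(lhs, rhs):
    """Return {field: {'from': old, 'to': new}} for every field that differs.

    A column missing on one side is reported as None on that side,
    so an added column goes from None and a removed column goes to None.
    """
    fields = sorted(set(lhs) | set(rhs))
    return {
        field: {"from": lhs.get(field), "to": rhs.get(field)}
        for field in fields
        if lhs.get(field) != rhs.get(field)
    }
```

Because CSV values are read as strings, a missing column (None) stays distinguishable from an empty cell (""), at least under the assumption that the rows come from something like `csv.DictReader`.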
Gonna close this, and accept that we're row-based instead of column-based.
Hi, I would like to reopen this ticket, as csvdiff currently raises an exception (KeyError in patch.py, record_diff, line 264) when your rhs (new file) has a column that does not exist in the lhs (base file), or vice versa.
I am using csvdiff to analyze the output of some database tests and there are some rare occasions where columns will be added, removed or renamed. As I am only comparing files I don't need a patch file to convert my files.
I was thinking about checking the header line of the csv for added and removed columns (renamed columns are removed and added under a new name) and, if there are any changes, just skipping the csvdiff analysis completely. But then I discovered that with only a little rewriting, you can fix the missing key error, and the output is exactly what I was looking for.
I am accepting that csvcompare is row-based, but still there shouldn't be a Python error when you compare files with different columns. What do you think?
I opened pull request #34 and so far all the tests on the ci-server still work 😀
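The header check described above could look roughly like the following, using only the standard `csv` module. `header_changes` is a hypothetical name, and the sketch assumes both inputs are open text files (or file-like objects) whose first row is the header.

```python
import csv
import io


def header_changes(lhs_file, rhs_file):
    """Return (added, removed) column names by comparing the two header rows."""
    lhs_header = next(csv.reader(lhs_file), [])
    rhs_header = next(csv.reader(rhs_file), [])
    removed = [col for col in lhs_header if col not in rhs_header]
    added = [col for col in rhs_header if col not in lhs_header]
    return added, removed


# Usage: a renamed column shows up as one removal plus one addition.
added, removed = header_changes(
    io.StringIO("id,name\n1,a\n"),
    io.StringIO("id,label,age\n1,a,30\n"),
)
```

If either list is non-empty, a caller could skip the row-level comparison or report the column changes separately.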
Just ran into the same KeyError that @friederschueler describes. I understand the solution he proposes might not apply, but a Python traceback is hardly a graceful way to die.
It doesn't look like you can execute diffs when you've added or removed a column from a csv unless I am missing something. Perhaps this would be useful to implement? (Happy to help!)