jqnatividad / qsv

Blazing-fast Data-Wrangling toolkit
https://qsv.dathere.com
The Unlicense

Allow `qsv diff` to show only fields that differ #2000

Open mfripp opened 3 months ago

mfripp commented 3 months ago

Is your feature request related to a problem? Please describe.

In CSV files with many columns, it is difficult and error-prone to find the particular fields that differ between dropped and added rows. Doing so requires carefully scanning across the output in a grid-oriented CSV viewer.

Describe the solution you'd like

One possible solution would be to add a --drop-identical-fields flag (or something similar) that replaces fields that are identical between a "-" and a "+" row with either empty values or a marker like "(same)". Then, before the results are output, any column without changes (i.e., a column made up entirely of empty fields or "(same)" markers) would be dropped. The output file would then contain only the key columns and the data columns that actually differ, and even in those it would show values only where there are differences. This would make it easy to see exactly what data differs between the two files.
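As a rough illustration of the behaviour proposed above, here is a minimal Rust sketch. It is not qsv or csv-diff code: the function name `drop_identical_fields`, the in-memory row pairs, and the leading `diffresult` column are assumptions made for the example, and it assumes every row has as many fields as there are headers.

```rust
/// Marker written in place of a field that is identical on both sides.
/// The proposal also mentions "(same)" as an alternative.
const SAME: &str = "";

/// Hypothetical helper: given the headers, the indices of the key columns,
/// and pairs of (left, right) rows for modified records, return the reduced
/// headers and the "-"/"+" output rows.
fn drop_identical_fields(
    headers: &[&str],
    key_cols: &[usize],
    pairs: &[(Vec<&str>, Vec<&str>)],
) -> (Vec<String>, Vec<Vec<String>>) {
    // Keep a column if it is a key column or differs in at least one pair.
    let keep: Vec<usize> = (0..headers.len())
        .filter(|&i| key_cols.contains(&i) || pairs.iter().any(|(l, r)| l[i] != r[i]))
        .collect();

    // Assumed output layout: a sign column followed by the kept columns.
    let mut out_headers = vec!["diffresult".to_string()];
    out_headers.extend(keep.iter().map(|&i| headers[i].to_string()));

    let mut rows = Vec::with_capacity(pairs.len() * 2);
    for (left, right) in pairs {
        for (sign, row) in [("-", left), ("+", right)] {
            let mut out = vec![sign.to_string()];
            for &i in &keep {
                // Key columns are always shown; other fields only when they differ.
                let show = key_cols.contains(&i) || left[i] != right[i];
                out.push(if show { row[i].to_string() } else { SAME.to_string() });
            }
            rows.push(out);
        }
    }
    (out_headers, rows)
}
```

With headers `id,name,city,zip`, key column `id`, and two modified records where only `city` changes in one and only `name` in the other, the `zip` column is dropped entirely and each remaining row shows a value only in the field that changed.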

Describe alternatives you've considered

One alternative is to open the result in a spreadsheet and add flags to indicate where differences occur, but this is cumbersome. Currently I just scan visually across pairs of rows, but this is also cumbersome and error-prone.

Another option might be to output a sort of "patch" format, with one row per differing field. This could be a table where the first n fields are the index values, the next field is called "column" and holds the name of the field that differed, the next field is called "left_value" and holds the value of this field from the left file, and the final field is called "right_value" and holds the value from the right file. That might be clearer (no risk of conflict with existing empty fields or fields that already say "(same)"), but I'm not sure it's better.
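The "patch" layout described above could look roughly like the following Rust sketch. The function name `patch_rows` and the in-memory input shape are hypothetical and only serve the illustration; the output column names "column", "left_value" and "right_value" are taken from the description. It assumes every row has as many fields as there are headers.

```rust
/// Hypothetical helper: turn pairs of (left, right) rows into a long-format
/// table with one output row per differing field. The first returned row is
/// the header: the key column names, then column, left_value, right_value.
fn patch_rows(
    headers: &[&str],
    key_cols: &[usize],
    pairs: &[(Vec<&str>, Vec<&str>)],
) -> Vec<Vec<String>> {
    let mut header: Vec<String> = key_cols.iter().map(|&k| headers[k].to_string()).collect();
    header.extend(["column", "left_value", "right_value"].map(String::from));

    let mut out = vec![header];
    for (left, right) in pairs {
        for (i, name) in headers.iter().enumerate() {
            // Skip key columns and fields that are identical on both sides.
            if key_cols.contains(&i) || left[i] == right[i] {
                continue;
            }
            // Key values first, then the field name and both values.
            let mut row: Vec<String> = key_cols.iter().map(|&k| left[k].to_string()).collect();
            row.extend([*name, left[i], right[i]].map(String::from));
            out.push(row);
        }
    }
    out
}
```

Because each difference gets its own row with explicit left and right values, an empty field or a literal "(same)" in the data cannot be mistaken for a marker.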

Another option that might be better would be to use color to highlight the columns that are actually different, at least when output is sent to a TTY. This would be similar to the display in the GNU version of the diff command, VS Code's diff view, Apple's FileMerge viewer or vim -d file1 file2.
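A minimal sketch of the colour idea, assuming plain ANSI escape codes and a hypothetical `render_pair` helper (not part of qsv). The caller decides whether colour is wanted; in a real tool that decision could come from something like `std::io::IsTerminal` on stdout. CSV quoting is deliberately ignored here to keep the example short.

```rust
// Standard ANSI SGR escape sequences.
const RED: &str = "\x1b[31m";
const GREEN: &str = "\x1b[32m";
const RESET: &str = "\x1b[0m";

/// Hypothetical helper: render a "-" line and a "+" line for one modified
/// record, wrapping only the differing fields in colour when `use_color`
/// is true. Fields are joined with commas without any quoting.
fn render_pair(left: &[&str], right: &[&str], use_color: bool) -> (String, String) {
    // Colour a value only if colour is enabled and it differs from the other side.
    let paint = |value: &str, other: &str, color: &str| -> String {
        if use_color && value != other {
            format!("{}{}{}", color, value, RESET)
        } else {
            value.to_string()
        }
    };

    let minus: Vec<String> = left
        .iter()
        .zip(right)
        .map(|(l, r)| paint(*l, *r, RED))
        .collect();
    let plus: Vec<String> = left
        .iter()
        .zip(right)
        .map(|(l, r)| paint(*r, *l, GREEN))
        .collect();

    (format!("-,{}", minus.join(",")), format!("+,{}", plus.join(",")))
}
```

With `use_color` set to false the output is plain text, so redirecting to a file or a pipe would not pick up escape codes.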

Additional context

(none)

jqnatividad commented 3 months ago

Thanks for the well thought-out feature request @mfripp !

Copying in @janriemer - csv-diff's maintainer...

janriemer commented 3 months ago

Thank you, @mfripp, for the detailed description and thoughts on this feature (and @jqnatividad for making me aware of it)!
I really like the possible solution you've described, and I feel this should have the highest priority among the next features for the diff command.

@jqnatividad Can you please assign this issue to me? Thank you.

The possibility of getting the fields that are different is actually already in the implementation of diff - it is just not used yet (waiting on a feature request like yours 😉): https://github.com/jqnatividad/qsv/blob/08cfda6383e6ff70e683df5a77b6b2ef6530c4d9/src/cmd/diff.rs#L245-L251

So it shouldn't be too difficult to implement your idea (famous last words?). 🙂

Unfortunately, I've been a bit busy lately, so I don't have the time right now. 😢
However, I should have more time in mid to late August, so I can start implementing a prototype then. 🤞

With regard to your alternative solutions

janriemer commented 2 months ago

Hey @jqnatividad @mfripp :wave:

here is the current status of the feature requests in this issue

For the other feature requests, it is probably best to create separate issues, so that we don't lose track of them.

mfripp commented 2 months ago

Thanks, this is great to see!

jqnatividad commented 2 months ago

Just merged #2114 ... just in time for qsv 0.134.0! Thanks @janriemer !