paulfitz / daff

align and compare tables
https://paulfitz.github.io/daff

Changed line appears as added + removed #91

Open gwarnes-mdsol opened 7 years ago

gwarnes-mdsol commented 7 years ago

The daff comparison algorithm improperly marks a row with changed data as an added/removed pair.

For instance, comparing the CSV files 'iris.csv' and 'iris2.csv' (via the edwindj/daff R wrapper), I get the following diff:

@@  Sepal.Length    Sepal.Width Petal.Length    Petal.Width Species
... ... ... ... ... ...
    5.7 2.8 4.1 1.3 versicolor
->  6.3 3.3 6   2.5 virginica->XXX
+++ 5.8 2.7 5.1 1.9 XXX
--- 5.8 2.7 5.1 1.9 virginica
->  7.1 3   5.9 2.1 virginica->XXX
->  6.3 2.9 5.6 1.8 virginica->XXX
->  6.5 3   5.8 2.2 virginica->XXX
->  7.6 3   6.6 2.1 virginica->XXX
->  4.9 2.5 4.5 1.7 virginica->XXX
->  7.3 2.9 6.3 1.8 virginica->XXX
->  6.7 2.5 5.8 1.8 virginica->XXX
->  7.2 3.6 6.1 2.5 virginica->XXX
->  6.5 3.2 5.1 2   virginica->XXX
->  6.4 2.7 5.3 1.9 virginica->XXX
->  6.8 3   5.5 2.1 virginica->XXX
->  5.7 2.5 5   2   virginica->XXX
->  5.8 2.8 5.1 2.4 virginica->XXX
->  6.4 3.2 5.3 2.3 virginica->XXX
->  6.5 3   5.5 1.8 virginica->XXX
->  7.7 3.8 6.7 2.2 virginica->XXX
->  7.7 2.6 6.9 2.3 virginica->XXX
->  6   2.2 5   1.5 virginica->XXX
->  6.9 3.2 5.7 2.3 virginica->XXX
->  5.6 2.8 4.9 2   virginica->XXX
->  7.7 2.8 6.7 2   virginica->XXX
->  6.3 2.7 4.9 1.8 virginica->XXX
->  6.7 3.3 5.7 2.1 virginica->XXX
->  7.2 3.2 6   1.8 virginica->XXX
->  6.2 2.8 4.8 1.8 virginica->XXX
->  6.1 3   4.9 1.8 virginica->XXX
->  6.4 2.8 5.6 2.1 virginica->XXX
->  7.2 3   5.8 1.6 virginica->XXX
->  7.4 2.8 6.1 1.9 virginica->XXX
->  7.9 3.8 6.4 2   virginica->XXX
->  6.4 2.8 5.6 2.2 virginica->XXX
->  6.3 2.8 5.1 1.5 virginica->XXX
->  6.1 2.6 5.6 1.4 virginica->XXX
->  7.7 3   6.1 2.3 virginica->XXX
->  6.3 3.4 5.6 2.4 virginica->XXX
->  6.4 3.1 5.5 1.8 virginica->XXX
->  6   3   4.8 1.8 virginica->XXX
->  6.9 3.1 5.4 2.1 virginica->XXX
->  6.7 3.1 5.6 2.4 virginica->XXX
->  6.9 3.1 5.1 2.3 virginica->XXX
+++ 5.8 2.7 5.1 1.9 XXX
--- 5.8 2.7 5.1 1.9 virginica
->  6.8 3.2 5.9 2.3 virginica->XXX
->  6.7 3.3 5.7 2.5 virginica->XXX
->  6.7 3   5.2 2.3 virginica->XXX
->  6.3 2.5 5   1.9 virginica->XXX
->  6.5 3   5.2 2   virginica->XXX
->  6.2 3.4 5.4 2.3 virginica->XXX
->  5.9 3   5.1 1.8 virginica->XXX

As you can see, the pair of lines

+++ 5.8 2.7 5.1 1.9 XXX
--- 5.8 2.7 5.1 1.9 virginica

are shown as an addition + deletion, when they are actually a change in a single column.
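Both +++/--- pairs in the diff above carry the same values (5.8 2.7 5.1 1.9), so one thing worth checking is whether the input contains that row more than once. The following is a minimal Python sketch for that check, not part of daff: the function name repeated_rows and the inline sample are made up for illustration, and whether repeated rows actually trigger the behaviour is an assumption rather than something established here.

```python
import csv
import io
from collections import Counter


def repeated_rows(csv_file):
    """Return {row: count} for data rows that occur more than once.

    `csv_file` is any open text file (or file-like object) with a header row.
    """
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    counts = Counter(tuple(row) for row in reader)
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical inline sample standing in for a real file such as iris.csv
sample = io.StringIO(
    "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species\n"
    "5.8,2.7,5.1,1.9,virginica\n"
    "7.1,3,5.9,2.1,virginica\n"
    "5.8,2.7,5.1,1.9,virginica\n"
)
print(repeated_rows(sample))
```

For a file on disk, something like `with open("iris.csv", newline="") as f: repeated_rows(f)` would do the same; any row-number column would need to be dropped first, since it makes every row unique.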

For some large files (though not this one), I see trios or more complex patterns of added/deleted/modified lines, where changes to the values in two or more rows are displayed as a mix of modifications to unmatched rows combined with additions + deletions. Something like:

+++ 5.8     2.7     5.1     1.9     XXX
->  6.8->5.8    3.2->2.7    5.9->5.1    2.3->1.9    virginica->XXX
--- 5.8     3.2     5.1     1.9     virginica

paulfitz commented 7 years ago

Hi @gwarnes-mdsol, could you do me a favor and attach the .csv files, or forward them by email? (my email address is attached to my github profile). Thanks!

gwarnes-mdsol commented 7 years ago

Sorry about that. BTW, github doesn't like the extension .csv so I added .txt to make it happy.

iris.csv.txt iris2.csv.txt

paulfitz commented 7 years ago

Thanks for the files. From the command line, with daff iris.csv.txt iris2.csv.txt, I'm unfortunately not seeing the same diff; it gives -> updates everywhere. There was an extra column that looked like a row number, but removing it wasn't sufficient to replicate either. How hard would it be to talk me through replicating this in R?

gwarnes-mdsol commented 7 years ago

Hi Paul, it is pretty simple to replicate in R. I'll try to take some time tomorrow to write brief instructions. In the meantime, installing R would be the first step :-) http://r-project.org

gwarnes-mdsol commented 7 years ago

Here's the R code to replicate:

install.packages("devtools")
devtools::install_github("edwindj/daff")
library(daff)
iris2 <- iris
levels(iris2$Species)[3] <- "XXX"
df <- diff_data(iris, iris2)
df
render_diff(df)

(Note that the last command, render_diff(df), generates and displays an HTML page with additional features that might be worth moving into your codebase.)

And the output on my system:

gwarnes@F5KSH06HF9VN:/tmp$ R

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> devtools::install_github("edwindj/daff")
Downloading GitHub repo edwindj/daff@master
from URL https://api.github.com/repos/edwindj/daff/zipball/master
Installing daff
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/jsonlite_1.4.tgz'
Content type 'application/x-gzip' length 1077372 bytes (1.0 MB)
==================================================
downloaded 1.0 MB

Installing jsonlite
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/gc/c3c2p5_d4td159rblkqbp4s1xjhdh_/T/RtmpPi3Rqh/devtoolsc4de4a4865b/jsonlite'  \
  --library='/Users/gwarnes/Library/R/3.3/library' --install-tests

* installing *binary* package ‘jsonlite’ ...
* DONE (jsonlite)
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/V8_1.4.tgz'
Content type 'application/x-gzip' length 2304654 bytes (2.2 MB)
==================================================
downloaded 2.2 MB

Installing V8
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/Rcpp_0.12.10.tgz'
Content type 'application/x-gzip' length 3020988 bytes (2.9 MB)
==================================================
downloaded 2.9 MB

Installing Rcpp
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/gc/c3c2p5_d4td159rblkqbp4s1xjhdh_/T/RtmpPi3Rqh/devtoolsc4de5618b221/Rcpp'  \
  --library='/Users/gwarnes/Library/R/3.3/library' --install-tests

* installing *binary* package ‘Rcpp’ ...
* DONE (Rcpp)
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/gc/c3c2p5_d4td159rblkqbp4s1xjhdh_/T/RtmpPi3Rqh/devtoolsc4de284a1564/V8'  \
  --library='/Users/gwarnes/Library/R/3.3/library' --install-tests

* installing *binary* package ‘V8’ ...
* DONE (V8)
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/gc/c3c2p5_d4td159rblkqbp4s1xjhdh_/T/RtmpPi3Rqh/devtoolsc4de22b005a4/edwindj-daff-a5a97e1'  \
  --library='/Users/gwarnes/Library/R/3.3/library' --install-tests

* installing *source* package ‘daff’ ...
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (daff)
> library(daff)
> iris2 <- iris
> levels(iris2$Species)
[1] "setosa"     "versicolor" "virginica"
> levels(iris2$Species)[3] <- "XXX"
> df <- diff_data(iris, iris2)
> df
Daff Comparison: ‘iris’ vs. ‘iris2’
  First 6 and last 6 patch lines:
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
... ...          ...         ...          ...         ...
    5.7          2.8         4.1          1.3         versicolor
->  6.3          3.3         6            2.5         virginica->XXX
+++ 5.8          2.7         5.1          1.9         XXX
--- 5.8          2.7         5.1          1.9         virginica
->  7.1          3           5.9          2.1         virginica->XXX
... ...          ...         ...          ...         ...
->  6.7          3.3         5.7          2.5         virginica->XXX
->  6.7          3           5.2          2.3         virginica->XXX
->  6.3          2.5         5            1.9         virginica->XXX
->  6.5          3           5.2          2           virginica->XXX
->  6.2          3.4         5.4          2.3         virginica->XXX
->  5.9          3           5.1          1.8         virginica->XXX

> render_diff(df)
>

selcham commented 7 years ago

Hi @paulfitz,

I think I'm facing the same issue here. The update does not seem to work with the same use case.

Example:

I've tried playing with the --id flag, but didn't manage to find a way to make it work consistently.

Any ideas? Thanks

FYI, I'm using daff cli 1.3.25 (JS)

gwarnes-mdsol commented 7 years ago

I dropped a line in the R code above. I've fixed it above, but I'm also posting it here for clarity:

install.packages("devtools")
devtools::install_github("edwindj/daff")
library(daff)
iris2 <- iris
levels(iris2$Species)[3] <- "XXX"
df <- diff_data(iris, iris2)
df
render_diff(df)

selcham commented 7 years ago

Hi @paulfitz, do you think you'll have time to look at this issue? Thanks

paulfitz commented 7 years ago

https://twitter.com/miketaylr/status/873175465321783296

gwarnes-mdsol commented 7 years ago

Simple Example:

ir table:

"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.8,2.7,5.1,1.9,"virginica"
5.8,2.7,5.1,1.9,"virginica"

ir2 table:

"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.8,2.7,5.1,1.9,"XXX"
5.8,2.7,5.1,1.9,"XXX"
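The two tables above can also be written to disk so the comparison can be tried outside R. A minimal Python sketch, assuming the file names ir.csv and ir2.csv (both hypothetical, as is the helper write_table); it only reproduces the rows shown above and does not call daff itself:

```python
import csv
from pathlib import Path

HEADER = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"]
MEASUREMENTS = [5.8, 2.7, 5.1, 1.9]  # the row repeated in both tables


def write_table(path, species, copies=2):
    """Write a CSV with `copies` identical rows; only Species varies between tables."""
    with open(path, "w", newline="") as f:
        # QUOTE_NONNUMERIC quotes the strings and leaves the numbers bare,
        # matching the layout of the tables shown above.
        writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
        writer.writerow(HEADER)
        for _ in range(copies):
            writer.writerow(MEASUREMENTS + [species])
    return Path(path)


ir_path = write_table("ir.csv", "virginica")   # hypothetical file name
ir2_path = write_table("ir2.csv", "XXX")       # hypothetical file name
print(ir_path.read_text())
print(ir2_path.read_text())
```

The resulting files could then be passed to the command-line tool in the same form used earlier in the thread (daff ir.csv ir2.csv), which may help show whether the behaviour is specific to the R wrapper.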

Comparison:

> diff_data(ir, ir2)
Daff Comparison: 'ir' vs. 'ir2' 
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
+++ 5.8          2.7         5.1          1.9         XXX      
+++ 5.8          2.7         5.1          1.9         XXX      
--- 5.8          2.7         5.1          1.9         virginica
--- 5.8          2.7         5.1          1.9         virginica

miachenmtl commented 7 years ago

I'm getting a similar problem with columns. In a table where some columns duplicate the data of other columns, changing a column header shows up as an added and deleted column, even if the renamed column does not itself contain duplicated data. To reproduce with the bridge example on the demo page: change the Designer column so that it's identical to the Bridge column in both the original and the modified version. Then, in the modified version, change Length to something like Span. The Length/Span column appears as added/removed.