ropensci / unconf15

rOpenSci's San Francisco hackathon/unconf 2015
http://unconf.ropensci.org

Data 'diff' format #19

Open rdpeng opened 9 years ago

rdpeng commented 9 years ago

One thing I've always wanted is a 'diff' type output for datasets (let's say tabular datasets for now). When I use git to manage projects, changes to the datasets I use are difficult to visualize using the standard diff output, which is line based. That works when rows are changed but not when columns are added/deleted or transformations are made. Is there a way to categorize the types of changes that can be made to a dataset and then visualize them in a useful way?
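As a rough illustration of one category of change that a line-based diff shows poorly, the shell sketch below compares only the header rows of two CSV files and reports column names that disappeared or appeared. The function name `column_changes` is hypothetical, and the sketch assumes a simple CSV with a header line and no quoted commas.

```shell
# Hypothetical helper, not part of git or of any tool mentioned in this thread.
# Assumes: the first line is a header; no field contains a quoted comma.
column_changes() {
  local old_header new_header
  old_header=$(head -n 1 "$1")
  new_header=$(head -n 1 "$2")
  awk -v old_h="$old_header" -v new_h="$new_header" 'BEGIN {
    n_old = split(old_h, o, ",")
    n_new = split(new_h, n, ",")
    for (i = 1; i <= n_old; i++) in_old[o[i]] = 1
    for (i = 1; i <= n_new; i++) in_new[n[i]] = 1
    # Names found only in the old header were removed (or renamed).
    for (i = 1; i <= n_old; i++) if (!(o[i] in in_new)) print "- " o[i]
    # Names found only in the new header were added (or renamed).
    for (i = 1; i <= n_new; i++) if (!(n[i] in in_old)) print "+ " n[i]
  }'
}
```

Called on files whose headers are `a,b,c` and `a,c,d`, this prints `- b` and `+ d`. Matching by name alone cannot tell a renamed column from a dropped column plus an added one, which is part of why categorizing the change types matters.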

srvanderplas commented 9 years ago

I would be most interested in this as well.

rdpeng commented 9 years ago

Awesome! I'm not 100% sure how this would work, but I think for it to be useful it would have to sit on top of git and then maybe show how the dataset changes independent of git's own output. The issue there would then be efficiency....

ledell commented 9 years ago

@rdpeng, are you familiar with dat? It's a version control system for data, which feels very similar to git. At the moment, I think it tracks modifications by row only, but I am hoping that column-based diffs will be part of a future release. I agree that something that tracks a variety of transformations/modifications would be very useful.

jennybc commented 9 years ago

:+1: I was just talking to @gvwilson about this exact thing earlier this week….

jeroen commented 9 years ago

Also have a look at https://github.com/edwindj/daff

karthik commented 9 years ago

Thanks @ledell for mentioning Dat. I'm on my phone so this will be brief but I'll expand later. Dat can natively do diffs and ropensci has a rDat package in the works, waiting on Dat to come to beta (which is soon). I've invited the Dat project to join us and Karissa from their team will join us.

There are some issues with rDat that I'm hoping Jeroen will help resolve. But :100: to pursuing this idea. It should be easy to complete at the event.

rdpeng commented 9 years ago

Daff looks quite good actually, and seems to implement most of what I was thinking about.

One thing I was hoping to implement was something a bit more "intelligent" (and likely more constraining). So for example, if I transform a column by squaring it, is there a way to show that rather than just indicating that every value in the column changed? Perhaps a diff could be expressed via R code rather than something along the lines of the usual +/- diff format.
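To make the squared-column example concrete, here is a minimal shell sketch that works from only the old and new files: it tests each column against a tiny fixed list of candidate transformations (identity and squaring) and labels the column accordingly. The function name `guess_transform` is hypothetical, and the sketch assumes both files have the same columns in the same order, the same rows in the same order, numeric cells, and no quoted commas.

```shell
# Hypothetical helper: label each column as unchanged, squared, or changed.
# Assumes identical column and row order, numeric cells, no quoted commas.
guess_transform() {
  awk -F, '
    # First file: remember every cell by (row, column).
    NR == FNR { for (i = 1; i <= NF; i++) old[FNR, i] = $i; next }
    # Second file, header row: remember the column names.
    FNR == 1  { ncol = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    # Second file, data rows: rule out candidates that do not fit.
    {
      for (i = 1; i <= ncol; i++) {
        before = old[FNR, i] + 0
        after  = $i + 0
        if (after != before)          not_same[i] = 1
        if (after != before * before) not_squared[i] = 1
      }
    }
    END {
      for (i = 1; i <= ncol; i++) {
        if (!not_same[i])         print name[i] ": unchanged"
        else if (!not_squared[i]) print name[i] ": squared"
        else                      print name[i] ": changed"
      }
    }
  ' "$1" "$2"
}
```

A column containing only 0s and 1s fits both candidates and is reported as unchanged. A real tool would need a much larger candidate list and a tolerance for floating-point values, so this only shows the shape of the idea.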

gmbecker commented 9 years ago

@rdpeng your latest comment sounds like a provenance-tracking problem. Are you thinking this will be applied in a system aware of what is done, or does it need to work like diff, i.e., given two datasets and no extra information, tell me the differences?

jennybc commented 9 years ago

Pardon if this is slightly off-topic but I want to park these links in a few relevant places, like this thread.

Re: weaning people off of Excel for data inspection and cleaning. OpenRefine comes up a lot and is generally popular with people expecting a GUI. I had always thought it was only mouse driven, but that it wrote some sort of log file. Did not realize these logs are perhaps re-executable. But a recent Twitter conversation intrigues me and also alerted me to Ruby and Python wrappers around the underlying Refine API. @ostephens says:

rdpeng commented 9 years ago

I'm not sure, to be honest. I think I would need a brief discussion of the pros and cons of either one.

sjackman commented 9 years ago

I'm interested in this topic.

vsbuffalo commented 9 years ago

Me too! I really like this idea. In the Unix tradition, I think the best approach to an implementation might be a C or C++ library and command line tool (e.g. like curl). Then, we could maybe write a simple R wrapper.

gmbecker commented 9 years ago

Maybe, though that requires us to write it in C/C++ instead of the much easier R. There are benefits (though not that many to R users), but pretty major downsides too.

I would argue that - for prototyping algorithms and features, at least - implementing it initially in R is a more efficient use of our time.

Remember what Duncan always said: for every two lines of C you write, you introduce 3 bugs.

~G

bbest commented 9 years ago

There's a slick little Chrome extension, CSVHub, that visualizes daff-like differencing of a CSV from within GitHub:

sjackman commented 9 years ago

@bbest Wow! That's fantastic! Oddly it doesn't work with TSV files. I've opened an issue https://github.com/theodi/csvhub/issues/8 to request this feature.

sjackman commented 9 years ago

These are the git aliases that I use for diffing TSV and CSV files.

[alias]
    wdiff = diff --word-diff=plain
    wdiffc = diff --word-diff=color
    wdiffcsv = diff --word-diff=color --word-diff-regex=[^,]+

See https://github.com/sjackman/dotfiles/blob/master/.gitconfig#L3-L5
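For anyone wondering what the regex in wdiffcsv does: per the git-diff documentation, every non-overlapping match of `--word-diff-regex` counts as a word, and anything between matches is ignored when looking for differences. With `[^,]+`, each non-empty CSV field becomes one word. The sketch below imitates that tokenization with `grep -oE` purely for illustration. The function name `csv_words` is made up, and this is not how git implements it internally.

```shell
# Hypothetical helper: print the "words" that the regex [^,]+ picks out of
# each input line, one per output line. Illustration only, not git's code.
csv_words() {
  grep -oE '[^,]+'
}
```

For example, `printf '4,5,6\n' | csv_words` prints 4, 5 and 6 on separate lines. An empty field (as in `a,,c`) yields no word at all, so filling in a previously empty cell shows up as a pure insertion.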

jordansread commented 9 years ago

:+1:

jules32 commented 9 years ago

thanks for sharing @sjackman :)

karthik commented 9 years ago

Nice, @sjackman!

jeroen commented 9 years ago

Very cool. So you mentioned git uses wdiff under the hood to diff code by word as well?

okdistribute commented 9 years ago

I took some notes from our conversation today. Thanks for contributing to the workshop! https://github.com/karissa/dat-visualdiff/issues/1

bbest commented 9 years ago

Good one @sjackman! Here's my little play session with trying out this technique...

# add alias to git's config
git config --global alias.diffcsv "diff --word-diff=color --word-diff-regex=[^,]+"

# initialize repo
git init test_csv; cd test_csv

# 1st commit of test csv
echo -e 'a,b,c\n1,2,3\n4,5,6' > x.csv; cat x.csv
git add x.csv; git commit -m 'initial csv'

# modify csv: b->c, c->d, 4->8
echo -e 'a,c,d\n1,2,3\n8,5,6' > x.csv; cat x.csv 

# compare against previous commit
git diff x.csv 
git diffcsv x.csv 

# 2nd commit on modified csv: b->c, c->d, 4->8
git commit -a -m 'modified csv'

# modify csv: +e column with 0's
echo -e 'a,c,d,e\n1,2,3,0\n8,5,6,0' > x.csv

# compare against previous commit 
git diffcsv x.csv

# 3rd commit on modified csv: +e column with 0's
git commit -a -m 'modified csv again'

# look at history of commits
git log

# compare between specific commits of the csv (swapping from your git log output)
git diffcsv 56515ac..97bfd69 -- x.csv

sjackman commented 9 years ago

daff works really well!

jules32 commented 9 years ago

Following up with @sjackman and @bbest's examples: I moved @bbest's script into R since for us this kind of visual differencing would need to be portable (i.e., shareable with colleagues outside of your own terminal window).

Unfortunately RStudio doesn't do the color differencing (would that even be possible?) and in fact the display is not useful. What would further options be? @karissa?

Examples and full R script below.

Comparing git diffcsv x.csv from @bbest's code above run in Terminal and RStudio:

[Two images: the output of the command in Terminal and in RStudio]

R translation of @bbest's bash script above

# add alias to git's config
system('git config --global alias.diffcsv "diff --word-diff=color --word-diff-regex=[^,]+"')

# initialize repo
system('git init test_csv')
setwd('test_csv')  # a cd inside system() would not persist to later calls

# 1st commit of test csv
x = data.frame(a = c(1,4), b = c(2,5), c = c(3,6)); x
write.csv(x, 'x.csv', row.names = FALSE)

system("git add x.csv; git commit -m 'initial csv'")

# modify csv: b->c, c->d, 4->8
x = data.frame(a = c(1,8), c = c(2,5), d = c(3,6)); x
write.csv(x, 'x.csv', row.names = FALSE)

# compare against previous commit
system('git diff x.csv') 
system('git diffcsv x.csv') 

# 2nd commit on modified csv: b->c, c->d, 4->8
system("git commit -a -m 'modified csv'")

# modify csv: +e column with 0's
x = data.frame(a = c(1,8), c = c(2,5), d = c(3,6), e = c(0,0)); x
write.csv(x, 'x.csv', row.names = FALSE)

# compare against previous commit 
system('git diffcsv x.csv')

# 3rd commit on modified csv: +e column with 0's
system("git commit -a -m 'modified csv again'")

# look at history of commits
system('git log')

# compare between specific commits of the csv (swapping from your git log output)
system('git diffcsv a4c1add0..5cf47e62 -- x.csv')

sjackman commented 9 years ago

It's not as pretty, but you can use --word-diff=plain instead of --word-diff=color inside of RStudio. It'll use [-foo-] to indicate removed text and {+bar+} to indicate added text.

❯❯❯ git diff --word-diff=plain --word-diff-regex='[^,]+' foo.csv bar.csv
diff --git a/foo.csv b/bar.csv
index cbe2c72..46c6789 100644
--- a/foo.csv
+++ b/bar.csv
@@ -1,3 +1,3 @@
A,B,C
1,2,3
4,[-5-]{+7+},6
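Because the plain markers are ordinary text, the output can be post-processed. As a small sketch, the function below (hypothetical name `changed_cells`) reads word-diff output on standard input and prints each replaced value as `old -> new`. It only handles a removal immediately followed by an addition, as in the last line above, and ignores pure insertions or deletions. It relies on `grep -o` and `sed -E`, which both GNU and BSD tools provide.

```shell
# Hypothetical helper: list replaced values found in --word-diff=plain output.
# Only recognizes a [-old-] marker directly followed by a {+new+} marker.
changed_cells() {
  grep -oE '\[-[^]]*-\]\{\+[^}]*\+\}' |
    sed -E 's/^\[-(.*)-\]\{\+(.*)\+\}$/\1 -> \2/'
}
```

Piping the line `4,[-5-]{+7+},6` through `changed_cells` prints `5 -> 7`.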

sjackman commented 9 years ago

It's possible that RStudio could render the ANSI colour code escape sequences. Certainly no harm in opening an issue with a feature request.

bbest commented 9 years ago

By the way, pushing the test_csv repo created above to https://github.com/bbest/test_csv and viewing it in Google Chrome with CSVHub (https://chrome.google.com/webstore/detail/csvhub/dbemglgpbebafkibfncdpdmdikacingf) nicely renders the differences between the following csv commits:

  1. 56515ac initial csv

    [image] https://cloud.githubusercontent.com/assets/2837257/6907627/b417caec-d6ec-11e4-894b-8de6f330e278.png

  2. db8644a modified csv: b->c, c->d, 4->8

    [image] https://cloud.githubusercontent.com/assets/2837257/6907643/e060458e-d6ec-11e4-8ce4-252d4aa41a13.png

  3. 97bfd69 modified csv again: +e column with 0's

    [image] https://cloud.githubusercontent.com/assets/2837257/6907638/d4339040-d6ec-11e4-8353-86be4c223ff8.png

And now for the comparisons using daff style differencing (green add, red delete, blue modify) with the CSVHub Google Chrome extension:

  - bbest/test_csv@56515ac...db8644a https://github.com/bbest/test_csv/compare/56515ac...db8644a

    1st to 2nd: b->c, c->d, 4->8

    [image] https://cloud.githubusercontent.com/assets/2837257/6907701/465aa78a-d6ed-11e4-8655-1472b0503afe.png

  - bbest/test_csv@db8644a...97bfd69 https://github.com/bbest/test_csv/compare/db8644a...97bfd69

    2nd to 3rd: +e column with 0's

    [image] https://cloud.githubusercontent.com/assets/2837257/6907719/70ef4596-d6ed-11e4-8670-14d2d6c7df6f.png

  - bbest/test_csv@56515ac...97bfd69 https://github.com/bbest/test_csv/compare/56515ac...97bfd69

    1st to 3rd: b->c, c->d, 4->8, +e column with 0's

    [image] https://cloud.githubusercontent.com/assets/2837257/6907747/ac2ac4e6-d6ed-11e4-9f20-ded7d6355331.png

jules32 commented 9 years ago

Thanks @bbest!