Diff Display Formatting

b5 commented 4 years ago

We need to think about how to format & display diffs. We have two main display formats:

terminal output
react components

Git Combined Diff

Before we get to structure data diffs, I think it makes sense to start with git diffs, and build upward from there.

To get a quick example, I started with one of our recent commits by adding .patch to the end of a github URL: https://github.com/qri-io/qri/commit/98f065115abec6a2e8eae5e682d5a5c4a70d1562.patch

This is the result:

diff --git a/cmd/fsi_integration_test.go b/cmd/fsi_integration_test.go
index 6b7bbb06..235c9700 100644
--- a/cmd/fsi_integration_test.go
+++ b/cmd/fsi_integration_test.go
@@ -992,6 +992,34 @@ meta:
    }
 }

+// Test that diff before save leads to a reasonable error message
+func TestDiffBeforeSave(t *testing.T) {
+   run := NewFSITestRunner(t, "qri_test_diff_before_save")
+   defer run.Delete()
+
+   workDir := run.CreateAndChdirToWorkDir("diff_before")
+
+   // Init as a linked directory.
+   run.MustExec(t, "qri init --name diff_change --format csv")
+
+   // Verify the directory contains the files that we expect.
+   dirContents := listDirectory(workDir)
+   expectContents := []string{".qri-ref", "body.csv", "meta.json", "structure.json"}
+   if diff := cmp.Diff(expectContents, dirContents); diff != "" {
+       t.Errorf("directory contents (-want +got):\n%s", diff)
+   }
+
+   // Diff should return the expected error message
+   err := run.ExecCommand("qri diff")
+   if err == nil {
+       t.Fatal("expected error trying to init, did not get an error")
+   }
+   expect := `dataset has no versions, nothing to diff against`
+   if err.Error() != expect {
+       t.Errorf("error mismatch, expect: %s, got: %s", expect, err.Error())
+   }
+}
+
 // Test that if the meta component fails to write, init will rollback
 func TestInitMetaFailsToWrite(t *testing.T) {
    run := NewFSITestRunner(t, "qri_test_init_meta_fail")
diff --git a/lib/diff.go b/lib/diff.go
index 351f979f..99a3b716 100644
--- a/lib/diff.go
+++ b/lib/diff.go
@@ -93,7 +93,10 @@ func (r *DatasetRequests) Diff(p *DiffParams, res *DiffResponse) (err error) {
        return err
    }
    err = repo.CanonicalizeDatasetRef(r.inst.node.Repo, &ref)
-   if err != nil && err != repo.ErrNoHistory {
+   if err != nil {
+       if err == repo.ErrNoHistory {
+           return fmt.Errorf("dataset has no versions, nothing to diff against")
+       }
        return err
    }
    ds, err := dsfs.LoadDataset(ctx, r.inst.node.Repo.Store(), ref.Path)

The output is both human readable an machine parsable, and forms the basis for a lot of diff-representation UI.

The git documentation is the best place to start for understanding the format. git's combined diff format

The git diff format itself is the source of the three lines of contextual padding.

Constructing an example manually

Looking at the above, I've constructed a patch by hand to use as an example. It removes three lines, adds one, and has the three lines of context:

--- a/readme.md
+++ b/readme.md
@@ -0,0 +0,0 @@ # World Bank Population
 ### Only Existing Territorial Regions

-The world Bank dataset of population is just silly!
-If you add up the “population” column, you don’t get the expected
-7 billion number, because they include “significant regions”.
+This is a dataset of the World Bank Population without Significant Regions.
 The most important characteristic of this dataset is totalling the “popluation”
 column will equal the total estimated population for a given year.
 Other than that, the datasets are exactly the same.

To build a react component out of this, I think it's easiest to build a JSON data structure representation of this unix format. A first stab:

{
  "meta" : {
    "---": "a/readme.md",
    "+++": "b/readme.md",
    "added": 1,
    "removed": 3 
  },
  "patch": [
    [[ [1,9],[1,7] ], "# World Bank Population"],
    [" ", "### Only Existing Territorial Regions"],
    [" ", ""],
    ["-", "The world Bank dataset of population is just silly!"],
    ["-", "If you add up the “population” column, you don’t get the expected"],
    ["-", "7 billion number, because they include “significant regions”."],
    ["+", "This is a dataset of the World Bank Population without Significant Regions."],
    [" ", "The most important characteristic of this dataset is totalling the “popluation”"],
    [" ", "column will equal the total estimated population for a given year."],
    [" ", "Other than that, the datasets are exactly the same."],
  ]
}

the patch structure is a tuple that mimics the text output of the git combined diff, using a space to indicate no change, + for add and - for remove. If the first entry is an array type, it's an equivelat to @@ in git diff. The [ [1,8], [1,5] ] are line change ranges. The second element in the tuple is always the text in question.

cc @dustmop

b5 commented 4 years ago

Dealing with Objects

We've been relying on contiguous array representation. This isn't possible with objects, which don't have a defined total order. It's also a problem if we'd ever like to skip indices in array diff output, which I think we do.

Let's diff two objects:

djs.json

{
  "dj dj booth": { "rating": 1, "uses_soundcloud": true },
  "dj wipeout": { "rating": 2, "uses_soundcloud": true },
  "com truise": { "rating": 4, "uses_soundcloud": true },
  "Ryan Hemsworth": { "rating": 4, "uses_soundcloud": true }
}

djs_edits.json

{
  "DJ dj booth": { "rating": 1, "uses_soundcloud": true },
  "DJ wipeout": { "rating": 2, "uses_soundcloud": true },
  "com truise": { "rating": 4, "uses_soundcloud": true },
  "Ryan Hemsworth": { "rating": 4, "uses_soundcloud": true },
  "Susan Collins": { "rating": 3, "uses_soundcloud": false },
}

here's a first crack at a hand-constructed diff:

{
  "header": {
    "-": "djs.json",
    "+": "djs_edits.json",
    "stat": { "..." : null }
  },
  "script": [
    [["-","com truise", "Ryan Hemsworth"],["+","com truise", "Susan Collins"]],
    [" ","com truise", { "rating": 4, "uses_soundcloud": true }],
    ["-", "dj dj Booth", { "rating": 1, "uses_soundcloud": true }],
    ["+", "DJ dj Booth", { "rating": 1, "uses_soundcloud": true }],
    ["-", "dj wipeout", { "rating": 2, "uses_soundcloud": true }],
    ["+", "DJ wipeout", { "rating": 2, "uses_soundcloud": true }],
    [" ", "Ryan Hemsworth", { "rating": 4, "uses_soundcloud": true }],
    ["+", "Susan Collins", { "rating": 3, "uses_soundcloud": false }]
  ]
}

To do object patching properly, it's easiest if we add another entry to the change tuple that specifies the key. In array changes we can either list a numeric or string index

b5 commented 4 years ago

Update on this that came from implementing qri-io/deepdiff#6:

The structure of a diff tuple is:

 tuple = [ change_type, key, value, []tuple ]

The key element in the tuple is definitely needed, and can be one of three types: null, string, and number. strings are still used as object keys, numbers are used for indexing into array indices, and null is used for referencing the root value itself.
The value element defines the current element. If value is non-null, the tuple should be 3 elements long (no []tuple section)
If the tuple has a []tuple element, the value element is ignored. a 4-element tuple indicates we're dropping into a compound type to describe changes
We still haven't figured out the syntax for format context jumps (the @@ part of a git diff)

b5 commented 4 years ago

This new output format is now in qri diff

qri diff --format json a.csv b.csv | jq

qri-io / rfcs