proposal: add golang.org/x/text/diff package

hirochachacha commented 6 years ago

Many test code are something like:

func TestAAA(t *testing.T) {
  if got != want {
    t.Errorf("got %v, want %v")
  }
}

However, these kind of code are pretty useless when we expected multi-line values. For example,

--- FAIL: TestExamples (0.00s)
    example_test.go:186: Import: got Play == "package main\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\t\"os/exec\"\n)\n\nfunc main() {\n\tout, err := exec.Command(\"date\").Output()\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tfmt.Printf(\"The date is %s\\n\", out)\n}\n", want "package main\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\t\"os/exec\"\n)\n\n// aaa\n\nfunc main() {\n    out, err := exec.Command(\"date\").Output()\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tfmt.Printf(\"The date is %s\\n\", out)\n}\n"
FAIL

So, I propose to support diff algorithm for making error messages human-readable.

The simplest implementation is executing the diff command. It doesn't work on windows. But we can use fallback, so it can be a good starting point.

I'm not sure that supporting the "diff" package officially is worth doing. But "internal" packages seem harmless.

dsnet commented 6 years ago

We could vendor in github.com/google/go-cmp/cmp, which has a diff algorithm built in. When working on the standard library tests, I have wanted to reach for that package several times.

The cmp.Diff function was designed precisely for this purpose.

hirochachacha commented 6 years ago

@dsnet I tried. but it seems to me that it doesn't help multi-line cases.

package main

import (
    "fmt"

    "github.com/google/go-cmp/cmp"
)

func main() {
    fmt.Println(cmp.Diff("a\nb\nc", "a\nb1\nc"))
}

output:

{string}:
    -: "a\nb\nc"
    +: "a\nb1\nc"

hirochachacha commented 6 years ago

Oh, I understand.

package main

import (
    "fmt"
    "strings"

    "github.com/google/go-cmp/cmp"
)

func main() {
    fmt.Println(cmp.Diff(strings.Split("a\nb\nc", "\n"), strings.Split("a\nb1\nc", "\n")))
}

output:

{[]string}[1]:
    -: "b"
    +: "b1"

rsc commented 6 years ago

It seems to make more sense to encourage experimentation outside and then consider whether to bring a particular outside one in. There are also tests that run diff, and it would be nice to unify them.

Note that cmp is a lot bigger than just this one diff issue.

ianlancetaylor commented 6 years ago

Putting on hold for a more specific proposal--should we use cmp, or introduce something new?

dsnet commented 3 years ago

This issue might be superseded by #41980 and #45200.

rsc commented 3 years ago

I have made some progress on a text diff implementation that I think would be appropriate for the standard library. Cmp seems related but not the same, since it is about structured data, not plain text files.

dsnet commented 3 years ago

Cmp seems related but not the same, since it is about structured data, not plain text files.

Isn't a diffing algorithm primarily about determining how to conform one sequence of symbols into another sequence of symbols regardless of whether it is text or not? It seems to me that we should expose a higher level diffing algorithm for any sequence of symbols and make the "text diff implementation" just use that under the hood.

The API of github.com/google/go-cmp/cmp/internal/diff and github.com/pkg/diff/myers are similar in that the primary function operates on a signature like:

func(nx, ny int, f func(i, j) bool) []Operation

where nx and ny are the lengths of the list, and f reports whether two symbols at the provided index are equal and Operation is a enumeration of equal, insertion, and removal.

The diffing algorithm in github.com/google/go-cmp/cmp/internal/diff additionally makes use of a modify operator, which may require a signature like:

func(nx, ny int, f func(i, j) float64) []Operation

where f returning 0.0 could mean completely not equal, and 1.0 as completely equal, and any value in-between as a measure of how close to equal. The algorithm can use that information to discern when to use the modify operation or not.

I believe this issue, #41980, and #45200 should all ideally share the same diffing algorithm.

\cc @josharian

rsc commented 3 years ago

It's not necessarily true that arbitrary sequences and text files should use the same algorithms. They turn out to be somewhat different. How does cmp/internal/diff get used by diff itself?

rsc commented 3 years ago

Retitling to make clear that we should add this in a way that everyone can import it. Still on hold.

dsnet commented 3 years ago

It's not necessarily true that arbitrary sequences and text files should use the same algorithms. They turn out to be somewhat different. How does cmp/internal/diff get used by diff itself?

Is there a description of the algorithm that you're thinking of? The Eugene Myers algorithm that I believe @josharian implemented (and which inspired github.com/google/go-cmp/internal/diff) is defined in terms of "determining the differences between two sequences of symbols". Abstracting the difference algorithm to operate on "symbols" (regardless of whether the "symbol" is line of text, a Go value, or something else entirely) is I believe a powerful abstraction that would be unfortunate to lose.

How does cmp/internal/diff get used by diff itself?

The main way is to provide diffing for two Go slices, where the symbols being compared are two arbitrary Go values.

It is also used by cmp to provide specialized diffing of a Go string or []byte. Examples:

Providing a diff for arbitrary binary data, where each symbol is a byte
Providing a diff for arbitrary text, where each symbol is a rune
Providing a diff for strings that look like multilined text data, where each symbol is a line of text

josharian commented 3 years ago

Providing a diff for arbitrary text, where each symbol is a rune

And taking this even farther, for natural language text, it is better for each symbol to be a Unicode grapheme cluster.

golang / go

proposal: add golang.org/x/text/diff package #23113