Open hirochachacha opened 6 years ago
We could vendor in github.com/google/go-cmp/cmp
, which has a diff algorithm built in. When working on the standard library tests, I have wanted to reach for that package several times.
The cmp.Diff
function was designed precisely for this purpose.
@dsnet I tried. but it seems to me that it doesn't help multi-line cases.
package main
import (
"fmt"
"github.com/google/go-cmp/cmp"
)
func main() {
fmt.Println(cmp.Diff("a\nb\nc", "a\nb1\nc"))
}
output:
{string}:
-: "a\nb\nc"
+: "a\nb1\nc"
Oh, I understand.
package main
import (
"fmt"
"strings"
"github.com/google/go-cmp/cmp"
)
func main() {
fmt.Println(cmp.Diff(strings.Split("a\nb\nc", "\n"), strings.Split("a\nb1\nc", "\n")))
}
output:
{[]string}[1]:
-: "b"
+: "b1"
It seems to make more sense to encourage experimentation outside and then consider whether to bring a particular outside one in. There are also tests that run diff, and it would be nice to unify them.
Note that cmp is a lot bigger than just this one diff issue.
Putting on hold for a more specific proposal--should we use cmp, or introduce something new?
This issue might be superseded by #41980 and #45200.
I have made some progress on a text diff implementation that I think would be appropriate for the standard library. Cmp seems related but not the same, since it is about structured data, not plain text files.
Cmp seems related but not the same, since it is about structured data, not plain text files.
Isn't a diffing algorithm primarily about determining how to conform one sequence of symbols into another sequence of symbols regardless of whether it is text or not? It seems to me that we should expose a higher level diffing algorithm for any sequence of symbols and make the "text diff implementation" just use that under the hood.
The API of github.com/google/go-cmp/cmp/internal/diff
and github.com/pkg/diff/myers
are similar in that the primary function operates on a signature like:
func(nx, ny int, f func(i, j) bool) []Operation
where nx
and ny
are the lengths of the list, and f
reports whether two symbols at the provided index are equal and Operation
is a enumeration of equal
, insertion
, and removal
.
The diffing algorithm in github.com/google/go-cmp/cmp/internal/diff
additionally makes use of a modify
operator, which may require a signature like:
func(nx, ny int, f func(i, j) float64) []Operation
where f
returning 0.0
could mean completely not equal, and 1.0
as completely equal, and any value in-between as a measure of how close to equal. The algorithm can use that information to discern when to use the modify operation or not.
I believe this issue, #41980, and #45200 should all ideally share the same diffing algorithm.
\cc @josharian
It's not necessarily true that arbitrary sequences and text files should use the same algorithms. They turn out to be somewhat different. How does cmp/internal/diff get used by diff itself?
Retitling to make clear that we should add this in a way that everyone can import it. Still on hold.
It's not necessarily true that arbitrary sequences and text files should use the same algorithms. They turn out to be somewhat different. How does cmp/internal/diff get used by diff itself?
Is there a description of the algorithm that you're thinking of? The Eugene Myers algorithm that I believe @josharian implemented (and which inspired github.com/google/go-cmp/internal/diff
) is defined in terms of "determining the differences between two sequences of symbols". Abstracting the difference algorithm to operate on "symbols" (regardless of whether the "symbol" is line of text, a Go value, or something else entirely) is I believe a powerful abstraction that would be unfortunate to lose.
How does cmp/internal/diff get used by diff itself?
The main way is to provide diffing for two Go slices, where the symbols being compared are two arbitrary Go values.
It is also used by cmp
to provide specialized diffing of a Go string
or []byte
. Examples:
Providing a diff for arbitrary text, where each symbol is a rune
And taking this even farther, for natural language text, it is better for each symbol to be a Unicode grapheme cluster.
Many test code are something like:
However, these kind of code are pretty useless when we expected multi-line values. For example,
So, I propose to support diff algorithm for making error messages human-readable.
The simplest implementation is executing the
diff
command. It doesn't work on windows. But we can use fallback, so it can be a good starting point.I'm not sure that supporting the "diff" package officially is worth doing. But "internal" packages seem harmless.