sergi / go-diff

Diff, match and patch text in Go
MIT License

Investigate and implement commonSuffixLength binary search #54

Closed zimmski closed 6 years ago

zimmski commented 8 years ago

There is a TODO marker in the code for a binary search version of commonSuffixLength.
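For context, the non-binary version is a straightforward backward scan. A minimal sketch (names are illustrative, not the library's exact code):

```go
package main

import "fmt"

// commonSuffixLengthLinear walks backwards from the ends of both rune
// slices and counts how many trailing runes match. This is the linear
// approach the TODO marker proposes replacing with a binary search.
func commonSuffixLengthLinear(text1, text2 []rune) int {
	n := len(text1)
	if len(text2) < n {
		n = len(text2)
	}
	for i := 1; i <= n; i++ {
		if text1[len(text1)-i] != text2[len(text2)-i] {
			return i - 1
		}
	}
	return n
}

func main() {
	fmt.Println(commonSuffixLengthLinear([]rune("walking"), []rune("talking"))) // 6
}
```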

maksimov commented 7 years ago

@zimmski, I've tried it in a separate branch here: https://github.com/maksimov/go-diff/blob/binary-suffix-tests/diffmatchpatch/diff.go#L481 You will note that I had to use reflect.DeepEqual for slice comparison.
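A binary-search variant along the lines of the linked branch might look like the following sketch (assumed structure, mirroring the classic diff-match-patch prefix/suffix bisection; it probes trailing sub-slices and compares them with reflect.DeepEqual):

```go
package main

import (
	"fmt"
	"reflect"
)

// commonSuffixLengthBinary bisects on the suffix length: it compares
// progressively sized trailing sub-slices of text1 and text2 with
// reflect.DeepEqual, halving the search interval each round.
func commonSuffixLengthBinary(text1, text2 []rune) int {
	pointermin := 0
	pointermax := len(text1)
	if len(text2) < pointermax {
		pointermax = len(text2)
	}
	pointermid := pointermax
	pointerend := 0
	for pointermin < pointermid {
		if reflect.DeepEqual(
			text1[len(text1)-pointermid:len(text1)-pointerend],
			text2[len(text2)-pointermid:len(text2)-pointerend]) {
			pointermin = pointermid
			pointerend = pointermin
		} else {
			pointermax = pointermid
		}
		pointermid = (pointermax-pointermin)/2 + pointermin
	}
	return pointermid
}

func main() {
	fmt.Println(commonSuffixLengthBinary([]rune("walking"), []rune("talking"))) // 6
}
```

Each DeepEqual call is itself O(k) in the sub-slice length, which is the crux of the discussion below.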

There is a benchmark method here: https://github.com/maksimov/go-diff/blob/binary-suffix-tests/diffmatchpatch/diff_test.go#L1440

Benchmark results are as follows:

BenchmarkCommonSuffixLength/Linear(454)-8            1000000          2230 ns/op
BenchmarkCommonSuffixLength/Binary(454)-8           10000000           228 ns/op
...
BenchmarkCommonSuffixLength/Linear(303)-8             300000          5136 ns/op
BenchmarkCommonSuffixLength/Binary(303)-8           10000000           228 ns/op
...
BenchmarkCommonSuffixLength/Linear(4023)-8            200000         10828 ns/op
BenchmarkCommonSuffixLength/Binary(4023)-8          10000000           227 ns/op

In brackets, I output the suffix length. A few runs demonstrated a slow-down for the linear search as the suffix length increases, and almost no change at all for the binary search, which produces excellent results overall.

Here's how to run:

go test -bench=BenchmarkCommonSuffixLength ./...
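Such a benchmark can also be driven from a standalone program via testing.Benchmark; this sketch (not the linked diff_test.go, and linearSuffix is a stand-in) shows the shape of the measurement:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// linearSuffix is a stand-in for the linear commonSuffixLength scan.
func linearSuffix(t1, t2 []rune) int {
	n := len(t1)
	if len(t2) < n {
		n = len(t2)
	}
	i := 0
	for i < n && t1[len(t1)-1-i] == t2[len(t2)-1-i] {
		i++
	}
	return i
}

func main() {
	// Two inputs sharing a 1000-rune suffix but differing in the first rune.
	s1 := []rune("x" + strings.Repeat("a", 1000))
	s2 := []rune("y" + strings.Repeat("a", 1000))
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			linearSuffix(s1, s2)
		}
	})
	fmt.Println(res.NsPerOp() >= 0) // true
}
```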
zimmski commented 7 years ago

The benchmark looks amazing! :+1: Have you found out why this was commented out?

Let's submit this as a PR and continue working on it there.

josharian commented 6 years ago

Sorry to be the bearer of bad news, but the benchmark is seriously flawed. Due to the way the rune slices are constructed:

https://github.com/maksimov/go-diff/blob/f425388e4c0a50f84a09b809bab2d1e9f2d2f18d/diffmatchpatch/diff_test.go#L1442-L1445

the rune slices are identical, not just in their contents but also in their backing array. (Observe that in s2 := append(s1[n:], 't'), s2 is set up to alias s1, and s1 is also modified in the process.)
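The aliasing can be demonstrated deterministically with a slice that has spare capacity, so the append is guaranteed not to reallocate (a small illustration, not the benchmark's exact code):

```go
package main

import "fmt"

func main() {
	// Spare capacity (cap 8 > len 5) guarantees the append below
	// writes into s1's backing array instead of allocating a new one.
	s1 := make([]rune, 5, 8)
	copy(s1, []rune("hello"))
	n := 2
	s2 := append(s1[n:], 't') // no reallocation: s2 shares s1's backing array
	fmt.Println(&s2[0] == &s1[n]) // true: s2 aliases s1
	s2[0] = 'X'
	fmt.Println(string(s1)) // "heXlo": writing through s2 modified s1
}
```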

So the suffix length is always the length of the slice. And more importantly, reflect has special O(1) handling of identical slices:

https://github.com/golang/go/blob/master/src/reflect/deepequal.go#L81

You can see this in action by modifying the Intn parameter and observing that the binary search benchmark doesn't change at all, even if the size changes by orders of magnitude. But note that in basically every real-world case, the rune slices won't alias each other (particularly given the exported API).

If you use a more realistic benchmark setup, like:

    sz := 10000
    s1 := randSeq(sz)
    s2 := make([]rune, sz+1)
    n := rand.Intn(len(s1))
    copy(s2, s1[:n])
    s2[n] = '\t'
    copy(s2[n+1:], s1[n:])

then you get results like this:

BenchmarkCommonSuffixLength/Binary(9104,_10000)-8               5000         78562 ns/op        8184 B/op       1836 allocs/op
BenchmarkCommonSuffixLength/Binary(9104,_10000)-8               5000         80758 ns/op        8184 B/op       1836 allocs/op
BenchmarkCommonSuffixLength/Binary(9104,_10000)-8               5000         79645 ns/op        8184 B/op       1836 allocs/op
BenchmarkCommonSuffixLength/Linear(9104,_10000)-8             300000          1074 ns/op           0 B/op          0 allocs/op
BenchmarkCommonSuffixLength/Linear(9104,_10000)-8             300000          1033 ns/op           0 B/op          0 allocs/op
BenchmarkCommonSuffixLength/Linear(9104,_10000)-8             300000          1038 ns/op           0 B/op          0 allocs/op

This is more like what I'd expect.

Note that in the blog post, the author writes:

In the year since this was first published I've been taking a lot of heat from programmers who look at the algorithm and realize what the substring operation is doing at the processor level. They write back, call me up, or burst out laughing when they realize that my 'optimization' actually evaluates to O(n log n). I patiently explain that high-level languages don't work that way, ...

But Go is not a high-level language in the relevant sense. It doesn't always intern strings, and while there are optimized vectorized routines for string comparison, in the general case, I very strongly suspect they're not going to be enough to overcome the O(n log n) factor.

So--sadly--I think this is a non-starter.

However, I do have a simple patch that squeezes a couple of percent of performance out of commonPrefixLength. I'd be happy to send it (and more) as a PR, but it doesn't appear that this project is being actively maintained (?).

But the biggest bottleneck in both commonPrefixLength and commonSuffixLength is the string to []rune conversion, which can be addressed in two ways: (1) exporting a different API from this package to let the user avoid this cost, or (2) making the conversion faster in the Go compiler and runtime (which is probably possible; much more time has been spent on string -> []byte conversion).
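As an illustration of option (1), a helper could decode UTF-8 in place with the unicode/utf8 package and skip the []rune allocation entirely. This is a hypothetical sketch, not go-diff's actual API:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// commonPrefixLengthStrings counts the common rune prefix of two strings
// without converting either to []rune first, avoiding the allocation
// identified as the bottleneck. (Illustrative name, not go-diff's code.)
func commonPrefixLengthStrings(a, b string) int {
	n := 0
	for len(a) > 0 && len(b) > 0 {
		r1, size1 := utf8.DecodeRuneInString(a)
		r2, size2 := utf8.DecodeRuneInString(b)
		if r1 != r2 || size1 != size2 {
			break
		}
		n++
		a, b = a[size1:], b[size2:]
	}
	return n
}

func main() {
	fmt.Println(commonPrefixLengthStrings("héllo", "hérd")) // 2
}
```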

maksimov commented 6 years ago

Thanks @josharian! It's true I haven't been active myself, but I think you should still send your patches (especially if you already have them!)