Closed kdarkhan closed 1 year ago
Here is the benchmark comparison with the original code and the new code from this PR:
## Original code
BenchmarkDiffCommonPrefix-16 6884680 153.6 ns/op
BenchmarkDiffCommonSuffix-16 7614050 152.6 ns/op
BenchmarkCommonLength/prefix/empty-16 1000000000 0.7482 ns/op
BenchmarkCommonLength/prefix/short-16 468785274 2.285 ns/op
BenchmarkCommonLength/prefix/long-16 2215281 522.5 ns/op
BenchmarkCommonLength/suffix/empty-16 935044214 1.257 ns/op
BenchmarkCommonLength/suffix/short-16 390328588 3.023 ns/op
BenchmarkCommonLength/suffix/long-16 1622370 763.2 ns/op
BenchmarkDiffHalfMatch-16 11480 107185 ns/op
BenchmarkDiffCleanupSemantic-16 571 2186640 ns/op
BenchmarkDiffMain-16 1 1166282802 ns/op
BenchmarkDiffMainLarge-16 16 94201231 ns/op
BenchmarkDiffMainRunesLargeLines-16 394 2833419 ns/op
BenchmarkDiffMainRunesLargeDiffLines-16 24 52753756 ns/op
## Code in this PR
BenchmarkDiffCommonPrefix-16 7782129 151.3 ns/op
BenchmarkDiffCommonSuffix-16 7802674 155.7 ns/op
BenchmarkCommonLength/prefix/empty-16 1000000000 0.7727 ns/op
BenchmarkCommonLength/prefix/short-16 461628601 2.292 ns/op
BenchmarkCommonLength/prefix/long-16 2258307 514.1 ns/op
BenchmarkCommonLength/suffix/empty-16 939004714 1.222 ns/op
BenchmarkCommonLength/suffix/short-16 353210576 3.125 ns/op
BenchmarkCommonLength/suffix/long-16 1621677 765.0 ns/op
BenchmarkDiffHalfMatch-16 10000 109821 ns/op
BenchmarkDiffCleanupSemantic-16 658 1978913 ns/op
BenchmarkDiffMain-16 1 1010640421 ns/op
BenchmarkDiffMainLarge-16 10 139771776 ns/op
BenchmarkDiffMainRunesLargeLines-16 1632 722348 ns/op
BenchmarkDiffMainRunesLargeDiffLines-16 27 43902502 ns/op
I did not realize that there is an existing PR https://github.com/sergi/go-diff/pull/120 which implements somewhat similar solution.
My PR implements a solution closer to the Javascript version google/diff-match-patch
.
LGTM
@sergi could we publish a tag with this in it?
Re-ping for @sergi
The suggested approach for doing line level diffing is the following set of steps:
ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)
diffs = DiffMain(ti1, ti2)
DiffCharsToLines(diff, linesIdx)
The original implementation in
google/diff-match-patch
uses unicode codepoints for storing indices inti1
andti2
joined by an empty string. Current implementation in this repo stores them as integers joined by a comma. While this implementation makesti1
andti2
more readable, it introduces bugs when trying to rely on it when doing line level diffing withDiffMain
. The root cause of the issue is that an integer line index might span more than one character/rune, andDiffMain
can assume that two different lines having the same index prefix match partially. For example, indices 123 and 129 will have partial match12
. In that example, the diff will show lines 3 and 9 which is not correct. A simple failing test case demonstrating this issue is available atTestDiffPartialLineIndex
.In this PR I am adjusting the algorithm to use the same approach as in diff-match-patch by storing each line index as a rune. While a rune in Golang is a type alias to uint32, not every uint32 can be a valid rune. During string to rune slice conversion invalid runes will be replaced with
utf.RuneError
.The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding
The first 127 lines will work the fastest as they are represented as a single bytes. Higher numbers are represented as 2-4 bytes.
In addition to that, the range
U+D800 - U+DFFF
contains invalid codepoints. and all codepoints higher or equal to0xD800
are incremented by0xDFFF - 0xD800
.The maximum representable integer using this approach is 1'112'060. This improves on Javascript implementation which currently bails out when files have more than 65535 lines.