sergi / go-diff

Diff, match and patch text in Go
MIT License
1.81k stars 207 forks source link

Fix line diff by using rune index without a separator #136

Closed kdarkhan closed 1 year ago

kdarkhan commented 1 year ago

The suggested approach for doing line level diffing is the following set of steps:

  1. ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)
  2. diffs = DiffMain(ti1, ti2)
  3. DiffCharsToLines(diff, linesIdx)

The original implementation in google/diff-match-patch uses unicode codepoints for storing indices in ti1 and ti2 joined by an empty string. Current implementation in this repo stores them as integers joined by a comma. While this implementation makes ti1 and ti2 more readable, it introduces bugs when trying to rely on it when doing line level diffing with DiffMain. The root cause of the issue is that an integer line index might span more than one character/rune, and DiffMain can assume that two different lines having the same index prefix match partially. For example, indices 123 and 129 will have partial match 12. In that example, the diff will show lines 3 and 9 which is not correct. A simple failing test case demonstrating this issue is available at TestDiffPartialLineIndex.

In this PR I am adjusting the algorithm to use the same approach as in diff-match-patch by storing each line index as a rune. While a rune in Golang is a type alias to uint32, not every uint32 can be a valid rune. During string to rune slice conversion invalid runes will be replaced with utf.RuneError.

The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding

The first 127 lines will work the fastest as they are represented as a single bytes. Higher numbers are represented as 2-4 bytes.

In addition to that, the range U+D800 - U+DFFF contains invalid codepoints. and all codepoints higher or equal to 0xD800 are incremented by 0xDFFF - 0xD800.

The maximum representable integer using this approach is 1'112'060. This improves on Javascript implementation which currently bails out when files have more than 65535 lines.

kdarkhan commented 1 year ago

Here is the benchmark comparison with the original code and the new code from this PR:

## Original code
BenchmarkDiffCommonPrefix-16                     6884680               153.6 ns/op
BenchmarkDiffCommonSuffix-16                     7614050               152.6 ns/op
BenchmarkCommonLength/prefix/empty-16           1000000000               0.7482 ns/op
BenchmarkCommonLength/prefix/short-16           468785274                2.285 ns/op
BenchmarkCommonLength/prefix/long-16             2215281               522.5 ns/op
BenchmarkCommonLength/suffix/empty-16           935044214                1.257 ns/op
BenchmarkCommonLength/suffix/short-16           390328588                3.023 ns/op
BenchmarkCommonLength/suffix/long-16             1622370               763.2 ns/op
BenchmarkDiffHalfMatch-16                          11480            107185 ns/op
BenchmarkDiffCleanupSemantic-16                      571           2186640 ns/op
BenchmarkDiffMain-16                                   1        1166282802 ns/op
BenchmarkDiffMainLarge-16                             16          94201231 ns/op
BenchmarkDiffMainRunesLargeLines-16                  394           2833419 ns/op
BenchmarkDiffMainRunesLargeDiffLines-16               24          52753756 ns/op

## Code in this PR
BenchmarkDiffCommonPrefix-16                     7782129               151.3 ns/op
BenchmarkDiffCommonSuffix-16                     7802674               155.7 ns/op
BenchmarkCommonLength/prefix/empty-16           1000000000               0.7727 ns/op
BenchmarkCommonLength/prefix/short-16           461628601                2.292 ns/op
BenchmarkCommonLength/prefix/long-16             2258307               514.1 ns/op
BenchmarkCommonLength/suffix/empty-16           939004714                1.222 ns/op
BenchmarkCommonLength/suffix/short-16           353210576                3.125 ns/op
BenchmarkCommonLength/suffix/long-16             1621677               765.0 ns/op
BenchmarkDiffHalfMatch-16                          10000            109821 ns/op
BenchmarkDiffCleanupSemantic-16                      658           1978913 ns/op
BenchmarkDiffMain-16                                   1        1010640421 ns/op
BenchmarkDiffMainLarge-16                             10         139771776 ns/op
BenchmarkDiffMainRunesLargeLines-16                 1632            722348 ns/op
BenchmarkDiffMainRunesLargeDiffLines-16               27          43902502 ns/op
kdarkhan commented 1 year ago

I did not realize that there is an existing PR https://github.com/sergi/go-diff/pull/120 which implements somewhat similar solution.

My PR implements a solution closer to the Javascript version google/diff-match-patch.

sergi commented 1 year ago

LGTM

ffluk3 commented 1 year ago

@sergi could we publish a tag with this in it?

IOrlandoni commented 9 months ago

Re-ping for @sergi