sergi / go-diff

Diff, match and patch text in Go
MIT License
1.84k stars 209 forks source link

Munged text leads to incorrect diffs #140

Open schroederc opened 1 year ago

schroederc commented 1 year ago

https://github.com/sergi/go-diff/commit/db1b095f5e7c905e196ff6bfd56189a41aa76309 introduces a bug in its change from diffLinesToRunesMunge to diffLinesToStringsMunge. Since each line is represented by 1 or more ascii characters, it's possible for the diffing algorithm to split hashed lines incorrectly whereas before the rune indexed lines were indivisible.

For instance, DiffLinesToChars could return hashed strings such as:

"1,2,3,4,5,6,7,8,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,35,36,37,38,39,40,41"
"42,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,43,36,37,38,39,44,45,46,47"

DiffMain may then split the leading 42 such as:

[{Delete 1,2,3,} {Equal 4} {Delete ,5,6,7,8} {Insert 2} {Equal ,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,} {Insert 4} {Equal 3} {Delete 5} {Equal ,36,37,38,39,4} {Delete 0} {Insert 4} {Equal ,4} {Delete 1} {Insert 5,46,47}] 

And the resulting diff after hydration is completely wrong.

This affects users of the DiffLinesTo* APIs as well as any user that passes true for checklines in DiffMain or DiffMainRunes.

schroederc commented 1 year ago

Here's a rough test case:

func TestLineDiff(t *testing.T) {
    t.Run("Chars", func(t *testing.T) {
        before := `1
2
3
4
5
6
7
8
9
`
        after := `10
`

        dmp := diffmatchpatch.New()
        txt1, txt2, lines := dmp.DiffLinesToChars(string(before), string(after))
        diff := dmp.DiffMain(txt1, txt2, false)
        diff = dmp.DiffCharsToLines(diff, lines)

        var foundBefore, foundAfter string
        for _, d := range diff {
            switch d.Type {
            case eq:
                foundBefore += d.Text
                foundAfter += d.Text
            case del:
                foundBefore += d.Text
            case ins:
                foundAfter += d.Text
            }
        }

        if foundBefore != before {
            t.Errorf("Expected before %q; found %q", before, foundBefore)
        }
        if foundAfter != after {
            t.Errorf("Expected after %q; found %q", after, foundAfter)
        }
    })

    t.Run("Runes", func(t *testing.T) {
        before := `1
2
3
4
5
6
7
8
9
`
        after := `10
`

        dmp := diffmatchpatch.New()
        txt1, txt2, lines := dmp.DiffLinesToRunes(string(before), string(after))
        diff := dmp.DiffMainRunes(txt1, txt2, false)
        diff = dmp.DiffCharsToLines(diff, lines)

        var foundBefore, foundAfter string
        for _, d := range diff {
            switch d.Type {
            case eq:
                foundBefore += d.Text
                foundAfter += d.Text
            case del:
                foundBefore += d.Text
            case ins:
                foundAfter += d.Text
            }
        }

        if foundBefore != before {
            t.Errorf("Expected before %q; found %q", before, foundBefore)
        }
        if foundAfter != after {
            t.Errorf("Expected after %q; found %q", after, foundAfter)
        }
    })
}
schroederc commented 1 year ago

I sent out https://github.com/sergi/go-diff/pull/141 to revert the implementation to the limited, but not incorrect, rune approach. Some more fundamental changes would probably be preferable if someone has more time to work on it.

ernestrc commented 11 months ago

Ran into this as well. I don't need to diff large chunks (context: https://github.com/sergi/go-diff/issues/89#issuecomment-591376325) so I downgraded to v1.1.0, which seems to be, still to this day, widely used (i.e. go-git).

fcharlie commented 9 months ago

Is there any progress on this issue?