pkg / diff

BSD 3-Clause "New" or "Revised" License
88 stars 9 forks source link

diff.Text OOMs with large different inputs #26

Open myitcv opened 3 years ago

myitcv commented 3 years ago
$ go version
go version devel +dec3d00b28 Thu Mar 25 04:21:29 2021 +0000 linux/amd64
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/myitcv/.cache/go-build"
GOENV="/home/myitcv/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/myitcv/gostuff/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/myitcv/gostuff"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/myitcv/gos"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/myitcv/gos/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel +dec3d00b28 Thu Mar 25 04:21:29 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/tmp.uHtLe5z1Zd/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build267893791=/tmp/go-build -gno-record-gcc-switches"

To reproduce:

cd $(mktemp -d)
curl -sL https://github.com/pkg/diff/files/6221280/repro.zip -o repro.zip
unzip repro.zip
go run main.go

Above, repro.zip is a reference to: repro.zip

mvdan commented 3 years ago

I added some quick debug printing, and it looks like you're just butting heads with the double loop in myers.Diff. It isn't surprising that, for an input of many megabytes, O(n*n) results in OOM. I run out of memory at about line 38k with 16GiB of system memory, and there are a total of 130k lines. That seems to suggest you'd need at least 64GiB to finish :)

I think the right solution is to drop the quadratic behavior. This is already mentioned in the godoc:

// Diff calculates an edit.Script for ab using the Myers diff algorithm.
// This implementation uses the algorithm described in the first half
// of Myers' paper, which requires quadratric space.
// (An implementation of the linear space version is forthcoming.)
//
// Because diff calculation can be expensive, Myers supports cancellation via ctx.
func Diff(ctx context.Context, ab Pair) edit.Script {

Russ also hints at that here: https://github.com/golang/go/issues/45200#issuecomment-806171109

I'm not sure if @josharian plans to work on this soon (I imagine not) or how soon Russ's will be ready, since he said "next month".

Until either of those, I'd suggest to avoid diffing if the input is large (e.g. beyond 10k lines or 1MiB) or to add a timeout via a context. I'd probably go with the former, because one could imagine a very fast CPU with not a lot of spare memory still OOMing with a few-second timeout.

myitcv commented 3 years ago

I'd suggest to avoid diffing if the input is large (e.g. beyond 10k lines or 1MiB) or to add a timeout via a context. I'd probably go with the former, because one could imagine a very fast CPU with not a lot of spare memory still OOMing with a few-second timeout.

Then we need to fix testscript 😄

https://github.com/rogpeppe/go-internal/blob/8ef127344c684f0eb23e5eea40f303b74de097b2/testscript/cmd.go#L145

myitcv commented 2 years ago

Just wondering if anyone is aware of any progress on the issues/topics related to this one?

josharian commented 2 years ago

Not that I'm aware of. If you're interested in implementing the linear time diff, I'm happy to point you in the right direction.

myitcv commented 2 years ago

Not that I'm aware of. If you're interested in implementing the linear time diff, I'm happy to point you in the right direction.

Sadly I don't see myself having the time. But will tweet out in case someone else is interested. A fun little project.

thepudds commented 2 years ago

Hi there :wave:. FYI, Russ has a new diff implementation in CL 384255. It is an implementation of patience diff. It runs in O(n log n) time, and I think his version might be O(n + m) space.

I copied it here, and added a trivial cmd to be able run it on files:

https://github.com/thepudds/patience-diff

For the example from this issue, it runs in 0.07 sec and uses 38 MB RSS, vs. pkg/diff runs in 19 sec and 18 GB RSS (roughly ~250x slower and ~500x more RAM).

patience-diff:

$ go install github.com/thepudds/patience-diff/cmd/patience-diff@latest

$ /usr/bin/time -v patience-diff got want > patience-diff.out
        [...]
        User time (seconds): 0.06
        System time (seconds): 0.01
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
        Maximum resident set size (kbytes): 38876

pkg/diff repro:

$ /usr/bin/time -v ./mod.com got want > pkg-diff.out
        [...]
        User time (seconds): 10.79
        System time (seconds): 8.90
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.38
        Maximum resident set size (kbytes): 18085500

The patience-diff version is not a minimal diff:

$ wc -l *.out
96935 patience-diff.out
56391 pkg-diff.out

...but at least in theory, it is often friendlier to humans.

A better location for this might be rogpeppe/go-internal (which is an accurate name for where this came from, and would eliminate an extra dependency from rogpeppe/go-internal, and the more reasons there are to go to rogpeppe/go-internal then the more testscripts probably spreads, which is a good thing...). I'd be interested in sending a PR there if there is interest.

Alternatively, pkg/diff could be updated to use this (perhaps as a parallel package to the meyers package), but I think that might be more work?

And of course, people could use the copy at thepudds/patience-diff, but that might mean random gophers having to wonder whether or not they can trust this "thepudds" character 😅.

Finally, I should mention that Russ' CL hasn't been merged yet.

thepudds commented 2 years ago

All that said, there is also https://github.com/golang/go/issues/45200 -- presumably that will be a time and space efficient algorithm. If that ends up happening soon, then perhaps not much benefit having this elsewhere.

josharian commented 2 years ago

I'm mostly AFK for the next week. But at a high level, I'd be absolutely delighted to add a patience diff implementation here, as long as Russ is OK with it. I've been meaning to write one myself, but it never got to the top of the stack. Ideally it would be just like the myers diff one, that is, in its own package, with a similar API, etc. And then update the default to be patience instead of myers.

random gophers having to wonder whether or not they can trust this "thepudds" character

Those in the know know that they can. :)

thepudds commented 2 years ago

Hi @josharian, I’ll take a closer look at the API that the myers package presents here, and see if I can send a PR. No guarantees though. And agreed it makes sense to confirm with Russ before getting too far.

mvdan commented 1 year ago

For those following along, https://pkg.go.dev/github.com/rogpeppe/go-internal/diff@master exists now :)