Open myitcv opened 3 years ago
I added some quick debug printing, and it looks like you're just butting heads with the double loop in myers.Diff
. It isn't surprising that, for an input of many megabytes, O(n*n) results in OOM. I run out of memory at about line 38k with 16GiB of system memory, and there are a total of 130k lines. That seems to suggest you'd need at least 64GiB to finish :)
I think the right solution is to drop the quadratic behavior. This is already mentioned in the godoc:
// Diff calculates an edit.Script for ab using the Myers diff algorithm.
// This implementation uses the algorithm described in the first half
// of Myers' paper, which requires quadratric space.
// (An implementation of the linear space version is forthcoming.)
//
// Because diff calculation can be expensive, Myers supports cancellation via ctx.
func Diff(ctx context.Context, ab Pair) edit.Script {
Russ also hints at that here: https://github.com/golang/go/issues/45200#issuecomment-806171109
I'm not sure if @josharian plans to work on this soon (I imagine not) or how soon Russ's will be ready, since he said "next month".
Until either of those, I'd suggest to avoid diffing if the input is large (e.g. beyond 10k lines or 1MiB) or to add a timeout via a context. I'd probably go with the former, because one could imagine a very fast CPU with not a lot of spare memory still OOMing with a few-second timeout.
I'd suggest to avoid diffing if the input is large (e.g. beyond 10k lines or 1MiB) or to add a timeout via a context. I'd probably go with the former, because one could imagine a very fast CPU with not a lot of spare memory still OOMing with a few-second timeout.
Then we need to fix testscript
😄
Just wondering if anyone is aware of any progress on the issues/topics related to this one?
Not that I'm aware of. If you're interested in implementing the linear time diff, I'm happy to point you in the right direction.
Not that I'm aware of. If you're interested in implementing the linear time diff, I'm happy to point you in the right direction.
Sadly I don't see myself having the time. But will tweet out in case someone else is interested. A fun little project.
Hi there :wave:. FYI, Russ has a new diff implementation in CL 384255. It is an implementation of patience diff. It runs in O(n log n) time, and I think his version might be O(n + m) space.
I copied it here, and added a trivial cmd to be able run it on files:
https://github.com/thepudds/patience-diff
For the example from this issue, it runs in 0.07 sec and uses 38 MB RSS, vs. pkg/diff runs in 19 sec and 18 GB RSS (roughly ~250x slower and ~500x more RAM).
patience-diff:
$ go install github.com/thepudds/patience-diff/cmd/patience-diff@latest
$ /usr/bin/time -v patience-diff got want > patience-diff.out
[...]
User time (seconds): 0.06
System time (seconds): 0.01
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
Maximum resident set size (kbytes): 38876
pkg/diff repro:
$ /usr/bin/time -v ./mod.com got want > pkg-diff.out
[...]
User time (seconds): 10.79
System time (seconds): 8.90
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.38
Maximum resident set size (kbytes): 18085500
The patience-diff version is not a minimal diff:
$ wc -l *.out
96935 patience-diff.out
56391 pkg-diff.out
...but at least in theory, it is often friendlier to humans.
A better location for this might be rogpeppe/go-internal (which is an accurate name for where this came from, and would eliminate an extra dependency from rogpeppe/go-internal, and the more reasons there are to go to rogpeppe/go-internal then the more testscripts probably spreads, which is a good thing...). I'd be interested in sending a PR there if there is interest.
Alternatively, pkg/diff could be updated to use this (perhaps as a parallel package to the meyers package), but I think that might be more work?
And of course, people could use the copy at thepudds/patience-diff, but that might mean random gophers having to wonder whether or not they can trust this "thepudds" character 😅.
Finally, I should mention that Russ' CL hasn't been merged yet.
All that said, there is also https://github.com/golang/go/issues/45200 -- presumably that will be a time and space efficient algorithm. If that ends up happening soon, then perhaps not much benefit having this elsewhere.
I'm mostly AFK for the next week. But at a high level, I'd be absolutely delighted to add a patience diff implementation here, as long as Russ is OK with it. I've been meaning to write one myself, but it never got to the top of the stack. Ideally it would be just like the myers diff one, that is, in its own package, with a similar API, etc. And then update the default to be patience instead of myers.
random gophers having to wonder whether or not they can trust this "thepudds" character
Those in the know know that they can. :)
Hi @josharian, I’ll take a closer look at the API that the myers package presents here, and see if I can send a PR. No guarantees though. And agreed it makes sense to confirm with Russ before getting too far.
For those following along, https://pkg.go.dev/github.com/rogpeppe/go-internal/diff@master exists now :)
To reproduce:
Above,
repro.zip
is a reference to: repro.zip