golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.03k stars 17.44k forks source link

cmd/trace, x/exp/trace: nondeterministic output, seems related to timestamp tie-breaking #68277

Closed rhysh closed 3 weeks ago

rhysh commented 1 month ago
$ ~/sdk/go1.23/bin/go version
go version go1.23rc1 darwin/arm64

(This also affects go1.22.4.)

After preparing an execution trace like so ...

$ ~/sdk/go1.23/bin/go test net/http -run='^$' -bench='BenchmarkClientServerParallel/64/h2' -benchtime=100ms -trace=/tmp/trace123_h2

... I read it back with go tool trace -d=1 (and with other tools based on golang.org/x/exp/trace). The results vary each time due to differences in the reported timestamp for some events. The events in question have sequential timestamps, which makes me suspect there's a tie which is getting resolved when reading back the trace, exposing some nondeterminism from map iteration order.

I expected the results of go tool trace -d=1 to be identical each time for a given execution trace. I expected that reading an execution trace with a golang.org/x/exp/trace.Reader and passing each Event to fmt.Print would give identical results on each run. I expected that loading an execution trace twice with golang.org/x/exp/trace.Reader and advancing to a particular timestamp would result in seeing the "same" Event.

CC @golang/runtime @mknyszek

Note that the contents of Events 2 and 3 are swapped between the first and second run, with the timestamps remaining monotonic.

$ ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 | grep 8653440991110
M=8191703744 P=4 G=-1 StateTransition Time=865344099111040 Resource=Goroutine(20062) Reason="" GoID=20062 Runnable->Running
M=6163394560 P=1 G=21401 StateTransition Time=865344099111041 Resource=Goroutine(28134) Reason="" GoID=28134 Waiting->Runnable
M=6165688320 P=5 G=28541 StateTransition Time=865344099111042 Resource=Goroutine(28541) Reason="select" GoID=28541 Running->Waiting

$ ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 | grep 8653440991110
M=8191703744 P=4 G=-1 StateTransition Time=865344099111040 Resource=Goroutine(20062) Reason="" GoID=20062 Runnable->Running
M=6165688320 P=5 G=28541 StateTransition Time=865344099111041 Resource=Goroutine(28541) Reason="select" GoID=28541 Running->Waiting
M=6163394560 P=1 G=21401 StateTransition Time=865344099111042 Resource=Goroutine(28134) Reason="" GoID=28134 Waiting->Runnable

With 930k events, several thousand will end up as additional/non-unique lines between two runs of go tool trace -d=1.

$ ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 | grep '^M=' | wc -l
  930838
$ (~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 ; ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2) | grep '^M=' | sort -u | wc -l
  934396
$ (~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 ; ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2) | grep '^M=' | sort -u | wc -l
  933605
$ (~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 ; ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2) | grep '^M=' | sort -u | wc -l
  938976
$ (~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 ; ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2) | grep '^M=' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
5618 1
928029 2
$ (~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 ; ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2) | grep '^M=' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
4276 1
928700 2
$ (~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2 ; ~/sdk/go1.23/bin/go tool trace -d=1 /tmp/trace123_h2) | grep '^M=' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
5630 1
928023 2
gabyhelp commented 1 month ago

Related Issues

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

rhysh commented 1 month ago

Yes, this appears due to map iteration in https://github.com/golang/go/blob/go1.23rc1/src/internal/trace/reader.go#L145

Forcing a sort (such as with slices.Sorted(maps.Keys(r.gen.batches)), for newer toolchains) resolves it. I can prepare a patch.

$ git -C /usr/local/go diff --stat
 src/internal/trace/reader.go | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

$ (go run cmd/trace -d=1 /tmp/trace123_h2 ; go run cmd/trace -d=1 /tmp/trace123_h2) | grep '^M=' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
930838 2
gopherbot commented 1 month ago

Change https://go.dev/cl/596355 mentions this issue: internal/trace: make Reader output deterministic

thanm commented 1 month ago

Moving this to the 1.23 milestone, please comment if any issues with that.

rhysh commented 1 month ago

Thanks. Yes, the problem starts in std's internal/trace (not the generated copy in x/exp/trace) so a release milestone is right. Go 1.23 would be great if the door is still open for small fixes like this. Otherwise 1.24.

I plan to send a CL for x/exp/trace (with a "Fixes" line) once internal/trace has https://go.dev/cl/596355, independent of the main Go release cycle.

mknyszek commented 1 month ago

Thanks for sending a fix!

I think whether this fix should go into Go 1.23 depends on whether we would backport it anyway in the Go 1.24 dev cycle. I'm not totally sure that we would, because the issue doesn't actually impact most cmd/trace functionality (?) except for the debug output, so the severity isn't very high.

That being said, I do think this is an important property of the trace parser to maintain, and I don't want any blockers for the public parser API sitting around too long. FWIW, I think golang.org/x/exp/trace could definitely get updated with this patch before it lands on tip.

Here's what I propose:

Thoughts?

gopherbot commented 1 month ago

Change https://go.dev/cl/597177 mentions this issue: trace: regenerate to include CL 596355

rhysh commented 1 month ago

That sounds like a fine plan, yes. My own usage depends on x/exp getting the fix, and not at all on it landing in the core repo before the tree opens for Go 1.24.

The severity in cmd/trace is pretty low. Some day I'd like to be able to deep-link into the go tool trace UI, with the ability to target as closely as individual events .. but I'm not doing that yet, so I'm not affected.

I wonder though if all trace parsers should be able to agree on the full details of all events (including timestamp), and thus if there should be a more principled tie-breaker, rather than forcing the other maintainers ( @dominikh , @felixge , maybe others?) to be bug-compatible with this one.

mknyszek commented 1 month ago

and thus if there should be a more principled tie-breaker

I agree, I think that's something that should be documented in a wire format doc (once that exists).

rhysh commented 3 weeks ago

Fixed in x/exp, fixed in internal/trace.