golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

testing: Inconsistent benchmark data when GOMAXPROCS=1 #31599

Open xuanjiazhen opened 5 years ago

xuanjiazhen commented 5 years ago

What version of Go are you using (go version)?

$ go version
go version go1.12.4 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/xuanjiazhen/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/xuanjiazhen/work"
GOPROXY=""
GORACE=""
GOROOT="/Users/xuanjiazhen/Desktop/go"
GOTMPDIR=""
GOTOOLDIR="/Users/xuanjiazhen/Desktop/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/w3/64tyfz3n4cs2jw5lrl7mc5c80000gn/T/go-build723171853=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

$ GOMAXPROCS=1 go test runtime -v -run none -bench ^BenchmarkConvT2ILarge$ -count 10
goos: darwin
goarch: amd64
pkg: runtime
BenchmarkConvT2ILarge   50000000                46.8 ns/op
BenchmarkConvT2ILarge   30000000                46.9 ns/op
BenchmarkConvT2ILarge   50000000                39.5 ns/op
BenchmarkConvT2ILarge   30000000                39.1 ns/op
BenchmarkConvT2ILarge   30000000                47.6 ns/op
BenchmarkConvT2ILarge   30000000                46.8 ns/op
BenchmarkConvT2ILarge   30000000                49.6 ns/op
BenchmarkConvT2ILarge   30000000                43.6 ns/op
BenchmarkConvT2ILarge   30000000                52.2 ns/op
BenchmarkConvT2ILarge   50000000                46.9 ns/op
PASS
ok      runtime 16.874s
$ GOMAXPROCS=1 go test runtime -v -run none -bench ^BenchmarkConvT2ILarge -count 10
goos: darwin
goarch: amd64
pkg: runtime
BenchmarkConvT2ILarge   100000000               19.6 ns/op
BenchmarkConvT2ILarge   100000000               19.8 ns/op
BenchmarkConvT2ILarge   100000000               20.3 ns/op
BenchmarkConvT2ILarge   100000000               21.2 ns/op
BenchmarkConvT2ILarge   100000000               21.0 ns/op
BenchmarkConvT2ILarge   100000000               20.6 ns/op
BenchmarkConvT2ILarge   100000000               21.1 ns/op
BenchmarkConvT2ILarge   100000000               20.6 ns/op
BenchmarkConvT2ILarge   100000000               20.6 ns/op
BenchmarkConvT2ILarge   100000000               20.8 ns/op
PASS
ok      runtime 21.803s

What did you expect to see?

The two invocations should produce consistent benchmark results.

What did you see instead?

Results with the trailing $ anchor (^BenchmarkConvT2ILarge$) are much worse than results without it.

$ GODEBUG=gctrace=1 GOMAXPROCS=1 go test runtime -v -run none -bench ^BenchmarkConvT2ILarge -count 10
...
gc 3984 @22.195s 2%: 0.001+0.12+0.001 ms clock, 0.001+0.12/0/0+0.001 ms cpu, 4->4->0 MB, 5 MB goal, 1 P
gc 3985 @22.202s 2%: 0.003+0.13+0.001 ms clock, 0.003+0.12/0/0+0.001 ms cpu, 4->4->0 MB, 5 MB goal, 1 P
gc 3986 @22.207s 2%: 0.001+0.11+0.001 ms clock, 0.001+0.11/0/0+0.001 ms cpu, 4->4->0 MB, 5 MB goal, 1 P
gc 3987 @22.213s 2%: 0.001+0.11+0 ms clock, 0.001+0.11/0/0+0 ms cpu, 4->4->0 MB, 5 MB goal, 1 P
gc 3988 @22.218s 2%: 0.001+0.12+0.001 ms clock, 0.001+0.12/0/0+0.001 ms cpu, 4->4->0 MB, 5 MB goal, 1 P
gc 3989 @22.224s 2%: 0.001+0.12+0 ms clock, 0.001+0.12/0/0+0 ms cpu, 4->4->0 MB, 5 MB goal, 1 P
gc 3990 @22.230s 2%: 0+0.12+0 ms clock, 0+0.12/0/0+0 ms cpu, 4->4->0 MB, 5 MB goal, 1 P
gc 3991 @22.236s 2%: 0.001+0.14+0.001 ms clock, 0.001+0.14/0/0+0.001 ms cpu, 4->4->0 MB, 5 MB goal, 1 P
100000000               22.8 ns/op
PASS
$ GODEBUG=gctrace=1 GOMAXPROCS=1 go test runtime -v -run none -bench ^BenchmarkConvT2ILarge$ -count 10
...
gc 1052 @18.124s 0%: 0.001+20+0.002 ms clock, 0.001+0.13/0.021/0+0.002 ms cpu, 6->11->5 MB, 10 MB goal, 1 P
gc 1053 @18.146s 0%: 0.001+21+0.001 ms clock, 0.001+0.11/0.013/0+0.001 ms cpu, 6->12->6 MB, 11 MB goal, 1 P
gc 1054 @18.169s 0%: 0.001+19+0.001 ms clock, 0.001+0.11/0.014/0+0.001 ms cpu, 7->12->5 MB, 12 MB goal, 1 P
gc 1055 @18.190s 0%: 0.001+23+0.001 ms clock, 0.001+0.11/0.015/0+0.001 ms cpu, 6->12->6 MB, 10 MB goal, 1 P
gc 1056 @18.216s 0%: 0.002+19+0.001 ms clock, 0.002+0.13/0.013/0+0.001 ms cpu, 7->12->5 MB, 12 MB goal, 1 P
gc 1057 @18.237s 0%: 0.002+19+0.001 ms clock, 0.002+0.13/0.017/0+0.001 ms cpu, 6->11->5 MB, 10 MB goal, 1 P
gc 1058 @18.258s 0%: 0.001+20+0.002 ms clock, 0.001+0.11/0.020/0+0.002 ms cpu, 6->12->6 MB, 11 MB goal, 1 P
gc 1059 @18.281s 0%: 0.001+19+0.001 ms clock, 0.001+0.11/0.016/0+0.001 ms cpu, 7->12->5 MB, 12 MB goal, 1 P
gc 1060 @18.302s 0%: 0.001+19+0.002 ms clock, 0.001+0.11/0.017/0+0.002 ms cpu, 6->11->5 MB, 11 MB goal, 1 P
gc 1061 @18.323s 0%: 0.002+11+0.002 ms clock, 0.002+0.13/0.023/0+0.002 ms cpu, 6->8->2 MB, 10 MB goal, 1 P
50000000                51.7 ns/op
PASS
ok      runtime 18.342s
gc 21 @19.662s 0%: 0.015+1.6+0.002 ms clock, 0.015+0.099/0.52/0.90+0.002 ms cpu, 4->4->1 MB, 5 MB goal, 1 P
ALTree commented 5 years ago

Thanks for the report.

I can't reproduce this using 1.12.4 on linux/amd64, the two results are statistically the same (with vs without $):

$ benchstat with.txt without.txt 

name          old time/op  new time/op  delta
ConvT2ILarge  28.0ns ± 1%  28.0ns ± 1%   ~     (p=0.914 n=9+9)

Can you reproduce this consistently? Are you running the benchmarks on a very quiet machine with no other processes using the cpu?

xuanjiazhen commented 5 years ago

> Thanks for the report.
>
> I can't reproduce this using 1.12.4 on linux/amd64, the two results are statistically the same (with vs without $):
>
> $ benchstat with.txt without.txt
>
> name          old time/op  new time/op  delta
> ConvT2ILarge  28.0ns ± 1%  28.0ns ± 1%   ~     (p=0.914 n=9+9)
>
> Can you reproduce this consistently? Are you running the benchmarks on a very quiet machine with no other processes using the cpu?

Yes, I can reproduce it consistently, not only on my Mac (go1.12.4) but also on x86 and arm64 Linux (go1.11.5). All of the machines are quiet enough.

ALTree commented 5 years ago

Ah, I was able to reproduce this by doing more runs.

It doesn't happen every time, but when it does, the benchmark is either (1) consistently slower, or (2) faster in the first run and slower from the second run onward.

Here are 10 runs with ^BenchmarkConvT2ILarge$ and -count=5:
BenchmarkConvT2ILarge   50000000            27.9 ns/op
BenchmarkConvT2ILarge   30000000            52.5 ns/op
BenchmarkConvT2ILarge   30000000            59.9 ns/op
BenchmarkConvT2ILarge   50000000            40.5 ns/op
BenchmarkConvT2ILarge   30000000            59.6 ns/op

BenchmarkConvT2ILarge   50000000            41.6 ns/op
BenchmarkConvT2ILarge   30000000            59.8 ns/op
BenchmarkConvT2ILarge   50000000            53.0 ns/op
BenchmarkConvT2ILarge   30000000            61.6 ns/op
BenchmarkConvT2ILarge   30000000            61.5 ns/op

BenchmarkConvT2ILarge   50000000            40.9 ns/op
BenchmarkConvT2ILarge   30000000            61.6 ns/op
BenchmarkConvT2ILarge   30000000            61.7 ns/op
BenchmarkConvT2ILarge   30000000            61.6 ns/op
BenchmarkConvT2ILarge   30000000            61.6 ns/op

BenchmarkConvT2ILarge   50000000            27.7 ns/op
BenchmarkConvT2ILarge   50000000            28.2 ns/op
BenchmarkConvT2ILarge   50000000            28.0 ns/op
BenchmarkConvT2ILarge   50000000            27.9 ns/op
BenchmarkConvT2ILarge   50000000            27.9 ns/op

BenchmarkConvT2ILarge   50000000            55.8 ns/op
BenchmarkConvT2ILarge   30000000            61.8 ns/op
BenchmarkConvT2ILarge   30000000            61.9 ns/op
BenchmarkConvT2ILarge   30000000            61.7 ns/op
BenchmarkConvT2ILarge   30000000            59.9 ns/op

BenchmarkConvT2ILarge   50000000            28.0 ns/op
BenchmarkConvT2ILarge   30000000            57.9 ns/op
BenchmarkConvT2ILarge   30000000            61.6 ns/op
BenchmarkConvT2ILarge   30000000            61.9 ns/op
BenchmarkConvT2ILarge   30000000            61.6 ns/op

BenchmarkConvT2ILarge   50000000            27.8 ns/op
BenchmarkConvT2ILarge   50000000            57.2 ns/op
BenchmarkConvT2ILarge   50000000            51.5 ns/op
BenchmarkConvT2ILarge   30000000            61.9 ns/op
BenchmarkConvT2ILarge   30000000            59.7 ns/op

BenchmarkConvT2ILarge   50000000            27.9 ns/op
BenchmarkConvT2ILarge   30000000            52.2 ns/op
BenchmarkConvT2ILarge   30000000            59.5 ns/op
BenchmarkConvT2ILarge   30000000            61.4 ns/op
BenchmarkConvT2ILarge   30000000            61.4 ns/op

BenchmarkConvT2ILarge   50000000            47.5 ns/op
BenchmarkConvT2ILarge   30000000            59.6 ns/op
BenchmarkConvT2ILarge   30000000            59.7 ns/op
BenchmarkConvT2ILarge   50000000            51.5 ns/op
BenchmarkConvT2ILarge   30000000            61.8 ns/op

BenchmarkConvT2ILarge   50000000            27.9 ns/op
BenchmarkConvT2ILarge   50000000            27.7 ns/op
BenchmarkConvT2ILarge   50000000            27.9 ns/op
BenchmarkConvT2ILarge   50000000            27.9 ns/op
BenchmarkConvT2ILarge   50000000            27.7 ns/op

With ^BenchmarkConvT2ILarge (no $ anchor) it is always around 28 ns. This is on 1.12.4.

OTOH, tip is consistent, even with the $, but always at 40ns, which is slower than the 28ns in 1.12.

@xuanjiazhen Any chance you could test with tip (the current master) and confirm that the issue with the benchmark results being bimodal is indeed no longer present?

xuanjiazhen commented 5 years ago

I tried it on the latest master branch. The two invocations now look consistent, but they reached consistency at the poor result! I think it is still not normal. @ALTree
(The iteration counts look a bit strange; is that a new improvement?)

$ go version
go version devel +d17d41e58d Mon Apr 22 14:45:16 2019 +0000 darwin/amd64
$ GOMAXPROCS=1 go test runtime -v -run none -bench ^BenchmarkConvT2ILarge -count 10
goos: darwin
goarch: amd64
pkg: runtime
BenchmarkConvT2ILarge   36258176            32.3 ns/op
BenchmarkConvT2ILarge   38381629            32.3 ns/op
BenchmarkConvT2ILarge   41117751            32.2 ns/op
BenchmarkConvT2ILarge   38234461            32.7 ns/op
BenchmarkConvT2ILarge   42858790            32.2 ns/op
BenchmarkConvT2ILarge   39587468            32.6 ns/op
BenchmarkConvT2ILarge   42929428            32.7 ns/op
BenchmarkConvT2ILarge   37120336            32.4 ns/op
BenchmarkConvT2ILarge   38552698            32.9 ns/op
BenchmarkConvT2ILarge   42029185            32.2 ns/op
PASS
ok      runtime 13.210s
$ GOMAXPROCS=1 go test runtime -v -run none -bench ^BenchmarkConvT2ILarge$ -count 10
goos: darwin
goarch: amd64
pkg: runtime
BenchmarkConvT2ILarge   35498226            32.5 ns/op
BenchmarkConvT2ILarge   41142729            32.2 ns/op
BenchmarkConvT2ILarge   43043185            32.1 ns/op
BenchmarkConvT2ILarge   39090960            32.3 ns/op
BenchmarkConvT2ILarge   43341727            32.4 ns/op
BenchmarkConvT2ILarge   42690354            32.1 ns/op
BenchmarkConvT2ILarge   40104470            32.1 ns/op
BenchmarkConvT2ILarge   41782971            32.2 ns/op
BenchmarkConvT2ILarge   42257203            32.1 ns/op
BenchmarkConvT2ILarge   42138686            32.4 ns/op
PASS
ok      runtime 13.561s
$ GOMAXPROCS=2 go test runtime -v -run none -bench ^BenchmarkConvT2ILarge -count 10
goos: darwin
goarch: amd64
pkg: runtime
BenchmarkConvT2ILarge-2     53497600            20.2 ns/op
BenchmarkConvT2ILarge-2     60137784            20.0 ns/op
BenchmarkConvT2ILarge-2     60011299            20.1 ns/op
BenchmarkConvT2ILarge-2     60673358            20.2 ns/op
BenchmarkConvT2ILarge-2     59793987            20.1 ns/op
BenchmarkConvT2ILarge-2     60232131            20.0 ns/op
BenchmarkConvT2ILarge-2     60023706            20.1 ns/op
BenchmarkConvT2ILarge-2     60491013            20.1 ns/op
BenchmarkConvT2ILarge-2     60667680            19.9 ns/op
BenchmarkConvT2ILarge-2     60528424            20.0 ns/op
PASS
ok      runtime 12.187s
josharian commented 5 years ago

> But they reached consistency at the poor result!

The bimodal distribution depending on the number of iterations is because the benchmark is not fully linear in the number of iterations, often due to GC interactions. This is unfortunate but hard to fix in the general case. Whether you hit the "poor result" or not depends on how many iterations you run. If you run with -benchtime=100ms or -benchtime=10ms or the like, you may get different results.

(The number of times is a bit strange, is it a new improvement?)

Yes. And in fact, one of the goals of that change was to avoid bimodal distributions in non-linear benchmarks. :)