Closed bench closed 6 years ago
This is not just using reflect. This similar program shows the same problem using an ordinary select
statement.
package main
import (
"fmt"
)
func main() {
in1 := make(chan int, 15000)
in2 := make(chan int, 15000)
for i := 0; i < 15000; i++ {
in1 <- 1
}
for i := 0; i < 15000; i++ {
in2 <- 2
}
out := make(chan int)
go func() {
c1 := make(chan struct{})
c2 := make(chan struct{})
for {
var v int
select {
case <-c1:
case <-c2:
case v = <-in2:
case v = <-in1:
}
out <- v
}
}()
var nb1, nb2 int = 0, 0
ConsumerLoop:
for {
select {
case d := <-out:
switch d {
case 1:
nb1++
if nb1 > 5000 { // stop consume. At this point, messages should be the equidistribution
break ConsumerLoop
}
case 2:
nb2++
default:
panic(d)
}
}
}
print(nb1, "\n")
print(nb2, "\n")
if !areSameOrder(nb1, nb2) {
fmt.Printf("Should have as many nb1 as nb2, got %d nb1 and %d nb2\n", nb1, nb2)
}
}
func areSameOrder(a, b int) bool {
ratio := float32(a) / float32(b)
return ratio < 1.3 && ratio > 0.7
}
It seems to be a problem with fastrandn
. If I replace this line in runtime/select.go:
j := fastrandn(uint32(i + 1))
with the line that was in 1.8:
j := int(fastrand()) % (i + 1)
then the program runs correctly.
CC @josharian
also not fair for
select {
case v = <-in2:
case v = <-in1:
case v = <-in2:
case v = <-in1:
}
the ratio is constantly larger than 1.08.
The ratio for the following is even worse, about 2.0
select {
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in1:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
case v = <-in2:
}
The problem is one of correlation.
package main
import "fmt"
var state uint32 = 1
func fastrand() uint32 {
fr := state
mx := uint32(int32(fr)>>31) & 0xa8888eef
fr = fr<<1 ^ mx
state = fr
return fr
}
func fastrandn(n uint32) uint32 {
return uint32(uint64(fastrand()) * uint64(n) >> 32)
}
func main() {
const N = 4
var hist [N][N]int
for i := 0; i < 1000000; i++ {
x := fastrandn(N)
y := fastrandn(N)
hist[x][y]++
}
for i := 0; i < N; i++ {
for j := 0; j < N; j++ {
fmt.Printf("%6d ", hist[i][j])
}
fmt.Println()
}
}
Outputs:
124831 124610 0 0
0 0 125374 125232
0 0 124870 124900
125083 125100 0 0
So if we get a 0
from fastrandn(4)
, the next output is guaranteed to be a 0
or a 1
.
I think this is fundamentally a problem with this RNG when N is a power of two, because the values we return are highly predictive of whether we end up xoring in the feedback or not.
Maybe if we did the feedback in the other direction (shift right, xor based on the low bit), it would help. Or maybe we have to go back to using modulo so we can depend on the low bits.
CC @funny-falcon
@randall77 looks like it is flaw of LFSR. Old version were not better:
var state uint32 = 1
func fastrand() uint32 {
fr := state
fr <<= 1
fr ^= uint32(int32(fr)>>31) & 0x88888eef
state = fr
return fr
}
func fastrandn(n uint32) uint32 {
return uint32(uint64(fastrand()) * uint64(n) >> 31)
}
125097 124561 0 0
0 0 124738 125274
125113 125123 0 0
0 0 124721 125373
I'm thinking what could be done with.
What about using xorshift instead?
var state uint32 = 2463534242
func fastrand() uint32 {
fr := state
fr ^= fr<<13
fr ^= fr>>17
fr ^= fr<<5
state = fr
return fr
}
Outputs:
62797 62420 62588 62303
62401 62408 61990 62399
62207 62601 62742 62515
62916 62716 62655 62342
@rasky , xorshift is slower. Otherwise it is good because it gives almost same randomness for high and low bits.
I have other variant for fastrandn
:
func fastrandn(n uint32) uint32 {
// some constants with random bit pattern
// (this are from hash32.go)
const m1 = 3168982561
const m2 = 3339683297
fr := fastrand()
fr ^= m1
fr *= m2
return uint32(uint64(fr) * uint64(n) >> 32)
}
It also gives fair distribution
100192 100322 99857 100461
99535 100347 100216 100209
99533 100023 99899 100376
99678 99796 99982 99574
And it is faster than xorshift.
But it uses multiplication, and doesn't fix cases when raw fastrand
used.
Benchmark:
$ GOMAXPROCS=1 ../bin/go test runtime -run 'foo' -bench '.*astrand.*'
BenchmarkFastrand 1000000000 2.35 ns/op
BenchmarkFastrandHashiter 50000000 27.3 ns/op
BenchmarkFastrandn/2 1000000000 2.39 ns/op
BenchmarkFastrandn/3 1000000000 2.39 ns/op
BenchmarkFastrandn/4 1000000000 2.39 ns/op
BenchmarkFastrandn/5 1000000000 2.39 ns/op
BenchmarkFastrand 500000000 3.16 ns/op
BenchmarkFastrandHashiter 50000000 33.9 ns/op
BenchmarkFastrandn/2 500000000 3.34 ns/op
BenchmarkFastrandn/3 500000000 3.34 ns/op
BenchmarkFastrandn/4 500000000 3.34 ns/op
BenchmarkFastrandn/5 500000000 3.34 ns/op
fastrand
in this way:
func fastrand() uint32 {
mp := getg().m
fr := mp.fastrand
mx := uint32(int32(fr)>>31) & 0xa8888eef
fr = fr<<1 ^ mx
mp.fastrand = fr
fr ^= 3168982561
fr *= 3339683297
return fr ^ (fr >> 16)
}
BenchmarkFastrand 1000000000 2.34 ns/op
BenchmarkFastrandHashiter 50000000 34.3 ns/op
BenchmarkFastrandn/2 1000000000 2.56 ns/op
BenchmarkFastrandn/3 1000000000 2.56 ns/op
BenchmarkFastrandn/4 1000000000 2.56 ns/op
BenchmarkFastrandn/5 1000000000 2.56 ns/op
Is 1ns slower important for fastrand? Does it ever use a noticeable amount of cpu in a hot loop? Xorshift is among the fastest with good properties; you can come up with endless number of faster prngs with worse properties, but I fear we might end up in a similar bug some day.
My variant is definitely good for all of fastrand
intended uses. And it certainly fixes fastrandn
.
If you want, I'll pass it through dieharder, and compare with xorshift.
I think, performance should be measured on all supported platforms. Something that is faster on intel, could be slower on Power or Arm.
Yes, xorshift on ARM is basically 3 instructions thanks to barrel shift, hard to beat. I still think that 1ns in fastrand is not going to have any real world impact, though.
By adapting xorshift+ to 32-bit integers, using a triplet picked up by the original xorshift paper, I tested this:
func fastrand_xorshiftplus() uint32 {
// WARNING: untested for statistical properties
s0, s1 := state1, state0
s1 ^= s1 << 5
s1 = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 13)
state0, state1 = s0, s1
return s0 + s1
}
which is faster than the original fastrand on my i7 skylake:
BenchmarkOrig-4 1000000000 2.18 ns/op
BenchmarkXorShift-4 1000000000 2.89 ns/op
BenchmarkXorShiftPlus-4 2000000000 1.94 ns/op
Notice that the triplet choice can be totally wrong, I've just used it to test speed, but we should probably try BigCrush on it, and maybe select a different triplet.
I supposed that 2*32bit
version will be faster. But will fastrand increase it's state from 1*32bit
to 2*32bit
?
I hope, yes.
Why you didn't take triplet from xorshift pdf ? https://www.jstatsoft.org/article/view/v008i14/xorshift.pdf
For n = 64, a, b, c = [10, 13, 10], [8, 9, 22], [2, 7, 3], [23, 3, 24]
s1 ^= s1 << a
s1 ^= s0 ^ (s1 >> b) ^ (s0 >> c)
Looks like tripple [10, 13, 10] doesn't pass diehard. Probably, it was mistaken. Triplet [10, 10, 13] passes, but could not be trusted.
I did take it from xorshift's pdf: [5,17,13] is one of them, valid for 32-bit numbers. The period of this xorshift+ 32-bit version (with two 32-bit words of state) should be 2^64-1.
5,17,13 - is for 1*32 bit generator. Look at "3.1 Binary vector spaces of dimension n = 96, 128, 160 . . .." in that pdf
shifts for 1*32bit
version does not produce full period for 2*32bit
generator. I know what I'm talking about, please believe me.
Yes, I see, you're right.
I had an experince with running xorshift framework to generate all correct triplets. It is not too hard. At least, pdf tripplets should be checked, cause [10,13,10] is cleanly wrong (it has quite short period with seed 1,0).
I tried [23,3,24], it passes SmallCrush but not Crush:
Test p-value
----------------------------------------------
1 SerialOver, t = 2 eps
2 SerialOver, t = 4 eps
6 CollisionOver, t = 4 2.1e-25
7 CollisionOver, t = 8 7.1e-63
12 BirthdaySpacings, t = 3 1.2e-273
13 BirthdaySpacings, t = 4 8.3e-26
15 BirthdaySpacings, t = 7 eps
16 BirthdaySpacings, t = 8 eps
20 ClosePairs NP, t = 7 4.9e-7
20 ClosePairs mNP, t = 7 1.9e-27
20 ClosePairs mNP1, t = 7 7.1e-30
20 ClosePairs mNP2, t = 7 1.1e-10
20 ClosePairs NJumps, t = 7 3.1e-34
36 Run of U01, r = 15 eps
38 Permutation, r = 15 eps
65 RandomWalk1 H (L = 90) 3.3e-16
65 RandomWalk1 M (L = 90) 1.5e-5
72 LinearComp, r = 29 1 - eps1
87 HammingIndep, L = 300 2.1e-4
----------------------------------------------
All other tests were passed
(full log)
I was mistaken for [10,13,10] - i got error in my test program.
So triplet from pdf could be trusted.
I vote for [10,13,10] triplet.
All triplets for xorshift64 using 2*32bit
state found by xorshift framework
You see: it contains all example triplets from pdf.
And I think, there is no need in xorshift+, because xorshift is already enough for fastrand use case.
With xorshift64 (2*32
) I got following (number differs from test above cause it is other computer):
master
goos: linux
goarch: amd64
pkg: runtime
BenchmarkFastrand 500000000 3.86 ns/op
BenchmarkFastrandHashiter 30000000 43.0 ns/op
BenchmarkFastrandn/2 500000000 3.72 ns/op
BenchmarkFastrandn/3 500000000 3.73 ns/op
BenchmarkFastrandn/4 500000000 3.72 ns/op
BenchmarkFastrandn/5 500000000 3.73 ns/op
PASS
ok runtime 12.628s
xorshift
goos: linux
goarch: amd64
pkg: runtime
BenchmarkFastrand 500000000 3.73 ns/op
BenchmarkFastrandHashiter 30000000 53.1 ns/op
BenchmarkFastrandn/2 500000000 3.89 ns/op
BenchmarkFastrandn/3 500000000 3.88 ns/op
BenchmarkFastrandn/4 500000000 3.89 ns/op
BenchmarkFastrandn/5 500000000 3.88 ns/op
PASS
ok runtime 13.250s
xorshift64+ is just a bit slower than xorshift64, but I don't think we need that "extra quality".
I leave priority to open changset to @rasky.
Change https://golang.org/cl/62530 mentions this issue: runtime: improve fastrand with a better generator
Reopening for backport to 1.9.1.
@ianlancetaylor @randall77 should I backport the patch as-is or you prefer a simpler version (and if so, what would you simplify)?
I like the CL as is.
Change https://golang.org/cl/64193 mentions this issue: [release-branch.go1.9] runtime: improve fastrand with a better generator
Fix for sema.go https://go-review.googlesource.com/c/go/+/68050
Change https://golang.org/cl/68050 mentions this issue: runtime: fix using fastrand in sema.go
If we cherry-pick CL 64193 then we must also cherry-pick CL 68050, because CL 64193 breaks a different property of fastrand.
I'm pretty skeptical of wholesale-replacing the random number generator and then putting it into a point release when we just found a serious problem with it 9 days ago.
Maybe for Go 1.9.2 we should just fix fastrandn to use a % again. The move away from % got this performance win:
name old time/op new time/op delta
SelectUncontended-8 33.7ns ± 2% 33.9ns ± 2% +0.70% (p=0.000 n=49+50)
SelectSyncContended-8 1.68µs ± 4% 1.65µs ± 4% -1.54% (p=0.000 n=50+45)
SelectAsyncContended-8 282ns ± 1% 277ns ± 1% -1.50% (p=0.000 n=48+43)
SelectNonblock-8 5.31ns ± 1% 5.32ns ± 1% ~ (p=0.275 n=45+44)
SelectProdCons-8 585ns ± 3% 577ns ± 2% -1.35% (p=0.000 n=50+50)
GoroutineSelect-8 1.59ms ± 2% 1.59ms ± 1% ~ (p=0.084 n=49+48)
I'm happy to give that back for Go 1.9.2. The uses of fastrandn elsewhere in the runtime also look like they would appreciate not having terrible correlations, so I don't think we should change only select's use.
Closing this and remilestoning Go 1.10. The fix for Go 1.9.2 is #22253.
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
yes, on 1.9
What operating system and processor architecture are you using (
go env
)?GOARCH="amd64" GOBIN="" GOEXE="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOOS="linux" GOPATH="/home/bchenebault/DEV/git-oab.si.fr.intraorange/BU900102/ceo-be" GORACE="" GOROOT="/usr/local/go" GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64" GCCGO="gccgo" CC="gcc" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build913166031=/tmp/go-build -gno-record-gcc-switches" CXX="g++" CGO_ENABLED="1" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config"
What did you do?
A Go select statement with reflect package function
reflect.Select(...)
should block until at least one of the cases can proceed and make a uniform pseudo-random choice. It appears that the behavior has changed since returned values are no longer pseudo-random in specific cases only.Our case is the following : 2 signaling channels preceding 2 "logic" channels https://play.golang.org/p/0TiMtsRk03
But the same version with only one channel preceding the 2 others works https://play.golang.org/p/sy4RjEEjdg
This case perfectly works on go1.8.3 and below.
What did you expect to see?
A pseudo random distribution
What did you see instead?
A strange distribution, surely not pseudo random