runtime: select is not fair

bench commented 7 years ago

What version of Go are you using (`go version`)?

$ go version
go version go1.9 linux/amd64

Does this issue reproduce with the latest release?

yes, on 1.9

What operating system and processor architecture are you using (`go env`)?

GOARCH="amd64" GOBIN="" GOEXE="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOOS="linux" GOPATH="/home/bchenebault/DEV/git-oab.si.fr.intraorange/BU900102/ceo-be" GORACE="" GOROOT="/usr/local/go" GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64" GCCGO="gccgo" CC="gcc" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build913166031=/tmp/go-build -gno-record-gcc-switches" CXX="g++" CGO_ENABLED="1" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config"

What did you do?

A Go select statement with reflect package function reflect.Select(...) should block until at least one of the cases can proceed and make a uniform pseudo-random choice. It appears that the behavior has changed since returned values are no longer pseudo-random in specific cases only.

Our case is the following : 2 signaling channels preceding 2 "logic" channels https://play.golang.org/p/0TiMtsRk03

But the same version with only one channel preceding the 2 others works https://play.golang.org/p/sy4RjEEjdg

This case perfectly works on go1.8.3 and below.

What did you expect to see?

A pseudo random distribution

What did you see instead?

A strange distribution, surely not pseudo random

ianlancetaylor commented 7 years ago

This is not just using reflect. This similar program shows the same problem using an ordinary select statement.

package main

import (
    "fmt"
)

func main() {
    in1 := make(chan int, 15000)
    in2 := make(chan int, 15000)

    for i := 0; i < 15000; i++ {
        in1 <- 1
    }

    for i := 0; i < 15000; i++ {
        in2 <- 2
    }
    out := make(chan int)

    go func() {
        c1 := make(chan struct{})
        c2 := make(chan struct{})
        for {
            var v int
            select {
            case <-c1:
            case <-c2:
            case v = <-in2:
            case v = <-in1:
            }
            out <- v
        }
    }()

    var nb1, nb2 int = 0, 0

ConsumerLoop:
    for {
        select {
        case d := <-out:
            switch d {
            case 1:
                nb1++
                if nb1 > 5000 { // stop consume. At this point, messages should be the equidistribution
                    break ConsumerLoop
                }
            case 2:
                nb2++
            default:
                panic(d)
            }
        }
    }

    print(nb1, "\n")
    print(nb2, "\n")

    if !areSameOrder(nb1, nb2) {
        fmt.Printf("Should have as many nb1 as nb2, got %d nb1 and %d nb2\n", nb1, nb2)
    }
}

func areSameOrder(a, b int) bool {
    ratio := float32(a) / float32(b)
    return ratio < 1.3 && ratio > 0.7
}

ianlancetaylor commented 7 years ago

It seems to be a problem with fastrandn. If I replace this line in runtime/select.go:

        j := fastrandn(uint32(i + 1))

with the line that was in 1.8:

        j := int(fastrand()) % (i + 1)

then the program runs correctly.

CC @josharian

zigo101 commented 7 years ago

also not fair for

            select {
            case v = <-in2:
            case v = <-in1:
            case v = <-in2:
            case v = <-in1:
            }

the ratio is constantly larger than 1.08.

The ratio for the following is even worse, about 2.0

            select {
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in1:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            case v = <-in2:
            }

randall77 commented 7 years ago

The problem is one of correlation.

package main

import "fmt"

var state uint32 = 1

func fastrand() uint32 {
    fr := state
    mx := uint32(int32(fr)>>31) & 0xa8888eef
    fr = fr<<1 ^ mx
    state = fr
    return fr
}

func fastrandn(n uint32) uint32 {
    return uint32(uint64(fastrand()) * uint64(n) >> 32)
}

func main() {
    const N = 4
    var hist [N][N]int
    for i := 0; i < 1000000; i++ {
        x := fastrandn(N)
        y := fastrandn(N)
        hist[x][y]++
    }
    for i := 0; i < N; i++ {
        for j := 0; j < N; j++ {
            fmt.Printf("%6d ", hist[i][j])
        }
        fmt.Println()
    }
}

Outputs:

124831 124610      0      0 
     0      0 125374 125232 
     0      0 124870 124900 
125083 125100      0      0

So if we get a 0 from fastrandn(4), the next output is guaranteed to be a 0 or a 1. I think this is fundamentally a problem with this RNG when N is a power of two, because the values we return are highly predictive of whether we end up xoring in the feedback or not.

Maybe if we did the feedback in the other direction (shift right, xor based on the low bit), it would help. Or maybe we have to go back to using modulo so we can depend on the low bits.

randall77 commented 7 years ago

CC @funny-falcon

funny-falcon commented 7 years ago

@randall77 looks like it is flaw of LFSR. Old version were not better:

var state uint32 = 1
func fastrand() uint32 {
    fr := state
    fr <<= 1
    fr ^= uint32(int32(fr)>>31) & 0x88888eef
    state = fr
    return fr
}                                                    
func fastrandn(n uint32) uint32 {
        return uint32(uint64(fastrand()) * uint64(n) >> 31)
}

125097 124561      0      0
     0      0 124738 125274
125113 125123      0      0
     0      0 124721 125373

I'm thinking what could be done with.

rasky commented 7 years ago

What about using xorshift instead?

var state uint32 = 2463534242

func fastrand() uint32 {
    fr := state
    fr ^= fr<<13
    fr ^= fr>>17
    fr ^= fr<<5
    state = fr
    return fr
}

Outputs:

 62797  62420  62588  62303 
 62401  62408  61990  62399 
 62207  62601  62742  62515 
 62916  62716  62655  62342

(https://play.golang.org/p/sD9jmKGWJy)

funny-falcon commented 7 years ago

@rasky , xorshift is slower. Otherwise it is good because it gives almost same randomness for high and low bits.

I have other variant for fastrandn:

func fastrandn(n uint32) uint32 {
    // some constants with random bit pattern
    // (this are from hash32.go)
    const m1 = 3168982561
    const m2 = 3339683297
    fr := fastrand()
    fr ^= m1
    fr *= m2
    return uint32(uint64(fr) * uint64(n) >> 32)
}

It also gives fair distribution

100192 100322  99857 100461
 99535 100347 100216 100209
 99533 100023  99899 100376
 99678  99796  99982  99574

And it is faster than xorshift. But it uses multiplication, and doesn't fix cases when raw fastrand used.

https://play.golang.org/p/29n_4OfVQq

funny-falcon commented 7 years ago

Benchmark:

master

$ GOMAXPROCS=1 ../bin/go test runtime -run 'foo' -bench '.*astrand.*'
BenchmarkFastrand               1000000000               2.35 ns/op
BenchmarkFastrandHashiter       50000000                27.3 ns/op
BenchmarkFastrandn/2            1000000000               2.39 ns/op
BenchmarkFastrandn/3            1000000000               2.39 ns/op
BenchmarkFastrandn/4            1000000000               2.39 ns/op
BenchmarkFastrandn/5            1000000000               2.39 ns/op

xorshift

BenchmarkFastrand               500000000                3.16 ns/op
BenchmarkFastrandHashiter       50000000                33.9 ns/op
BenchmarkFastrandn/2            500000000                3.34 ns/op
BenchmarkFastrandn/3            500000000                3.34 ns/op
BenchmarkFastrandn/4            500000000                3.34 ns/op
BenchmarkFastrandn/5            500000000                3.34 ns/op

fastrand in this way:

func fastrand() uint32 {
mp := getg().m
fr := mp.fastrand
mx := uint32(int32(fr)>>31) & 0xa8888eef
fr = fr<<1 ^ mx
mp.fastrand = fr
fr ^= 3168982561
fr *= 3339683297
return fr ^ (fr >> 16)
}

BenchmarkFastrand               1000000000               2.34 ns/op
BenchmarkFastrandHashiter       50000000                34.3 ns/op
BenchmarkFastrandn/2            1000000000               2.56 ns/op
BenchmarkFastrandn/3            1000000000               2.56 ns/op
BenchmarkFastrandn/4            1000000000               2.56 ns/op
BenchmarkFastrandn/5            1000000000               2.56 ns/op

rasky commented 7 years ago

Is 1ns slower important for fastrand? Does it ever use a noticeable amount of cpu in a hot loop? Xorshift is among the fastest with good properties; you can come up with endless number of faster prngs with worse properties, but I fear we might end up in a similar bug some day.

funny-falcon commented 7 years ago

My variant is definitely good for all of fastrand intended uses. And it certainly fixes fastrandn. If you want, I'll pass it through dieharder, and compare with xorshift.

funny-falcon commented 7 years ago

I think, performance should be measured on all supported platforms. Something that is faster on intel, could be slower on Power or Arm.

rasky commented 7 years ago

Yes, xorshift on ARM is basically 3 instructions thanks to barrel shift, hard to beat. I still think that 1ns in fastrand is not going to have any real world impact, though.

rasky commented 7 years ago

By adapting xorshift+ to 32-bit integers, using a triplet picked up by the original xorshift paper, I tested this:

func fastrand_xorshiftplus() uint32 {
        // WARNING: untested for statistical properties
    s0, s1 := state1, state0
    s1 ^= s1 << 5
    s1 = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 13)
    state0, state1 = s0, s1
    return s0 + s1
}

which is faster than the original fastrand on my i7 skylake:

BenchmarkOrig-4             1000000000           2.18 ns/op
BenchmarkXorShift-4         1000000000           2.89 ns/op
BenchmarkXorShiftPlus-4     2000000000           1.94 ns/op

Notice that the triplet choice can be totally wrong, I've just used it to test speed, but we should probably try BigCrush on it, and maybe select a different triplet.

funny-falcon commented 7 years ago

I supposed that 2*32bit version will be faster. But will fastrand increase it's state from 1*32bit to 2*32bit? I hope, yes. Why you didn't take triplet from xorshift pdf ? https://www.jstatsoft.org/article/view/v008i14/xorshift.pdf For n = 64, a, b, c = [10, 13, 10], [8, 9, 22], [2, 7, 3], [23, 3, 24]

s1 ^= s1 << a
s1 ^= s0 ^ (s1 >> b) ^ (s0 >> c)

funny-falcon commented 7 years ago

Looks like tripple [10, 13, 10] doesn't pass diehard. Probably, it was mistaken. Triplet [10, 10, 13] passes, but could not be trusted.

rasky commented 7 years ago

I did take it from xorshift's pdf: [5,17,13] is one of them, valid for 32-bit numbers. The period of this xorshift+ 32-bit version (with two 32-bit words of state) should be 2^64-1.

funny-falcon commented 7 years ago

5,17,13 - is for 1*32 bit generator. Look at "3.1 Binary vector spaces of dimension n = 96, 128, 160 . . .." in that pdf

funny-falcon commented 7 years ago

shifts for 1*32bit version does not produce full period for 2*32bit generator. I know what I'm talking about, please believe me.

rasky commented 7 years ago

Yes, I see, you're right.

funny-falcon commented 7 years ago

I had an experince with running xorshift framework to generate all correct triplets. It is not too hard. At least, pdf tripplets should be checked, cause [10,13,10] is cleanly wrong (it has quite short period with seed 1,0).

rasky commented 7 years ago

I tried [23,3,24], it passes SmallCrush but not Crush:

       Test                          p-value
 ----------------------------------------------
  1  SerialOver, t = 2                eps
  2  SerialOver, t = 4                eps
  6  CollisionOver, t = 4           2.1e-25
  7  CollisionOver, t = 8           7.1e-63
 12  BirthdaySpacings, t = 3       1.2e-273
 13  BirthdaySpacings, t = 4        8.3e-26
 15  BirthdaySpacings, t = 7          eps
 16  BirthdaySpacings, t = 8          eps
 20  ClosePairs NP, t = 7            4.9e-7
 20  ClosePairs mNP, t = 7          1.9e-27
 20  ClosePairs mNP1, t = 7         7.1e-30
 20  ClosePairs mNP2, t = 7         1.1e-10
 20  ClosePairs NJumps, t = 7       3.1e-34
 36  Run of U01, r = 15               eps
 38  Permutation, r = 15              eps
 65  RandomWalk1 H (L = 90)         3.3e-16
 65  RandomWalk1 M (L = 90)          1.5e-5
 72  LinearComp, r = 29             1 - eps1
 87  HammingIndep, L = 300           2.1e-4
 ----------------------------------------------
 All other tests were passed

(full log)

funny-falcon commented 7 years ago

I was mistaken for [10,13,10] - i got error in my test program. So triplet from pdf could be trusted. I vote for [10,13,10] triplet. All triplets for xorshift64 using 2*32bit state found by xorshift framework You see: it contains all example triplets from pdf. And I think, there is no need in xorshift+, because xorshift is already enough for fastrand use case.

funny-falcon commented 7 years ago

With xorshift64 (2*32) I got following (number differs from test above cause it is other computer): master

goos: linux
goarch: amd64
pkg: runtime
BenchmarkFastrand           500000000            3.86 ns/op
BenchmarkFastrandHashiter   30000000            43.0 ns/op
BenchmarkFastrandn/2        500000000            3.72 ns/op
BenchmarkFastrandn/3        500000000            3.73 ns/op
BenchmarkFastrandn/4        500000000            3.72 ns/op
BenchmarkFastrandn/5        500000000            3.73 ns/op
PASS
ok      runtime 12.628s

xorshift

goos: linux
goarch: amd64
pkg: runtime
BenchmarkFastrand           500000000            3.73 ns/op
BenchmarkFastrandHashiter   30000000            53.1 ns/op
BenchmarkFastrandn/2        500000000            3.89 ns/op
BenchmarkFastrandn/3        500000000            3.88 ns/op
BenchmarkFastrandn/4        500000000            3.89 ns/op
BenchmarkFastrandn/5        500000000            3.88 ns/op
PASS
ok      runtime 13.250s

xorshift64+ is just a bit slower than xorshift64, but I don't think we need that "extra quality".

I leave priority to open changset to @rasky.

gopherbot commented 7 years ago

Change https://golang.org/cl/62530 mentions this issue: runtime: improve fastrand with a better generator

mundaym commented 7 years ago

Reopening for backport to 1.9.1.

rasky commented 7 years ago

@ianlancetaylor @randall77 should I backport the patch as-is or you prefer a simpler version (and if so, what would you simplify)?

randall77 commented 7 years ago

I like the CL as is.

gopherbot commented 7 years ago

Change https://golang.org/cl/64193 mentions this issue: [release-branch.go1.9] runtime: improve fastrand with a better generator

funny-falcon commented 6 years ago

Fix for sema.go https://go-review.googlesource.com/c/go/+/68050

gopherbot commented 6 years ago

Change https://golang.org/cl/68050 mentions this issue: runtime: fix using fastrand in sema.go

rsc commented 6 years ago

If we cherry-pick CL 64193 then we must also cherry-pick CL 68050, because CL 64193 breaks a different property of fastrand.

I'm pretty skeptical of wholesale-replacing the random number generator and then putting it into a point release when we just found a serious problem with it 9 days ago.

Maybe for Go 1.9.2 we should just fix fastrandn to use a % again. The move away from % got this performance win:

    name                    old time/op  new time/op  delta
    SelectUncontended-8     33.7ns ± 2%  33.9ns ± 2%  +0.70%  (p=0.000 n=49+50)
    SelectSyncContended-8   1.68µs ± 4%  1.65µs ± 4%  -1.54%  (p=0.000 n=50+45)
    SelectAsyncContended-8   282ns ± 1%   277ns ± 1%  -1.50%  (p=0.000 n=48+43)
    SelectNonblock-8        5.31ns ± 1%  5.32ns ± 1%    ~     (p=0.275 n=45+44)
    SelectProdCons-8         585ns ± 3%   577ns ± 2%  -1.35%  (p=0.000 n=50+50)
    GoroutineSelect-8       1.59ms ± 2%  1.59ms ± 1%    ~     (p=0.084 n=49+48)

I'm happy to give that back for Go 1.9.2. The uses of fastrandn elsewhere in the runtime also look like they would appreciate not having terrible correlations, so I don't think we should change only select's use.

rsc commented 6 years ago

Closing this and remilestoning Go 1.10. The fix for Go 1.9.2 is #22253.

golang / go