golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) #52652

Open alphadose opened 2 years ago

alphadose commented 2 years ago

The current implementation of channels could use some improvements in terms of ns/op, B/op, and allocs/op.

I have made a proof-of-concept (POC) thread-safe queue which performs better on all 3 of the above metrics: https://github.com/alphadose/ZenQ

Here are the benchmarks vs native channels https://github.com/alphadose/ZenQ#benchmarks for a wide range of use cases

My proposal is to make the underlying implementation of native channels more efficient by changing the algorithm and data structure to the lock-free one used in ZenQ, or to some other suitable implementation (debatable).

ianlancetaylor commented 2 years ago

I assume that there are no API changes here, so taking this out of the proposal process.

@alphadose As I write this you have not signed the CLA, so we are unable to look at your code. Are you willing to sign the CLA as described at https://go.dev/doc/contribute#cla ? Thanks.

alphadose commented 2 years ago

I have submitted my CLA

ianlancetaylor commented 2 years ago

Thanks!

ianlancetaylor commented 2 years ago

I took a very quick look. Channels aren't just thread-safe queues; they also have to handle the select statement and the close function. Can we retain the speed improvements while still implementing all required features?
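
For reference, here is a tiny sketch of the close semantics a replacement would also have to preserve: receives on a closed channel first drain any buffered values and then return the zero value with ok == false.

package main

import "fmt"

func main() {
    ch := make(chan int, 2)
    ch <- 1
    ch <- 2
    close(ch)

    for {
        v, ok := <-ch // after close: drains 1, 2, then returns 0, false
        if !ok {
            fmt.Println("channel closed and drained")
            return
        }
        fmt.Println("received", v)
    }
}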

CC @golang/runtime

alphadose commented 2 years ago

Yep, that's exactly what I want:

Syntactic sugar of channels + awesome perf.

Since channels are used a lot, this might help people out.

alphadose commented 2 years ago

PS: this is just a POC; as you said, it needs to be fully compatible with the linked methods in https://go.dev/src/runtime/chan.go

egonelbre commented 2 years ago

Similarly, the implementation busy-spins when it waits, which is not suitable for the default implementation.

Also, there's a separate issue for having lock-free channels https://github.com/golang/go/issues/8899

alphadose commented 2 years ago

The spinning is applied on a slot in the queue whose size can range from 2 to 2^64 - 1

Due to the lack of a global lock, spinning is used to contest slots, each of which has its own state. Because of the large number of slots there are few collisions between the reader and writer indexes, and much less time is spent in compare-and-swap loops because each lock/state has its own memory address. But you are right: if there are a lot of concurrent writer goroutines (1 million or more), it is better to use exponential backoff or just sleep/park the goroutine to reduce slot contention and optimize even more.
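
To make the idea concrete, here is a minimal sketch (illustrative only, not ZenQ's actual code) of per-slot states contested with CAS, where each writer spins only on the slot it claimed:

package queue

import "sync/atomic"

// Per-slot states; the names are illustrative only.
const (
    slotEmpty uint32 = iota
    slotBusy
    slotCommitted
)

type slot[T any] struct {
    state uint32 // one atomically-updated state word per slot
    item  T
}

type ring[T any] struct {
    slots      []slot[T] // length must be a power of 2 for the index mask below
    writeIndex uint64    // atomically incremented to claim slots
}

// Write claims the next slot and contends only on that slot's state word,
// so writers on different slots never touch the same memory address.
func (r *ring[T]) Write(v T) {
    idx := atomic.AddUint64(&r.writeIndex, 1) - 1
    s := &r.slots[idx&uint64(len(r.slots)-1)]
    for !atomic.CompareAndSwapUint32(&s.state, slotEmpty, slotBusy) {
        // busy-spin on this slot only; a reader resets it to slotEmpty
    }
    s.item = v
    atomic.StoreUint32(&s.state, slotCommitted)
}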

As for https://github.com/golang/go/issues/8899, my implementation has lower allocs/op than that lock-free buffer as well as the native channel, and it is also a lot faster.

egonelbre commented 2 years ago

You need to implement select, select fairness, parking and closing etc. before you can draw any conclusions on the performance. Of course, you can make a faster implementation when you discard some of the requirements.

Similarly, run the benchmarks that are in https://github.com/golang/go/blob/master/src/runtime/chan_test.go. You also need to show benchmarks such as 1M blocked writers; if you are not demonstrating where your code is worse, then it's not a fair evaluation.
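
For concreteness, a benchmark along these lines could be sketched as below (the name and sizes are illustrative assumptions, not necessarily the exact benchmark used later in this thread):

package chanbench

import (
    "sync"
    "testing"
)

// Benchmark_Chan_Million_Blocking_Writers parks one million writer goroutines
// on a small buffered channel and then drains them, so the cost being measured
// is blocking/unblocking that many writers rather than raw throughput.
func Benchmark_Chan_Million_Blocking_Writers(b *testing.B) {
    const writers = 1_000_000
    for n := 0; n < b.N; n++ {
        ch := make(chan int, 1024)
        var wg sync.WaitGroup
        wg.Add(writers)
        for w := 0; w < writers; w++ {
            go func() {
                defer wg.Done()
                ch <- 1 // most writers block here until the reader drains the channel
            }()
        }
        for r := 0; r < writers; r++ {
            <-ch
        }
        wg.Wait()
    }
}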

alphadose commented 2 years ago

Got it, I will write another benchmark to replicate the 1M blocked writers scenario for both channels and ZenQ and note down the resource utilization.

alphadose commented 2 years ago

After that, we can have a look at how to implement a select{}-like mechanism across multiple ZenQs.

egonelbre commented 2 years ago

A note on benchmarking: please see https://pkg.go.dev/golang.org/x/tools/cmd/benchcmp and http://www.inanzzz.com/index.php/post/yz8n/using-golang-bench-benchstat-and-benchcmp-to-measure-performance. You need statistical significance when measuring.

I also recommend reading through the lock-free channel issue fully, including the design documents and previous pull requests, before continuing. That way you will avoid some topics that have come up before.

alphadose commented 2 years ago

@egonelbre I was able to implement parking thereby saving a ton of goroutines, and now ZenQ is more efficient than channels even in the case of 1 million blocking goroutines

Here are the latest benchmarks run over 30 cases

name                                     time/op
_Chan_NumWriters1_InputSize600-8         24.6µs ± 1%
_ZenQ_NumWriters1_InputSize600-8         16.5µs ± 1%
_Chan_NumWriters3_InputSize60000-8       6.21ms ± 2%
_ZenQ_NumWriters3_InputSize60000-8       2.85ms ± 0%
_Chan_NumWriters8_InputSize6000000-8      735ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8      417ms ± 0%
_Chan_NumWriters100_InputSize6000000-8    1.61s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8    741ms ± 3%
_Chan_NumWriters1000_InputSize7000000-8   1.98s ± 0%
_ZenQ_NumWriters1000_InputSize7000000-8   1.05s ± 1%
_Chan_Million_Blocking_Writers-8          10.0s ±13%
_ZenQ_Million_Blocking_Writers-8          7.01s ±44%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8          0.00B
_ZenQ_NumWriters1_InputSize600-8          0.00B
_Chan_NumWriters3_InputSize60000-8         106B ±88%
_ZenQ_NumWriters3_InputSize60000-8       28.9B ±111%
_Chan_NumWriters8_InputSize6000000-8      946B ±267%
_ZenQ_NumWriters8_InputSize6000000-8      885B ±163%
_Chan_NumWriters100_InputSize6000000-8   46.7kB ±25%
_ZenQ_NumWriters100_InputSize6000000-8   16.2kB ±66%
_Chan_NumWriters1000_InputSize7000000-8   484kB ±10%
_ZenQ_NumWriters1000_InputSize7000000-8  62.4kB ±82%
_Chan_Million_Blocking_Writers-8          553MB ± 0%
_ZenQ_Million_Blocking_Writers-8         95.9MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8           0.00
_ZenQ_NumWriters1_InputSize600-8           0.00
_Chan_NumWriters3_InputSize60000-8         0.00
_ZenQ_NumWriters3_InputSize60000-8         0.00
_Chan_NumWriters8_InputSize6000000-8      3.07 ±193%
_ZenQ_NumWriters8_InputSize6000000-8      2.07 ±142%
_Chan_NumWriters100_InputSize6000000-8      166 ±15%
_ZenQ_NumWriters100_InputSize6000000-8     53.5 ±50%
_Chan_NumWriters1000_InputSize7000000-8   1.74k ± 7%
_ZenQ_NumWriters1000_InputSize7000000-8     525 ±39%
_Chan_Million_Blocking_Writers-8          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8          1.00M ± 0%

Further improvements:-

More improvements can be made by using runtime-internal calls like goparkunlock(), getg(), and goready() to schedule goroutines more effectively.

But this requires assembly stubs to load the *g pointer. Here is my current implementation, please have a look at it (https://github.com/alphadose/ZenQ/blob/main/lib_runtime_linkage.go) and suggest if there are any better methods.
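
For readers following along, this is roughly the go:linkname pattern that file uses. It is a simplified sketch; the runtime signatures shown here are assumptions and change between Go versions (gopark in particular also takes a wait reason and trace arguments), so it is illustrative only, not a drop-in implementation.

package zenq

import "unsafe" // needed for unsafe.Pointer and to enable go:linkname

// goready is linked to runtime.goready, which marks a parked goroutine as
// runnable again. The runtime's real parameter is a *g; since *g is not
// exported, it is smuggled through an unsafe.Pointer here. This signature is
// an assumption based on current runtime source and may change.
//
//go:linkname goready runtime.goready
func goready(goroutinePtr unsafe.Pointer, traceskip int)

// GetG returns the current goroutine's g pointer. There is no exported API
// for this, so the body has to live in a small per-architecture assembly stub
// that loads g from thread-local storage (not shown here).
//
// NOTE: the package also needs a (possibly empty) .s file so the compiler
// accepts these body-less declarations.
func GetG() unsafe.Pointer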

alphadose commented 2 years ago

@egonelbre I was able to implement select{}, which does fair selection based on the least number of reads among various queues, implementation -> here

Here is an example code demonstrating its usage

package main

import (
    "fmt"

    "github.com/alphadose/zenq"
)

type custom1 struct {
    alpha int
    beta  string
}

type custom2 struct {
    gamma int
}

var (
    zq1 = zenq.New[int]()
    zq2 = zenq.New[string]()
    zq3 = zenq.New[custom1]()
    zq4 = zenq.New[*custom2]()
)

func main() {
    go looper(intProducer)
    go looper(stringProducer)
    go looper(custom1Producer)
    go looper(custom2Producer)

    for i := 0; i < 40; i++ {

        // Selection occurs here
        switch data := zenq.Select(zq1, zq2, zq3, zq4).(type) {
        case int:
            fmt.Printf("Received int %d\n", data)
        case string:
            fmt.Printf("Received string %s\n", data)
        case custom1:
            fmt.Printf("Received custom data type number 1 %#v\n", data)
        case *custom2:
            fmt.Printf("Received pointer %#v\n", data)
        }
    }
}

func intProducer(ctr int) { zq1.Write(ctr) }

func stringProducer(ctr int) { zq2.Write(fmt.Sprint(ctr * 10)) }

func custom1Producer(ctr int) { zq3.Write(custom1{alpha: ctr, beta: fmt.Sprint(ctr)}) }

func custom2Producer(ctr int) { zq4.Write(&custom2{gamma: 1 << ctr}) }

func looper(producer func(ctr int)) {
    for i := 0; i < 10; i++ {
        producer(i)
    }
}
egonelbre commented 2 years ago

@alphadose it's a polling implementation, which it should not be. The select can also lock up when there are multiple selects on the same channel, even when a usual select would be able to proceed.

Similarly, the fairness guarantee doesn't seem to hold:

package main

import (
    "fmt"

    "github.com/alphadose/zenq"
)

func main() {
    zq1 := zenq.New[int32]()
    zq2 := zenq.New[uint32]()

    go func() { for { zq1.Write(1) } }()
    go func() { for { zq2.Write(2) } }()

    for i := 0; i < 1e3; i++ { zq1.Read() }

    for i := 0; i < 100; i++ {
        data := zenq.Select(zq1, zq2)
        fmt.Println(data) // this currently only prints 2; however, the select should be fair between zq1 and zq2
    }
}

I'm all for implementing custom primitives for niche cases; however, I would recommend trying to improve the performance of the current implementation rather than reimplementing all the minutiae from scratch.

alphadose commented 2 years ago

Agreed regarding the fairness.

As for performance, I am currently writing assembly to load the *g pointer from thread-local storage and will then benchmark the parking/unparking implementation; hopefully we can see a good improvement.

alphadose commented 2 years ago

@egonelbre I was able to get a good improvement in performance by using gopark() and goready(). Here is the comparison benchmark against the old algorithm, run over 30 cases with queue_size = 4096

name                                     old time/op    new time/op    delta
_ZenQ_NumWriters1_InputSize600-8           16.5µs ± 1%    17.9µs ± 1%   +8.65%  (p=0.000 n=28+29)
_ZenQ_NumWriters3_InputSize60000-8         2.85ms ± 0%    2.67ms ± 6%   -6.11%  (p=0.000 n=23+30)
_ZenQ_NumWriters8_InputSize6000000-8        417ms ± 0%     313ms ± 5%  -24.83%  (p=0.000 n=23+29)
_ZenQ_NumWriters100_InputSize6000000-8      741ms ± 3%     516ms ± 2%  -30.40%  (p=0.000 n=29+30)
_ZenQ_NumWriters1000_InputSize7000000-8     1.05s ± 1%     0.45s ± 9%  -57.58%  (p=0.000 n=28+30)
_ZenQ_Million_Blocking_Writers-8            7.01s ±44%    10.98s ± 4%  +56.54%  (p=0.000 n=30+28)

name                                     old alloc/op   new alloc/op   delta
_ZenQ_NumWriters1_InputSize600-8            0.00B          0.00B          ~     (all equal)
_ZenQ_NumWriters3_InputSize60000-8         28.9B ±111%    34.8B ±127%     ~     (p=0.268 n=30+29)
_ZenQ_NumWriters8_InputSize6000000-8        885B ±163%     671B ±222%     ~     (p=0.208 n=30+30)
_ZenQ_NumWriters100_InputSize6000000-8     16.2kB ±66%   13.3kB ±100%     ~     (p=0.072 n=30+30)
_ZenQ_NumWriters1000_InputSize7000000-8    62.4kB ±82%    2.4kB ±210%  -96.20%  (p=0.000 n=30+30)
_ZenQ_Million_Blocking_Writers-8           95.9MB ± 0%    95.5MB ± 0%   -0.41%  (p=0.000 n=28+30)

name                                     old allocs/op  new allocs/op  delta
_ZenQ_NumWriters1_InputSize600-8             0.00           0.00          ~     (all equal)
_ZenQ_NumWriters3_InputSize60000-8           0.00           0.00          ~     (all equal)
_ZenQ_NumWriters8_InputSize6000000-8        2.07 ±142%     1.40 ±186%     ~     (p=0.081 n=30+30)
_ZenQ_NumWriters100_InputSize6000000-8       53.5 ±50%     31.8 ±100%  -40.60%  (p=0.000 n=30+30)
_ZenQ_NumWriters1000_InputSize7000000-8       525 ±39%        6 ±227%  -98.95%  (p=0.000 n=30+30)
_ZenQ_Million_Blocking_Writers-8            1.00M ± 0%     0.99M ± 0%   -0.41%  (p=0.000 n=28+29)

Here is the current comparison vs channels, also run over 30 cases with queue_size = 4096

name                                     time/op
_Chan_NumWriters1_InputSize600-8          23.3µs ± 1%
_ZenQ_NumWriters1_InputSize600-8          17.9µs ± 1%
_Chan_NumWriters3_InputSize60000-8        5.48ms ± 3%
_ZenQ_NumWriters3_InputSize60000-8        2.67ms ± 6%
_Chan_NumWriters8_InputSize6000000-8       679ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8       313ms ± 5%
_Chan_NumWriters100_InputSize6000000-8     1.58s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8     516ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8    1.97s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8    445ms ± 9%
_Chan_Million_Blocking_Writers-8           10.8s ± 1%
_ZenQ_Million_Blocking_Writers-8           11.0s ± 4%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B
_ZenQ_NumWriters1_InputSize600-8           0.00B
_Chan_NumWriters3_InputSize60000-8         95.0B ±65%
_ZenQ_NumWriters3_InputSize60000-8        34.8B ±127%
_Chan_NumWriters8_InputSize6000000-8       878B ±272%
_ZenQ_NumWriters8_InputSize6000000-8       671B ±222%
_Chan_NumWriters100_InputSize6000000-8    46.0kB ±31%
_ZenQ_NumWriters100_InputSize6000000-8   13.3kB ±100%
_Chan_NumWriters1000_InputSize7000000-8    488kB ± 8%
_ZenQ_NumWriters1000_InputSize7000000-8  2.37kB ±210%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          95.5MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00
_ZenQ_NumWriters1_InputSize600-8            0.00
_Chan_NumWriters3_InputSize60000-8          0.00
_ZenQ_NumWriters3_InputSize60000-8          0.00
_Chan_NumWriters8_InputSize6000000-8       2.77 ±225%
_ZenQ_NumWriters8_InputSize6000000-8       1.40 ±186%
_Chan_NumWriters100_InputSize6000000-8       164 ±20%
_ZenQ_NumWriters100_InputSize6000000-8     31.8 ±100%
_Chan_NumWriters1000_InputSize7000000-8    1.79k ± 3%
_ZenQ_NumWriters1000_InputSize7000000-8    5.50 ±227%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8            995k ± 0%

Also note that with queue_size = 4096, the current ZenQ performs slower than channels in the million blocking writers case, but after increasing the queue_size to 2^14 for both, ZenQ also performs better than channels in the million goroutines case.

queue_size = 2^14

Benchmark_Chan_Million_Blocking_Writers-8                  1    10982172959 ns/op   551333344 B/op   1991416 allocs/op
Benchmark_ZenQ_Million_Blocking_Writers-8                  1    9734108333 ns/op    92514952 B/op     963291 allocs/op
PASS
ok      command-line-arguments  41.376s
alphadose commented 2 years ago

@egonelbre

Update:- With a few tweaks, ZenQ now performs faster than channels even in the million goroutines case, for all queue_sizes.

Here are the latest benchmarks for queue_size 4096

name                                     time/op
_Chan_NumWriters1_InputSize600-8          23.2µs ± 1%
_ZenQ_NumWriters1_InputSize600-8          18.1µs ± 1%
_Chan_NumWriters3_InputSize60000-8        5.52ms ± 3%
_ZenQ_NumWriters3_InputSize60000-8        2.67ms ± 6%
_Chan_NumWriters8_InputSize6000000-8       680ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8       308ms ± 4%
_Chan_NumWriters100_InputSize6000000-8     1.56s ± 6%
_ZenQ_NumWriters100_InputSize6000000-8     519ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8    1.98s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8    441ms ±11%
_Chan_Million_Blocking_Writers-8           10.4s ± 3%
_ZenQ_Million_Blocking_Writers-8           8.56s ±24%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B
_ZenQ_NumWriters1_InputSize600-8           0.00B
_Chan_NumWriters3_InputSize60000-8          110B ±68%
_ZenQ_NumWriters3_InputSize60000-8        23.6B ±107%
_Chan_NumWriters8_InputSize6000000-8       585B ±234%
_ZenQ_NumWriters8_InputSize6000000-8       411B ±299%
_Chan_NumWriters100_InputSize6000000-8    44.7kB ±35%
_ZenQ_NumWriters100_InputSize6000000-8    19.7kB ±78%
_Chan_NumWriters1000_InputSize7000000-8    483kB ±10%
_ZenQ_NumWriters1000_InputSize7000000-8  1.13kB ±602%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          95.5MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00
_ZenQ_NumWriters1_InputSize600-8            0.00
_Chan_NumWriters3_InputSize60000-8          0.00
_ZenQ_NumWriters3_InputSize60000-8          0.00
_Chan_NumWriters8_InputSize6000000-8       2.20 ±218%
_ZenQ_NumWriters8_InputSize6000000-8       0.90 ±344%
_Chan_NumWriters100_InputSize6000000-8       163 ±18%
_ZenQ_NumWriters100_InputSize6000000-8      47.0 ±79%
_Chan_NumWriters1000_InputSize7000000-8    1.79k ± 6%
_ZenQ_NumWriters1000_InputSize7000000-8    2.00 ±550%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8            995k ± 0%
egonelbre commented 2 years ago

It does not make too much sense to talk about performance until there's feature parity. As I mentioned, it's easy to make a faster queue, if you don't need all the features. Similarly, there are still bugs in the Select, which make the benchmark results not useful.

This is the bug I mentioned in the previous comment (this program fails to complete sometimes):

package main

import (
    "sync"

    "github.com/alphadose/zenq"
)

func main() {
    var wg sync.WaitGroup
    defer wg.Wait()

    zq1 := zenq.New[int32]()
    zq2 := zenq.New[uint32]()

    wg.Add(2)
    go func() {
        defer wg.Done()
        for {
            v := zenq.Select(zq1, zq2)
            if v == int32(0) {
                break
            }
        }
    }()
    go func() {
        defer wg.Done()
        for {
            v := zenq.Select(zq1, zq2)
            if v == int32(0) {
                break
            }
        }
    }()

    for i := 0; i < 1e4; i++ {
        zq1.Write(1)
    }
    println("shutting down")
    zq1.Write(0) // using 0 because closing has not been implemented
    zq1.Write(0)
}

Also, there's a mutex per element, which causes it to behave like a FIFO. This also introduces a significant memory overhead for things like make(chan struct{}, 1e9) and make(chan byte, 1e9).

EDIT: Removed note about fairness of two concurrent receives, as I'm not sure whether it's required.

egonelbre commented 2 years ago

Also, your implementation has a hardcoded size and requires a power-of-2 number of elements.

alphadose commented 2 years ago

Will work on the fairness of select by going through the channel tests

Also, regarding the power-of-2 element count: this is just a POC, where a const is required for declaring arrays instead of slices. If this gets merged into the Go core, users could provide any size for the channel and the underlying implementation could use fastlog2() to round up to the nearest power of 2 and build the channel on top of that. For example, for make(chan int, 100) the underlying size would be 1 << fastlog2(100) = ~128.

The tradeoff for not using powers of 2 is that you have to use modulo instead of index masking to compute the read/write indexes, which costs a few extra nanoseconds per operation.
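
As a rough sketch of that rounding and indexing idea (the helper names are illustrative assumptions, not the actual implementation):

package queue

import "math/bits"

// roundUpPow2 returns the smallest power of 2 >= n, e.g. roundUpPow2(100) == 128.
func roundUpPow2(n uint64) uint64 {
    if n <= 1 {
        return 1
    }
    return 1 << bits.Len64(n-1)
}

// With a power-of-2 capacity the slot index is a cheap bitwise AND;
// with an arbitrary capacity it needs a (slower) integer modulo.
func slotByMask(counter, pow2Capacity uint64) uint64 { return counter & (pow2Capacity - 1) }
func slotByMod(counter, capacity uint64) uint64      { return counter % capacity }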

alphadose commented 2 years ago

Also, there isn't 1 mutex per element; it's num(mutexes) = queue_size.

With only 1 mutex per slot there isn't much memory overhead. This is evident from the million goroutines case, where ZenQ used about 400 MB less than channels.

alphadose commented 2 years ago

@egonelbre I just benchmarked with type Payload byte; there isn't much memory overhead, same as before.

goos: darwin
goarch: arm64
Benchmark_Chan_NumWriters1_InputSize600-8              52896         19625 ns/op           0 B/op          0 allocs/op
Benchmark_ZenQ_NumWriters1_InputSize600-8              93112         12912 ns/op           0 B/op          0 allocs/op
Benchmark_Chan_NumWriters3_InputSize60000-8              262       4607065 ns/op          74 B/op          0 allocs/op
Benchmark_ZenQ_NumWriters3_InputSize60000-8              510       2291942 ns/op          35 B/op          0 allocs/op
Benchmark_Chan_NumWriters8_InputSize6000000-8              2     511401104 ns/op         512 B/op          2 allocs/op
Benchmark_ZenQ_NumWriters8_InputSize6000000-8              3     355385736 ns/op         970 B/op          2 allocs/op
Benchmark_Chan_NumWriters100_InputSize6000000-8            1    1266029250 ns/op       31104 B/op        134 allocs/op
Benchmark_ZenQ_NumWriters100_InputSize6000000-8            2     551560271 ns/op       28912 B/op         69 allocs/op
Benchmark_Chan_NumWriters1000_InputSize7000000-8           1    1860939083 ns/op      483072 B/op       1775 allocs/op
Benchmark_ZenQ_NumWriters1000_InputSize7000000-8           3     507001500 ns/op           0 B/op          0 allocs/op
Benchmark_Chan_Million_Blocking_Writers-8                  1    10854043417 ns/op   552819712 B/op   1997209 allocs/op
Benchmark_ZenQ_Million_Blocking_Writers-8                  1    10369999125 ns/op   95540840 B/op     994865 allocs/op
PASS
ok      command-line-arguments  39.695s

Same with type Payload struct{}

Putting benchmarks aside, is there anything besides selection fairness and closing channels left to implement to achieve feature parity?

egonelbre commented 2 years ago

Also regarding the power of 2 elements...

I don't think restricting to pow2 is an acceptable tradeoff.

const is required for declaring arrays instead..

The slice should be implementable similarly to how it's implemented currently. I suspect it would still need to be a slice in the runtime, otherwise it would increase binary bloat.

there is only 1 mutex per slot, hence there isn't much memory overhead, this is evident from the million goroutines case, zenq took 400 MB lesser than channels

Size benchmark:

```
func BenchmarkChan_New(b *testing.B) {
    b.Run("struct{}", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            ch := make(chan struct{}, zenq.QueueSize)
            runtime.KeepAlive(ch)
        }
    })
    b.Run("byte", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            ch := make(chan byte, zenq.QueueSize)
            runtime.KeepAlive(ch)
        }
    })
    b.Run("int64", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            ch := make(chan int64, zenq.QueueSize)
            runtime.KeepAlive(ch)
        }
    })
}

func BenchmarkZenQ_New(b *testing.B) {
    b.Run("struct{}", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            zq := zenq.New[struct{}]()
            runtime.KeepAlive(zq)
        }
    })
    b.Run("byte", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            zq := zenq.New[byte]()
            runtime.KeepAlive(zq)
        }
    })
    b.Run("int64", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            zq := zenq.New[int64]()
            runtime.KeepAlive(zq)
        }
    })
}
```
BenchmarkChan_New/struct{}-32           29258943                42.65 ns/op           96 B/op          1 allocs/op
BenchmarkChan_New/byte-32                1000000              1110 ns/op            4864 B/op          1 allocs/op
BenchmarkChan_New/int64-32                144070              8599 ns/op           40960 B/op          1 allocs/op
BenchmarkZenQ_New/struct{}-32              48247             27403 ns/op          139264 B/op          1 allocs/op
BenchmarkZenQ_New/byte-32                  46503             26348 ns/op          139264 B/op          1 allocs/op
BenchmarkZenQ_New/int64-32                 46122             26360 ns/op          139264 B/op          1 allocs/op
egonelbre commented 2 years ago

Putting benchmarks aside, anything else from selection fairness and closing channels to be implemented to achieve feature parity?

I don't think this is the full list and I'm not the person who can approve the change, but off the top of my head:

  1. Ensure it can pass all the tests in https://github.com/golang/go/blob/master/src/runtime/chan_test.go.
  2. Run all the benchmarks in https://github.com/golang/go/blob/master/src/runtime/chan_test.go.
  3. No polling (including select).
  4. A SPIN model might be necessary if you decide on a complete rewrite (there are mentions of this in the other issue).
  5. Benchmark across multiple machines and many cores (e.g. 1, 2, 4, 8, 16, 32, 48).
  6. I noticed that https://go-review.googlesource.com/c/go/+/16740/ indicates some fairness requirements -- i.e. a requirement of FIFO (for non-select).
  7. Run benchmarks across the entire Go repository and the external test suite, demonstrating that common cases don't suffer (e.g. ctx.Done(), timers and friends).
  8. Integrate it into runtime/chan.go -- there have been faster implementations before, so it's known the perf of the current implementation can be improved.
  9. Plays well with the race detector.

Or in other words, the problem isn't coming up with a slightly better queue, but integrating it into Go codebase and ensuring it doesn't break anything and preserves the necessary properties.

alphadose commented 2 years ago

Regarding the race detector: I cannot currently use race.Acquire since this is not a runtime-internal package; that is only possible from within the Go codebase.

egonelbre commented 2 years ago

Ah, also, select needs to support both send and recv, as well as the case of recv/send on nil channels.
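
As a reminder of what that entails for plain channels (standard behaviour any replacement Select would have to reproduce): a select can mix send and receive cases, picks pseudo-randomly among the ready ones, and a case on a nil channel is never ready.

package main

import "fmt"

func main() {
    in := make(chan int, 1)
    out := make(chan int, 1)
    var disabled chan int // a nil channel: its case is never ready

    in <- 1
    for i := 0; i < 2; i++ {
        select {
        case v := <-in: // receive case
            fmt.Println("received", v)
        case out <- 42: // send case
            fmt.Println("sent 42")
        case <-disabled: // never fires; recv/send on nil channels block forever
            fmt.Println("unreachable")
        }
    }
}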

egonelbre commented 2 years ago

Apparently I made a critical error in one of my last comments:

I'm the person who can approve the change....

It should have said "I'm not the person who can approve..."... I now need to find the largest facepalm emote.

alphadose commented 2 years ago

@egonelbre

Updates:-

  1. Closing implemented
  2. Selection implemented without polling (scope for further improvements present)
  3. Performance improved

Now I will run this implementation against the test suite you provided and try to pass all the tests, making modifications to the current implementation if required.

alphadose commented 2 years ago

@egonelbre Updates:-

By adding direct_send and an optimistic first-pass read approach to the selector, I was able to improve its performance.
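
Roughly, the optimistic first pass works like the sketch below (the reader interface, TryRead, and BlockingRead are hypothetical names used only for this illustration; direct_send, i.e. handing an element straight to an already-waiting reader, is not shown):

package zenq

// reader is a hypothetical view of a queue used only for this sketch;
// TryRead is a non-blocking read and BlockingRead parks until data arrives.
type reader[T any] interface {
    TryRead() (T, bool)
    BlockingRead() T
}

// optimisticSelect illustrates the "optimistic first pass": poll every queue
// once without blocking and only fall back to the slow (parking) path if
// nothing is immediately ready.
func optimisticSelect[T any](queues ...reader[T]) T {
    // Fast path: a single non-blocking pass over all queues.
    for _, q := range queues {
        if v, ok := q.TryRead(); ok {
            return v
        }
    }
    // Slow path (simplified): the real selector parks the goroutine and is
    // woken by whichever queue receives data first; blocking on the first
    // queue here just keeps the sketch short.
    return queues[0].BlockingRead()
}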

It's now comparable to channels, although still not as fast for large batch sizes. Here are some basic benchmarks for selection:

With Input Batch Size: 60 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 61.375µs
ZenQ Select Runner completed transfer in: 73.416µs
====================================================================

With Input Batch Size: 600 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 109.041µs
ZenQ Select Runner completed transfer in: 399.125µs
====================================================================

With Input Batch Size: 6000 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 896.917µs
ZenQ Select Runner completed transfer in: 3.627416ms
====================================================================

With Input Batch Size: 600000 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 129.959167ms
ZenQ Select Runner completed transfer in: 300.164333ms
====================================================================
egonelbre commented 2 years ago

@alphadose note that there's still a significant memory usage difference between ZenQ and channels: 5x for the usual case.

alphadose commented 2 years ago

In terms of memory usage, for a fair comparison it would be better to split it into 2 categories: initialization memory and operational memory usage.

In terms of initialization memory, channels are better and have a lower footprint

But in terms of operational memory usage ZenQ is better

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B     
_ZenQ_NumWriters1_InputSize600-8           0.00B     
_Chan_NumWriters3_InputSize60000-8          104B ±54%
_ZenQ_NumWriters3_InputSize60000-8         22.2B ±91%
_Chan_NumWriters8_InputSize6000000-8       813B ±307%
_ZenQ_NumWriters8_InputSize6000000-8       690B ±115%
_Chan_NumWriters100_InputSize6000000-8    42.6kB ±36%
_ZenQ_NumWriters100_InputSize6000000-8   5.92kB ±118%
_Chan_NumWriters1000_InputSize7000000-8    475kB ±11%
_ZenQ_NumWriters1000_InputSize7000000-8   41.6kB ±34%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          47.4MB ± 9%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00     
_ZenQ_NumWriters1_InputSize600-8            0.00     
_Chan_NumWriters3_InputSize60000-8          0.00     
_ZenQ_NumWriters3_InputSize60000-8          0.00     
_Chan_NumWriters8_InputSize6000000-8       2.77 ±225%
_ZenQ_NumWriters8_InputSize6000000-8        2.63 ±52%
_Chan_NumWriters100_InputSize6000000-8       157 ±17%
_ZenQ_NumWriters100_InputSize6000000-8     14.3 ±116%
_Chan_NumWriters1000_InputSize7000000-8    1.77k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-8     30.3 ±42%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8           1.00M ± 0%
alphadose commented 2 years ago

If you pay the initialization cost, you can gain a substantial reduction in operational memory usage.

And the initialization cost is really cheap considering today's hardware.

It's the operational cost which needs to be reduced and optimised.

egonelbre commented 2 years ago

There are many ephemeral channels for context cancellations (and similar) that are impacted by the memory usage. Of course, it's difficult to understand the real-world impact without testing. Based on my experience, small channels are more common than large channels that require high-throughput.

alphadose commented 2 years ago

True, if throughput is low then it is better to use channels to save resources.

But there have also been cases where channels were insufficient, and in those scenarios users implemented their own queues.

IMO Go should also provide a throughput-optimised communication mechanism.

egonelbre commented 2 years ago

If 95% (just guessing) of channel usage is not throughput related, then throughput needs to be carefully balanced against those uses. Similarly, the performance and memory usage of those use cases is probably more important for the stdlib than high-workload, high-throughput ones.

Also, as you've demonstrated, it's possible to have external library queues that trade off memory for performance. So there's no pressing need to have it in the stdlib; however, I definitely agree it would be nice.

alphadose commented 2 years ago

Actually, there is one more important reason for making it a part of the standard library.

Currently, it is not possible to access the goroutine struct and other such internals from an external library.

With that access, more optimisations could be made, making the current ZenQ even more efficient, which won't be possible from an external library.

Extracting the last bit of performance can only be done from within Go's internal libraries, and once that is done, no external library will be able to match the performance of the native implementation (unless it uses a better algorithm).

alphadose commented 2 years ago

@egonelbre performance has improved a lot in the million goroutines case

name                                     time/op
_Chan_NumWriters1_InputSize600-8         23.3µs ± 1%
_ZenQ_NumWriters1_InputSize600-8         38.7µs ± 1%
_Chan_NumWriters3_InputSize60000-8       5.48ms ± 1%
_ZenQ_NumWriters3_InputSize60000-8       2.63ms ± 1%
_Chan_NumWriters8_InputSize6000000-8      685ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8      254ms ± 3%
_Chan_NumWriters100_InputSize6000000-8    1.60s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8    298ms ± 1%
_Chan_NumWriters1000_InputSize7000000-8   1.98s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8   409ms ± 1%
_Chan_Million_Blocking_Writers-8          10.5s ± 1%
_ZenQ_Million_Blocking_Writers-8          1.99s ±16%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8          0.00B
_ZenQ_NumWriters1_InputSize600-8          0.00B
_Chan_NumWriters3_InputSize60000-8       17.5B ±163%
_ZenQ_NumWriters3_InputSize60000-8       13.4B ±348%
_Chan_NumWriters8_InputSize6000000-8      123B ±148%
_ZenQ_NumWriters8_InputSize6000000-8       545B ±56%
_Chan_NumWriters100_InputSize6000000-8   36.1kB ±49%
_ZenQ_NumWriters100_InputSize6000000-8   9.32kB ±32%
_Chan_NumWriters1000_InputSize7000000-8   479kB ± 8%
_ZenQ_NumWriters1000_InputSize7000000-8  89.3kB ± 5%
_Chan_Million_Blocking_Writers-8          553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          122MB ± 3%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8           0.00
_ZenQ_NumWriters1_InputSize600-8           0.00
_Chan_NumWriters3_InputSize60000-8         0.00
_ZenQ_NumWriters3_InputSize60000-8         0.00
_Chan_NumWriters8_InputSize6000000-8      1.10 ±173%
_ZenQ_NumWriters8_InputSize6000000-8       3.19 ±57%
_Chan_NumWriters100_InputSize6000000-8      140 ±32%
_ZenQ_NumWriters100_InputSize6000000-8     21.8 ±33%
_Chan_NumWriters1000_InputSize7000000-8   1.77k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-8    46.5 ±27%
_Chan_Million_Blocking_Writers-8          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8          1.00M ± 0%
alphadose commented 2 years ago

@egonelbre Updates:-

Operational memory usage has been reduced drastically (this is especially evident in the 1000 goroutines case).

goos: darwin
goarch: arm64

name                                     time/op
_Chan_NumWriters1_InputSize600-8          23.4µs ± 1%
_ZenQ_NumWriters1_InputSize600-8          17.7µs ± 0%
_Chan_NumWriters3_InputSize60000-8        5.54ms ± 5%
_ZenQ_NumWriters3_InputSize60000-8        2.63ms ± 2%
_Chan_NumWriters8_InputSize6000000-8       687ms ± 2%
_ZenQ_NumWriters8_InputSize6000000-8       243ms ± 4%
_Chan_NumWriters100_InputSize6000000-8     1.59s ± 4%
_ZenQ_NumWriters100_InputSize6000000-8     296ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8    1.97s ± 0%
_ZenQ_NumWriters1000_InputSize7000000-8    409ms ± 2%
_Chan_Million_Blocking_Writers-8           10.4s ± 4%
_ZenQ_Million_Blocking_Writers-8           1.83s ±10%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B
_ZenQ_NumWriters1_InputSize600-8           0.00B
_Chan_NumWriters3_InputSize60000-8          117B ±63%
_ZenQ_NumWriters3_InputSize60000-8        22.1B ±122%
_Chan_NumWriters8_InputSize6000000-8     1.01kB ±196%
_ZenQ_NumWriters8_InputSize6000000-8      1.12kB ±89%
_Chan_NumWriters100_InputSize6000000-8    42.6kB ±37%
_ZenQ_NumWriters100_InputSize6000000-8    11.3kB ±28%
_Chan_NumWriters1000_InputSize7000000-8    481kB ± 7%
_ZenQ_NumWriters1000_InputSize7000000-8   90.5kB ± 6%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8           123MB ± 4%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00
_ZenQ_NumWriters1_InputSize600-8            0.00
_Chan_NumWriters3_InputSize60000-8          0.00
_ZenQ_NumWriters3_InputSize60000-8          0.00
_Chan_NumWriters8_InputSize6000000-8       3.43 ±162%
_ZenQ_NumWriters8_InputSize6000000-8        5.23 ±53%
_Chan_NumWriters100_InputSize6000000-8       158 ±20%
_ZenQ_NumWriters100_InputSize6000000-8      26.3 ±29%
_Chan_NumWriters1000_InputSize7000000-8    1.76k ± 2%
_ZenQ_NumWriters1000_InputSize7000000-8     48.3 ±28%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8           1.00M ± 0%
alphadose commented 2 years ago

Here are a few more benchmarks for different environments/systems

  1. windows, amd64, 16 cores
    
    goos: windows
    goarch: amd64
    cpu: AMD Ryzen 7 5800H with Radeon Graphics

name                                       time/op
_Chan_NumWriters1_InputSize600-16          24.5µs ± 5%
_ZenQ_NumWriters1_InputSize600-16          17.7µs ± 2%
_Chan_NumWriters3_InputSize60000-16        4.75ms ± 3%
_ZenQ_NumWriters3_InputSize60000-16        1.86ms ± 1%
_Chan_NumWriters8_InputSize6000000-16       800ms ± 5%
_ZenQ_NumWriters8_InputSize6000000-16       150ms ± 1%
_Chan_NumWriters100_InputSize6000000-16     1.66s ± 1%
_ZenQ_NumWriters100_InputSize6000000-16     160ms ± 1%
_Chan_NumWriters1000_InputSize7000000-16    1.95s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-16    269ms ± 1%
_Chan_Million_Blocking_Writers-16           5.79s ± 2%
_ZenQ_Million_Blocking_Writers-16           1.37s ± 6%

name                                       alloc/op
_Chan_NumWriters1_InputSize600-16           0.00B
_ZenQ_NumWriters1_InputSize600-16           0.00B
_Chan_NumWriters3_InputSize60000-16          150B ±57%
_ZenQ_NumWriters3_InputSize60000-16        20.6B ±201%
_Chan_NumWriters8_InputSize6000000-16       472B ±283%
_ZenQ_NumWriters8_InputSize6000000-16      1.05kB ±58%
_Chan_NumWriters100_InputSize6000000-16    43.7kB ±38%
_ZenQ_NumWriters100_InputSize6000000-16    29.7kB ±17%
_Chan_NumWriters1000_InputSize7000000-16    484kB ± 7%
_ZenQ_NumWriters1000_InputSize7000000-16    120kB ±14%
_Chan_Million_Blocking_Writers-16           553MB ± 0%
_ZenQ_Million_Blocking_Writers-16           128MB ± 4%

name                                       allocs/op
_Chan_NumWriters1_InputSize600-16            0.00
_ZenQ_NumWriters1_InputSize600-16            0.00
_Chan_NumWriters3_InputSize60000-16          0.00
_ZenQ_NumWriters3_InputSize60000-16          0.00
_Chan_NumWriters8_InputSize6000000-16       2.00 ±150%
_ZenQ_NumWriters8_InputSize6000000-16        3.90 ±28%
_Chan_NumWriters100_InputSize6000000-16       148 ±34%
_ZenQ_NumWriters100_InputSize6000000-16      68.3 ±24%
_Chan_NumWriters1000_InputSize7000000-16    1.79k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-16     62.3 ±36%
_Chan_Million_Blocking_Writers-16           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-16           1.00M ± 0%


  2. ubuntu, amd64, 16 cores

    goos: linux
    goarch: amd64
    cpu: AMD Ryzen 7 5800H with Radeon Graphics

name                                       time/op
_Chan_NumWriters1_InputSize600-16          23.4µs ± 4%
_ZenQ_NumWriters1_InputSize600-16          33.1µs ± 4%
_Chan_NumWriters3_InputSize60000-16        2.59ms ± 3%
_ZenQ_NumWriters3_InputSize60000-16        1.79ms ± 1%
_Chan_NumWriters8_InputSize6000000-16       334ms ± 6%
_ZenQ_NumWriters8_InputSize6000000-16       162ms ± 4%
_Chan_NumWriters100_InputSize6000000-16     515ms ± 6%
_ZenQ_NumWriters100_InputSize6000000-16     170ms ± 3%
_Chan_NumWriters1000_InputSize7000000-16    1.76s ± 3%
_ZenQ_NumWriters1000_InputSize7000000-16    273ms ± 2%
_Chan_Million_Blocking_Writers-16           4.52s ± 5%
_ZenQ_Million_Blocking_Writers-16           1.27s ±14%

name                                       alloc/op
_Chan_NumWriters1_InputSize600-16           0.00B
_ZenQ_NumWriters1_InputSize600-16           0.00B
_Chan_NumWriters3_InputSize60000-16         91.7B ±51%
_ZenQ_NumWriters3_InputSize60000-16         4.00B ± 0%
_Chan_NumWriters8_InputSize6000000-16       487B ±275%
_ZenQ_NumWriters8_InputSize6000000-16       879B ±111%
_Chan_NumWriters100_InputSize6000000-16    30.0kB ±47%
_ZenQ_NumWriters100_InputSize6000000-16    23.2kB ±54%
_Chan_NumWriters1000_InputSize7000000-16    463kB ±11%
_ZenQ_NumWriters1000_InputSize7000000-16    129kB ±10%
_Chan_Million_Blocking_Writers-16           553MB ± 0%
_ZenQ_Million_Blocking_Writers-16           124MB ± 3%

name                                       allocs/op
_Chan_NumWriters1_InputSize600-16            0.00
_ZenQ_NumWriters1_InputSize600-16            0.00
_Chan_NumWriters3_InputSize60000-16          0.00
_ZenQ_NumWriters3_InputSize60000-16          0.00
_Chan_NumWriters8_InputSize6000000-16       1.57 ±219%
_ZenQ_NumWriters8_InputSize6000000-16        3.48 ±44%
_Chan_NumWriters100_InputSize6000000-16      87.8 ±40%
_ZenQ_NumWriters100_InputSize6000000-16      54.3 ±54%
_Chan_NumWriters1000_InputSize7000000-16    1.67k ± 9%
_ZenQ_NumWriters1000_InputSize7000000-16     63.5 ±10%
_Chan_Million_Blocking_Writers-16           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-16           1.00M ± 0%

NyaaaWhatsUpDoc commented 2 years ago

I would be interested to see these benchmark results on other architectures, and on systems with fewer resources to work with, as a Ryzen 5800H is quite beefy compared to a lot of the systems this could be running on. Your implementation scales much better at the larger end of inputs/writers, but I'd be interested in how the initialization overhead scales down on more constrained systems.

alphadose commented 2 years ago

Cool, I have been collecting benchmarks for all kinds of systems and will definitely benchmark this on low-end systems like a Raspberry Pi.

I am gathering all benchmarks here

alphadose commented 2 years ago

@NyaaaWhatsUpDoc apologies for the delay, here are the benchmarks for a low-end device (a 32-bit Raspberry Pi)

Device Info

alphadose@neverwinter
---------------------
OS: Raspbian GNU/Linux 11 (bullseye) armv7l
Host: Raspberry Pi 4 Model B Rev 1.5
Kernel: 5.15.32-v7l+
Uptime: 1 hour, 58 mins
Packages: 569 (dpkg)
Shell: bash 5.1.4
Terminal: /dev/pts/0
CPU: BCM2711 (4) @ 1.800GHz
Memory: 68MiB / 3838MiB

Benchstat of 20 cases

goos: linux
goarch: arm
name                                     time/op
_Chan_NumWriters1_InputSize600-4          230µs ± 4%
_ZenQ_NumWriters1_InputSize600-4          186µs ± 5%
_Chan_NumWriters3_InputSize60000-4       28.2ms ± 3%
_ZenQ_NumWriters3_InputSize60000-4       12.8ms ± 0%
_Chan_NumWriters8_InputSize6000000-4      4.14s ±10%
_ZenQ_NumWriters8_InputSize6000000-4      1.32s ± 1%
_Chan_NumWriters100_InputSize6000000-4    5.97s ± 5%
_ZenQ_NumWriters100_InputSize6000000-4    1.48s ± 5%
_Chan_NumWriters1000_InputSize7000000-4   7.23s ± 6%
_ZenQ_NumWriters1000_InputSize7000000-4   2.09s ± 4%
_Chan_Million_Blocking_Writers-4          20.3s ± 2%
_ZenQ_Million_Blocking_Writers-4          6.96s ± 4%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-4          0.00B
_ZenQ_NumWriters1_InputSize600-4          0.00B
_Chan_NumWriters3_InputSize60000-4         227B ±27%
_ZenQ_NumWriters3_InputSize60000-4        77.9B ±91%
_Chan_NumWriters8_InputSize6000000-4      499B ±189%
_ZenQ_NumWriters8_InputSize6000000-4     1.49kB ± 4%
_Chan_NumWriters100_InputSize6000000-4   27.5kB ±19%
_ZenQ_NumWriters100_InputSize6000000-4   27.7kB ±42%
_Chan_NumWriters1000_InputSize7000000-4   290kB ± 5%
_ZenQ_NumWriters1000_InputSize7000000-4   135kB ± 8%
_Chan_Million_Blocking_Writers-4          325MB ± 0%
_ZenQ_Million_Blocking_Writers-4         76.2MB ± 3%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-4           0.00
_ZenQ_NumWriters1_InputSize600-4           0.00
_Chan_NumWriters3_InputSize60000-4         1.00 ± 0%
_ZenQ_NumWriters3_InputSize60000-4         0.00
_Chan_NumWriters8_InputSize6000000-4      4.30 ±109%
_ZenQ_NumWriters8_InputSize6000000-4       19.2 ± 9%
_Chan_NumWriters100_InputSize6000000-4      171 ±13%
_ZenQ_NumWriters100_InputSize6000000-4      194 ±25%
_Chan_NumWriters1000_InputSize7000000-4   1.84k ± 3%
_ZenQ_NumWriters1000_InputSize7000000-4   1.09k ± 4%
_Chan_Million_Blocking_Writers-4          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-4          1.00M ± 0%

We can see that ZenQ scales well on both low-end and high-end devices.

alphadose commented 2 years ago

ZenQ's select{} performance has improved too since last time

❯ go run benchmarks/selector/main.go
With Input Batch Size: 60 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 52.084µs
ZenQ Select Runner completed transfer in: 60.584µs
====================================================================

With Input Batch Size: 600 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 160.708µs
ZenQ Select Runner completed transfer in: 349.792µs
====================================================================

With Input Batch Size: 6000 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 2.030959ms
ZenQ Select Runner completed transfer in: 3.499ms
====================================================================

With Input Batch Size: 600000 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 174.827458ms
ZenQ Select Runner completed transfer in: 226.949375ms
====================================================================

For selection, ZenQ uses Go interfaces, which internally work via dynamic dispatch; this cost adds up and contributes to the slower performance compared to channels.