Open alphadose opened 2 years ago
I assume that there are no API changes here, so taking this out of the proposal process.
@alphadose As I write this you have not signed the CLA, so we are unable to look at your code. Are you willing to sign the CLA as described at https://go.dev/doc/contribute#cla ? Thanks.
I have submitted my CLA
Thanks!
I took a very quick look. Channels aren't just thread-safe queues; they also have to handle the select statement and the close function. Can we retain the speed improvements while still implementing all required features?
CC @golang/runtime
Yep, that is exactly what I want:
the syntactic sugar of channels plus great performance.
Since channels are used a lot, this might help people out.
PS: this is just a POC; as you said, it needs to be fully compatible with the linked methods in https://go.dev/src/runtime/chan.go
Similarly, the implementation busy-spins when it waits, which is not suitable for the default implementation.
Also, there's a separate issue for having lock-free channels https://github.com/golang/go/issues/8899
The spinning is applied to an individual slot in the queue, whose size can range from 2 to 2^64 - 1.
Because there is no global lock, spinning is only used to contest individual slots via slot-specific states. The large number of slots keeps collisions between reader and writer indexes low, and far less time is spent in the compare-and-swap loops because each lock/state has its own memory address. But you are right: when there are a lot of concurrent writer goroutines (1 million or more), it is better to use exponential backoff, or to sleep/park the goroutine, to reduce slot contention and optimize further.
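For illustration, here is a minimal sketch (my own, not the actual ZenQ code) of the per-slot state idea described above: each slot carries its own atomic state word, so writers compare-and-swap on separate memory addresses instead of contending on one global lock.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// Slot states. Each slot has its own state word, so contention is
// spread across many addresses rather than one global lock.
const (
	slotEmpty uint32 = iota
	slotBusy
	slotCommitted
)

const size = 8 // power of two, so idx & (size-1) replaces modulo

type slot struct {
	state uint32
	_     [60]byte // padding: keep state words on separate cache lines
	item  int
}

type queue struct {
	writeIdx uint64
	readIdx  uint64
	slots    [size]slot
}

func (q *queue) Write(v int) {
	idx := atomic.AddUint64(&q.writeIdx, 1) - 1
	s := &q.slots[idx&(size-1)]
	// Spin only on this slot's state, yielding between attempts.
	for !atomic.CompareAndSwapUint32(&s.state, slotEmpty, slotBusy) {
		runtime.Gosched()
	}
	s.item = v
	atomic.StoreUint32(&s.state, slotCommitted)
}

func (q *queue) Read() int {
	idx := atomic.AddUint64(&q.readIdx, 1) - 1
	s := &q.slots[idx&(size-1)]
	for atomic.LoadUint32(&s.state) != slotCommitted {
		runtime.Gosched()
	}
	v := s.item
	atomic.StoreUint32(&s.state, slotEmpty)
	return v
}

func main() {
	q := new(queue)
	go func() {
		for i := 0; i < 100; i++ {
			q.Write(i)
		}
	}()
	sum := 0
	for i := 0; i < 100; i++ {
		sum += q.Read()
	}
	fmt.Println(sum) // 0+1+...+99 = 4950
}
```

A real implementation would park waiting goroutines instead of busy-spinning with runtime.Gosched(), which is exactly the concern raised later in the thread.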
As for https://github.com/golang/go/issues/8899, my implementation has lower allocs/op than that lock-free buffer as well as the native channel, and is also significantly faster.
You need to implement select, select fairness, parking and closing etc. before you can draw any conclusions on the performance. Of course, you can make a faster implementation when you discard some of the requirements.
Similarly, also run the benchmarks that are in https://github.com/golang/go/blob/master/src/runtime/chan_test.go. Also you need to show benchmarks such as 1M blocked writers. If you are not demonstrating where your code is worse, then it's not a fair evaluation.
Got it. I will write another benchmark replicating the 1M blocked writers scenario for both channels and ZenQ and record the resource utilization.
After that, we can look at how to implement a select {}-like mechanism across multiple ZenQs.
Note for benchmarking please see https://pkg.go.dev/golang.org/x/tools/cmd/benchcmp and http://www.inanzzz.com/index.php/post/yz8n/using-golang-bench-benchstat-and-benchcmp-to-measure-performance. You need statistical significance when measuring.
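To illustrate the kind of input benchstat consumes, here is a minimal, hypothetical benchmark (the names and sizes are mine, not from this thread). Normally it lives in a _test.go file and is run with `go test -bench=. -benchmem -count=30 > old.txt` so benchstat can report statistically significant deltas; it is written here as a standalone program via testing.Benchmark so it can run directly.

```go
package main

import (
	"fmt"
	"testing"
)

// BenchmarkChanWrite measures sends on a buffered channel while a
// background goroutine drains it. A ZenQ counterpart with the same
// shape would produce directly comparable benchstat rows.
func BenchmarkChanWrite(b *testing.B) {
	ch := make(chan int, 4096)
	go func() {
		for range ch {
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- i
	}
	close(ch)
}

func main() {
	// testing.Benchmark lets us run the benchmark outside "go test".
	r := testing.Benchmark(BenchmarkChanWrite)
	fmt.Println(r.N > 0) // true: the benchmark executed at least once
}
```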
I also recommend reading through the lock-free channel issue fully, the design documents and previous pull-requests, before continuing. That way it'll avoid some topics that have come up before.
@egonelbre I was able to implement parking, which saves a ton of goroutine scheduling overhead, and now ZenQ is more efficient than channels even with 1 million blocking goroutines.
Here are the latest benchmarks, run over 30 cases:
name time/op
_Chan_NumWriters1_InputSize600-8 24.6µs ± 1%
_ZenQ_NumWriters1_InputSize600-8 16.5µs ± 1%
_Chan_NumWriters3_InputSize60000-8 6.21ms ± 2%
_ZenQ_NumWriters3_InputSize60000-8 2.85ms ± 0%
_Chan_NumWriters8_InputSize6000000-8 735ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8 417ms ± 0%
_Chan_NumWriters100_InputSize6000000-8 1.61s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8 741ms ± 3%
_Chan_NumWriters1000_InputSize7000000-8 1.98s ± 0%
_ZenQ_NumWriters1000_InputSize7000000-8 1.05s ± 1%
_Chan_Million_Blocking_Writers-8 10.0s ±13%
_ZenQ_Million_Blocking_Writers-8 7.01s ±44%
name alloc/op
_Chan_NumWriters1_InputSize600-8 0.00B
_ZenQ_NumWriters1_InputSize600-8 0.00B
_Chan_NumWriters3_InputSize60000-8 106B ±88%
_ZenQ_NumWriters3_InputSize60000-8 28.9B ±111%
_Chan_NumWriters8_InputSize6000000-8 946B ±267%
_ZenQ_NumWriters8_InputSize6000000-8 885B ±163%
_Chan_NumWriters100_InputSize6000000-8 46.7kB ±25%
_ZenQ_NumWriters100_InputSize6000000-8 16.2kB ±66%
_Chan_NumWriters1000_InputSize7000000-8 484kB ±10%
_ZenQ_NumWriters1000_InputSize7000000-8 62.4kB ±82%
_Chan_Million_Blocking_Writers-8 553MB ± 0%
_ZenQ_Million_Blocking_Writers-8 95.9MB ± 0%
name allocs/op
_Chan_NumWriters1_InputSize600-8 0.00
_ZenQ_NumWriters1_InputSize600-8 0.00
_Chan_NumWriters3_InputSize60000-8 0.00
_ZenQ_NumWriters3_InputSize60000-8 0.00
_Chan_NumWriters8_InputSize6000000-8 3.07 ±193%
_ZenQ_NumWriters8_InputSize6000000-8 2.07 ±142%
_Chan_NumWriters100_InputSize6000000-8 166 ±15%
_ZenQ_NumWriters100_InputSize6000000-8 53.5 ±50%
_Chan_NumWriters1000_InputSize7000000-8 1.74k ± 7%
_ZenQ_NumWriters1000_InputSize7000000-8 525 ±39%
_Chan_Million_Blocking_Writers-8 2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8 1.00M ± 0%
More improvements can be made by using runtime-internal calls like goparkunlock(), getg() and goready() to schedule goroutines more effectively.
But this requires assembly stubs to load the *g pointer. Here is my current implementation, please have a look at https://github.com/alphadose/ZenQ/blob/main/lib_runtime_linkage.go and suggest any better methods.
@egonelbre I was able to implement select{}, which selects fairly based on the least number of reads among the given queues; implementation -> here
Here is an example demonstrating its usage:
package main

import (
	"fmt"

	"github.com/alphadose/zenq"
)

type custom1 struct {
	alpha int
	beta  string
}

type custom2 struct {
	gamma int
}

var (
	zq1 = zenq.New[int]()
	zq2 = zenq.New[string]()
	zq3 = zenq.New[custom1]()
	zq4 = zenq.New[*custom2]()
)

func main() {
	go looper(intProducer)
	go looper(stringProducer)
	go looper(custom1Producer)
	go looper(custom2Producer)

	for i := 0; i < 40; i++ {
		// Selection occurs here
		switch data := zenq.Select(zq1, zq2, zq3, zq4).(type) {
		case int:
			fmt.Printf("Received int %d\n", data)
		case string:
			fmt.Printf("Received string %s\n", data)
		case custom1:
			fmt.Printf("Received custom data type number 1 %#v\n", data)
		case *custom2:
			fmt.Printf("Received pointer %#v\n", data)
		}
	}
}

func intProducer(ctr int) { zq1.Write(ctr) }

func stringProducer(ctr int) { zq2.Write(fmt.Sprint(ctr * 10)) }

func custom1Producer(ctr int) { zq3.Write(custom1{alpha: ctr, beta: fmt.Sprint(ctr)}) }

func custom2Producer(ctr int) { zq4.Write(&custom2{gamma: 1 << ctr}) }

func looper(producer func(ctr int)) {
	for i := 0; i < 10; i++ {
		producer(i)
	}
}
@alphadose it's a polling implementation, which it should not be. The select also contains a lockup with regard to multiple selects on the same channel, even when a regular select would be able to proceed.
Similarly, the fairness guarantee doesn't seem to hold:
func main() {
	zq1 := zenq.New[int32]()
	zq2 := zenq.New[uint32]()

	go func() { for { zq1.Write(1) } }()
	go func() { for { zq2.Write(1) } }()

	for i := 0; i < 1e3; i++ { zq1.Read() }

	for i := 0; i < 100; i++ {
		data := zenq.Select(zq1, zq2)
		fmt.Println(data) // currently this only ever receives from zq2; the select should be fair between zq1 and zq2
	}
}
I'm all for implementing custom primitives for niche cases, however, I would recommend trying to improve the performance of the current implementation, rather than implementing all the minutia from scratch.
Agreed regarding the fairness.
As for performance, I am currently writing assembly to load the *g pointer from thread-local storage so I can benchmark the parking/unparking implementation; hopefully we will see a good improvement.
@egonelbre I was able to get a good performance improvement by using gopark() and goready(). Here is the comparison benchmark against the old algorithm, run over 30 cases with queue_size = 4096:
name old time/op new time/op delta
_ZenQ_NumWriters1_InputSize600-8 16.5µs ± 1% 17.9µs ± 1% +8.65% (p=0.000 n=28+29)
_ZenQ_NumWriters3_InputSize60000-8 2.85ms ± 0% 2.67ms ± 6% -6.11% (p=0.000 n=23+30)
_ZenQ_NumWriters8_InputSize6000000-8 417ms ± 0% 313ms ± 5% -24.83% (p=0.000 n=23+29)
_ZenQ_NumWriters100_InputSize6000000-8 741ms ± 3% 516ms ± 2% -30.40% (p=0.000 n=29+30)
_ZenQ_NumWriters1000_InputSize7000000-8 1.05s ± 1% 0.45s ± 9% -57.58% (p=0.000 n=28+30)
_ZenQ_Million_Blocking_Writers-8 7.01s ±44% 10.98s ± 4% +56.54% (p=0.000 n=30+28)
name old alloc/op new alloc/op delta
_ZenQ_NumWriters1_InputSize600-8 0.00B 0.00B ~ (all equal)
_ZenQ_NumWriters3_InputSize60000-8 28.9B ±111% 34.8B ±127% ~ (p=0.268 n=30+29)
_ZenQ_NumWriters8_InputSize6000000-8 885B ±163% 671B ±222% ~ (p=0.208 n=30+30)
_ZenQ_NumWriters100_InputSize6000000-8 16.2kB ±66% 13.3kB ±100% ~ (p=0.072 n=30+30)
_ZenQ_NumWriters1000_InputSize7000000-8 62.4kB ±82% 2.4kB ±210% -96.20% (p=0.000 n=30+30)
_ZenQ_Million_Blocking_Writers-8 95.9MB ± 0% 95.5MB ± 0% -0.41% (p=0.000 n=28+30)
name old allocs/op new allocs/op delta
_ZenQ_NumWriters1_InputSize600-8 0.00 0.00 ~ (all equal)
_ZenQ_NumWriters3_InputSize60000-8 0.00 0.00 ~ (all equal)
_ZenQ_NumWriters8_InputSize6000000-8 2.07 ±142% 1.40 ±186% ~ (p=0.081 n=30+30)
_ZenQ_NumWriters100_InputSize6000000-8 53.5 ±50% 31.8 ±100% -40.60% (p=0.000 n=30+30)
_ZenQ_NumWriters1000_InputSize7000000-8 525 ±39% 6 ±227% -98.95% (p=0.000 n=30+30)
_ZenQ_Million_Blocking_Writers-8 1.00M ± 0% 0.99M ± 0% -0.41% (p=0.000 n=28+29)
Here is the current comparison vs channels, also run over 30 cases with queue_size = 4096:
name time/op
_Chan_NumWriters1_InputSize600-8 23.3µs ± 1%
_ZenQ_NumWriters1_InputSize600-8 17.9µs ± 1%
_Chan_NumWriters3_InputSize60000-8 5.48ms ± 3%
_ZenQ_NumWriters3_InputSize60000-8 2.67ms ± 6%
_Chan_NumWriters8_InputSize6000000-8 679ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8 313ms ± 5%
_Chan_NumWriters100_InputSize6000000-8 1.58s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8 516ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8 1.97s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8 445ms ± 9%
_Chan_Million_Blocking_Writers-8 10.8s ± 1%
_ZenQ_Million_Blocking_Writers-8 11.0s ± 4%
name alloc/op
_Chan_NumWriters1_InputSize600-8 0.00B
_ZenQ_NumWriters1_InputSize600-8 0.00B
_Chan_NumWriters3_InputSize60000-8 95.0B ±65%
_ZenQ_NumWriters3_InputSize60000-8 34.8B ±127%
_Chan_NumWriters8_InputSize6000000-8 878B ±272%
_ZenQ_NumWriters8_InputSize6000000-8 671B ±222%
_Chan_NumWriters100_InputSize6000000-8 46.0kB ±31%
_ZenQ_NumWriters100_InputSize6000000-8 13.3kB ±100%
_Chan_NumWriters1000_InputSize7000000-8 488kB ± 8%
_ZenQ_NumWriters1000_InputSize7000000-8 2.37kB ±210%
_Chan_Million_Blocking_Writers-8 553MB ± 0%
_ZenQ_Million_Blocking_Writers-8 95.5MB ± 0%
name allocs/op
_Chan_NumWriters1_InputSize600-8 0.00
_ZenQ_NumWriters1_InputSize600-8 0.00
_Chan_NumWriters3_InputSize60000-8 0.00
_ZenQ_NumWriters3_InputSize60000-8 0.00
_Chan_NumWriters8_InputSize6000000-8 2.77 ±225%
_ZenQ_NumWriters8_InputSize6000000-8 1.40 ±186%
_Chan_NumWriters100_InputSize6000000-8 164 ±20%
_ZenQ_NumWriters100_InputSize6000000-8 31.8 ±100%
_Chan_NumWriters1000_InputSize7000000-8 1.79k ± 3%
_ZenQ_NumWriters1000_InputSize7000000-8 5.50 ±227%
_Chan_Million_Blocking_Writers-8 2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8 995k ± 0%
Also note that with queue_size = 4096, ZenQ is currently slower than channels in the million blocking writers case, but after increasing queue_size to 2^14 for both, ZenQ again performs better than channels in the million-goroutine case.
queue_size = 2^14
Benchmark_Chan_Million_Blocking_Writers-8 1 10982172959 ns/op 551333344 B/op 1991416 allocs/op
Benchmark_ZenQ_Million_Blocking_Writers-8 1 9734108333 ns/op 92514952 B/op 963291 allocs/op
PASS
ok command-line-arguments 41.376s
@egonelbre
Update:- With a few tweaks, ZenQ now performs faster than channels even in the million-goroutine case, for all queue_sizes.
Here are the latest benchmarks for queue_size 4096
name time/op
_Chan_NumWriters1_InputSize600-8 23.2µs ± 1%
_ZenQ_NumWriters1_InputSize600-8 18.1µs ± 1%
_Chan_NumWriters3_InputSize60000-8 5.52ms ± 3%
_ZenQ_NumWriters3_InputSize60000-8 2.67ms ± 6%
_Chan_NumWriters8_InputSize6000000-8 680ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8 308ms ± 4%
_Chan_NumWriters100_InputSize6000000-8 1.56s ± 6%
_ZenQ_NumWriters100_InputSize6000000-8 519ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8 1.98s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8 441ms ±11%
_Chan_Million_Blocking_Writers-8 10.4s ± 3%
_ZenQ_Million_Blocking_Writers-8 8.56s ±24%
name alloc/op
_Chan_NumWriters1_InputSize600-8 0.00B
_ZenQ_NumWriters1_InputSize600-8 0.00B
_Chan_NumWriters3_InputSize60000-8 110B ±68%
_ZenQ_NumWriters3_InputSize60000-8 23.6B ±107%
_Chan_NumWriters8_InputSize6000000-8 585B ±234%
_ZenQ_NumWriters8_InputSize6000000-8 411B ±299%
_Chan_NumWriters100_InputSize6000000-8 44.7kB ±35%
_ZenQ_NumWriters100_InputSize6000000-8 19.7kB ±78%
_Chan_NumWriters1000_InputSize7000000-8 483kB ±10%
_ZenQ_NumWriters1000_InputSize7000000-8 1.13kB ±602%
_Chan_Million_Blocking_Writers-8 553MB ± 0%
_ZenQ_Million_Blocking_Writers-8 95.5MB ± 0%
name allocs/op
_Chan_NumWriters1_InputSize600-8 0.00
_ZenQ_NumWriters1_InputSize600-8 0.00
_Chan_NumWriters3_InputSize60000-8 0.00
_ZenQ_NumWriters3_InputSize60000-8 0.00
_Chan_NumWriters8_InputSize6000000-8 2.20 ±218%
_ZenQ_NumWriters8_InputSize6000000-8 0.90 ±344%
_Chan_NumWriters100_InputSize6000000-8 163 ±18%
_ZenQ_NumWriters100_InputSize6000000-8 47.0 ±79%
_Chan_NumWriters1000_InputSize7000000-8 1.79k ± 6%
_ZenQ_NumWriters1000_InputSize7000000-8 2.00 ±550%
_Chan_Million_Blocking_Writers-8 2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8 995k ± 0%
It does not make too much sense to talk about performance until there's feature parity. As I mentioned, it's easy to make a faster queue, if you don't need all the features. Similarly, there are still bugs in the Select, which make the benchmark results not useful.
This is the bug I mentioned in the previous comment (this program fails to complete sometimes):
package main

import (
	"sync"

	"github.com/alphadose/zenq"
)

func main() {
	var wg sync.WaitGroup
	defer wg.Wait()

	zq1 := zenq.New[int32]()
	zq2 := zenq.New[uint32]()

	wg.Add(2)
	go func() {
		defer wg.Done()
		for {
			v := zenq.Select(zq1, zq2)
			if v == int32(0) {
				break
			}
		}
	}()
	go func() {
		defer wg.Done()
		for {
			v := zenq.Select(zq1, zq2)
			if v == int32(0) {
				break
			}
		}
	}()

	for i := 0; i < 1e4; i++ {
		zq1.Write(1)
	}
	println("shutting down")
	zq1.Write(0) // using 0 because closing has not been implemented
	zq1.Write(0)
}
Also, there's a mutex per element, which makes it behave like a FIFO. This also introduces a significant memory overhead for things like make(chan struct{}, 1e9) and make(chan byte, 1e9).
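As a back-of-envelope illustration of that overhead (assuming a hypothetical 8-byte state word per slot; the real per-slot cost depends on the implementation):

```go
package main

import "fmt"

// slotOverheadBytes estimates the fixed cost of per-slot state for a
// queue of the given capacity. A channel of zero-sized elements like
// struct{} stores no element data at all, but a per-slot state design
// would still pay this cost up front.
func slotOverheadBytes(capacity, stateBytes int64) int64 {
	return capacity * stateBytes
}

func main() {
	// make(chan struct{}, 1e9) allocates essentially no element
	// storage; 8 bytes of state per slot would add ~8 GB here.
	fmt.Println(slotOverheadBytes(1e9, 8)) // 8000000000
}
```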
EDIT: Removed note about fairness of two concurrent receives, as I'm not sure whether it's required.
Also, your implementation has a hardcoded size plus a requirement that the number of elements be a power of 2.
Will work on the fairness of select by going through the channel tests
Also, regarding the power-of-2 element count: this is just a POC; a const is required for declaring arrays instead of slices. If this gets merged into the Go core, users could provide any size for the channel and the underlying implementation could use fastlog2() to get the nearest power of 2 and build the channel on top of that. For example, for make(chan int, 100) the underlying size would be 1 << fastlog2(100) = ~128.
The tradeoff of not using powers of 2 is having to use modulo instead of index masking to compute the read/write positions, which costs a few extra nanoseconds per operation.
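The rounding-plus-masking idea can be sketched as follows (nextPow2 and the sizes are my own illustration, not ZenQ's fastlog2; note that rounding up, rather than down, preserves the requested capacity):

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextPow2 rounds n up to the nearest power of two, mirroring the
// suggested make(chan int, 100) -> 128 mapping from the comment above.
func nextPow2(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len64(n-1)
}

func main() {
	size := nextPow2(100) // 128
	mask := size - 1

	// With a power-of-two size, the ring index is a cheap AND
	// instead of a modulo:
	for _, i := range []uint64{0, 100, 127, 128, 300} {
		fmt.Println(i&mask == i%size) // always true for pow2 sizes
	}
	fmt.Println(size)
}
```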
Also, there isn't one mutex per element; it's just num(mutexes) = queue_size,
i.e. only one mutex per slot, so there isn't much memory overhead. This is evident from the million-goroutine case, where ZenQ used 400 MB less than channels.
@egonelbre I just benchmarked with type Payload byte; there isn't much memory overhead, same as before:
goos: darwin
goarch: arm64
Benchmark_Chan_NumWriters1_InputSize600-8 52896 19625 ns/op 0 B/op 0 allocs/op
Benchmark_ZenQ_NumWriters1_InputSize600-8 93112 12912 ns/op 0 B/op 0 allocs/op
Benchmark_Chan_NumWriters3_InputSize60000-8 262 4607065 ns/op 74 B/op 0 allocs/op
Benchmark_ZenQ_NumWriters3_InputSize60000-8 510 2291942 ns/op 35 B/op 0 allocs/op
Benchmark_Chan_NumWriters8_InputSize6000000-8 2 511401104 ns/op 512 B/op 2 allocs/op
Benchmark_ZenQ_NumWriters8_InputSize6000000-8 3 355385736 ns/op 970 B/op 2 allocs/op
Benchmark_Chan_NumWriters100_InputSize6000000-8 1 1266029250 ns/op 31104 B/op 134 allocs/op
Benchmark_ZenQ_NumWriters100_InputSize6000000-8 2 551560271 ns/op 28912 B/op 69 allocs/op
Benchmark_Chan_NumWriters1000_InputSize7000000-8 1 1860939083 ns/op 483072 B/op 1775 allocs/op
Benchmark_ZenQ_NumWriters1000_InputSize7000000-8 3 507001500 ns/op 0 B/op 0 allocs/op
Benchmark_Chan_Million_Blocking_Writers-8 1 10854043417 ns/op 552819712 B/op 1997209 allocs/op
Benchmark_ZenQ_Million_Blocking_Writers-8 1 10369999125 ns/op 95540840 B/op 994865 allocs/op
PASS
ok command-line-arguments 39.695s
Same with type Payload struct{}.
Putting benchmarks aside, is there anything else beyond selection fairness and channel closing to be implemented to achieve feature parity?
Also regarding the power of 2 elements...
I don't think restricting to pow2 is an acceptable tradeoff.
const is required for declaring arrays instead..
The slice should be implementable similarly how it's implemented currently. I suspect it still would need to be a slice in the runtime, otherwise it would increase binary bloat.
there is only 1 mutex per slot, hence there isn't much memory overhead, this is evident from the million goroutines case, zenq took 400 MB lesser than channels
Size benchmark:
BenchmarkChan_New/struct{}-32 29258943 42.65 ns/op 96 B/op 1 allocs/op
BenchmarkChan_New/byte-32 1000000 1110 ns/op 4864 B/op 1 allocs/op
BenchmarkChan_New/int64-32 144070 8599 ns/op 40960 B/op 1 allocs/op
BenchmarkZenQ_New/struct{}-32 48247 27403 ns/op 139264 B/op 1 allocs/op
BenchmarkZenQ_New/byte-32 46503 26348 ns/op 139264 B/op 1 allocs/op
BenchmarkZenQ_New/int64-32 46122 26360 ns/op 139264 B/op 1 allocs/op
Putting benchmarks aside, anything else from selection fairness and closing channels to be implemented to achieve feature parity?
I don't think this is the full list, and I'm not the person who can approve the change, but off the top of my head:
Or in other words, the problem isn't coming up with a slightly better queue, but integrating it into Go codebase and ensuring it doesn't break anything and preserves the necessary properties.
Regarding the race detector: I cannot currently use race.Acquire since this is not a runtime-internal package; that is only possible from within the Go codebase.
Ah, also select needs to support both send and recv. Similarly the case of recv/send from nil channels.
Apparently I made a critical error in one of my last comments:
I'm the person who can approve the change....
It should have said "I'm not the person who can approve..."... I now need to find the largest facepalm emote.
@egonelbre
Updates:-
Now I will run this implementation against the test suite you provided and try to pass all the tests, modifying the current implementation if required.
@egonelbre Updates:-
By adding direct_send and an optimistic first-pass read approach to the selector, I was able to improve its performance.
It's now comparable to channels, although still not as fast for large batch sizes. Here are some basic benchmarks for selection:
With Input Batch Size: 60 and Num Concurrent Writers: 4
Chan Select Runner completed transfer in: 61.375µs
ZenQ Select Runner completed transfer in: 73.416µs
====================================================================
With Input Batch Size: 600 and Num Concurrent Writers: 4
Chan Select Runner completed transfer in: 109.041µs
ZenQ Select Runner completed transfer in: 399.125µs
====================================================================
With Input Batch Size: 6000 and Num Concurrent Writers: 4
Chan Select Runner completed transfer in: 896.917µs
ZenQ Select Runner completed transfer in: 3.627416ms
====================================================================
With Input Batch Size: 600000 and Num Concurrent Writers: 4
Chan Select Runner completed transfer in: 129.959167ms
ZenQ Select Runner completed transfer in: 300.164333ms
====================================================================
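The optimistic first-pass idea above can be sketched roughly like this (tryQueue and TryRead are hypothetical stand-ins, not ZenQ's actual API): scan every queue once without blocking, and only fall back to the slow blocking/parking path when all of them are empty.

```go
package main

import (
	"fmt"
	"sync"
)

// tryQueue is a toy queue with a non-blocking read, used only to
// illustrate the optimistic first pass of a selector.
type tryQueue struct {
	mu    sync.Mutex
	items []any
}

// TryRead pops the head element without blocking; ok is false when
// the queue is empty.
func (q *tryQueue) TryRead() (v any, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return nil, false
	}
	v = q.items[0]
	q.items = q.items[1:]
	return v, true
}

func (q *tryQueue) Write(v any) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.mu.Unlock()
}

// selectFast is the optimistic first pass: return a value from the
// first non-empty queue, or (nil, false) to signal that the caller
// should take the slow, blocking path.
func selectFast(qs ...*tryQueue) (any, bool) {
	for _, q := range qs {
		if v, ok := q.TryRead(); ok {
			return v, true
		}
	}
	return nil, false
}

func main() {
	q1, q2 := &tryQueue{}, &tryQueue{}
	q2.Write("hello")
	v, ok := selectFast(q1, q2)
	fmt.Println(v, ok) // hello true
}
```

Note that scanning in a fixed order is exactly the kind of unfairness raised earlier in the thread; a fair version would start the scan at a random or rotating offset.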
@alphadose note, there's still a significant memory usage difference between zenq and channels. 5x for the usual case.
For a fair comparison, memory usage is better split into two categories: initialization memory and operational memory.
In terms of initialization memory, channels are better and have a lower footprint.
But in terms of operational memory usage, ZenQ is better:
name alloc/op
_Chan_NumWriters1_InputSize600-8 0.00B
_ZenQ_NumWriters1_InputSize600-8 0.00B
_Chan_NumWriters3_InputSize60000-8 104B ±54%
_ZenQ_NumWriters3_InputSize60000-8 22.2B ±91%
_Chan_NumWriters8_InputSize6000000-8 813B ±307%
_ZenQ_NumWriters8_InputSize6000000-8 690B ±115%
_Chan_NumWriters100_InputSize6000000-8 42.6kB ±36%
_ZenQ_NumWriters100_InputSize6000000-8 5.92kB ±118%
_Chan_NumWriters1000_InputSize7000000-8 475kB ±11%
_ZenQ_NumWriters1000_InputSize7000000-8 41.6kB ±34%
_Chan_Million_Blocking_Writers-8 553MB ± 0%
_ZenQ_Million_Blocking_Writers-8 47.4MB ± 9%
name allocs/op
_Chan_NumWriters1_InputSize600-8 0.00
_ZenQ_NumWriters1_InputSize600-8 0.00
_Chan_NumWriters3_InputSize60000-8 0.00
_ZenQ_NumWriters3_InputSize60000-8 0.00
_Chan_NumWriters8_InputSize6000000-8 2.77 ±225%
_ZenQ_NumWriters8_InputSize6000000-8 2.63 ±52%
_Chan_NumWriters100_InputSize6000000-8 157 ±17%
_ZenQ_NumWriters100_InputSize6000000-8 14.3 ±116%
_Chan_NumWriters1000_InputSize7000000-8 1.77k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-8 30.3 ±42%
_Chan_Million_Blocking_Writers-8 2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8 1.00M ± 0%
If you pay the initialization cost, you gain a substantial reduction in operational memory usage,
and the initialization cost is really cheap on today's hardware.
It's the operational cost that needs to be reduced and optimised.
There are many ephemeral channels for context cancellations (and similar) that are impacted by the memory usage. Of course, it's difficult to understand the real-world impact without testing. Based on my experience, small channels are more common than large channels that require high-throughput.
True, when throughput is low it is better to use channels to save resources,
but there have also been cases where channels were insufficient and users implemented their own queues.
IMO Go should also provide a throughput-optimised communication mechanism.
If 95% (just guessing) of the channel usage is non-throughput related then it needs to be carefully balanced with throughput. Similarly, the performance and memory usage of such use-cases is probably more important for the stdlib than high-workload and high-throughput.
Similarly, As you've demonstrated it's possible to have external library queues that trade-off memory for performance. So there's no pressing need to have it in the stdlib, however, I definitely agree it would be nice.
Actually, there is one more important reason for making it part of the standard library.
Currently, it is not possible to access the goroutine struct and similar internals from an external library.
With that access, more optimisations could be made, making the current ZenQ even more efficient, which isn't possible from an external library.
Squeezing out the last bit of performance can only be done from inside the Go runtime, and once that is done, no external library will be able to match the native performance (unless it uses a better algorithm).
@egonelbre performance has improved a lot in the million-goroutine case:
name time/op
_Chan_NumWriters1_InputSize600-8 23.3µs ± 1%
_ZenQ_NumWriters1_InputSize600-8 38.7µs ± 1%
_Chan_NumWriters3_InputSize60000-8 5.48ms ± 1%
_ZenQ_NumWriters3_InputSize60000-8 2.63ms ± 1%
_Chan_NumWriters8_InputSize6000000-8 685ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8 254ms ± 3%
_Chan_NumWriters100_InputSize6000000-8 1.60s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8 298ms ± 1%
_Chan_NumWriters1000_InputSize7000000-8 1.98s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8 409ms ± 1%
_Chan_Million_Blocking_Writers-8 10.5s ± 1%
_ZenQ_Million_Blocking_Writers-8 1.99s ±16%
name alloc/op
_Chan_NumWriters1_InputSize600-8 0.00B
_ZenQ_NumWriters1_InputSize600-8 0.00B
_Chan_NumWriters3_InputSize60000-8 17.5B ±163%
_ZenQ_NumWriters3_InputSize60000-8 13.4B ±348%
_Chan_NumWriters8_InputSize6000000-8 123B ±148%
_ZenQ_NumWriters8_InputSize6000000-8 545B ±56%
_Chan_NumWriters100_InputSize6000000-8 36.1kB ±49%
_ZenQ_NumWriters100_InputSize6000000-8 9.32kB ±32%
_Chan_NumWriters1000_InputSize7000000-8 479kB ± 8%
_ZenQ_NumWriters1000_InputSize7000000-8 89.3kB ± 5%
_Chan_Million_Blocking_Writers-8 553MB ± 0%
_ZenQ_Million_Blocking_Writers-8 122MB ± 3%
name allocs/op
_Chan_NumWriters1_InputSize600-8 0.00
_ZenQ_NumWriters1_InputSize600-8 0.00
_Chan_NumWriters3_InputSize60000-8 0.00
_ZenQ_NumWriters3_InputSize60000-8 0.00
_Chan_NumWriters8_InputSize6000000-8 1.10 ±173%
_ZenQ_NumWriters8_InputSize6000000-8 3.19 ±57%
_Chan_NumWriters100_InputSize6000000-8 140 ±32%
_ZenQ_NumWriters100_InputSize6000000-8 21.8 ±33%
_Chan_NumWriters1000_InputSize7000000-8 1.77k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-8 46.5 ±27%
_Chan_Million_Blocking_Writers-8 2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8 1.00M ± 0%
@egonelbre Updates:-
operational memory usage reduced drastically (especially evident in the case of 1000 goroutines)
goos: darwin
goarch: arm64
name time/op
_Chan_NumWriters1_InputSize600-8 23.4µs ± 1%
_ZenQ_NumWriters1_InputSize600-8 17.7µs ± 0%
_Chan_NumWriters3_InputSize60000-8 5.54ms ± 5%
_ZenQ_NumWriters3_InputSize60000-8 2.63ms ± 2%
_Chan_NumWriters8_InputSize6000000-8 687ms ± 2%
_ZenQ_NumWriters8_InputSize6000000-8 243ms ± 4%
_Chan_NumWriters100_InputSize6000000-8 1.59s ± 4%
_ZenQ_NumWriters100_InputSize6000000-8 296ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8 1.97s ± 0%
_ZenQ_NumWriters1000_InputSize7000000-8 409ms ± 2%
_Chan_Million_Blocking_Writers-8 10.4s ± 4%
_ZenQ_Million_Blocking_Writers-8 1.83s ±10%
name alloc/op
_Chan_NumWriters1_InputSize600-8 0.00B
_ZenQ_NumWriters1_InputSize600-8 0.00B
_Chan_NumWriters3_InputSize60000-8 117B ±63%
_ZenQ_NumWriters3_InputSize60000-8 22.1B ±122%
_Chan_NumWriters8_InputSize6000000-8 1.01kB ±196%
_ZenQ_NumWriters8_InputSize6000000-8 1.12kB ±89%
_Chan_NumWriters100_InputSize6000000-8 42.6kB ±37%
_ZenQ_NumWriters100_InputSize6000000-8 11.3kB ±28%
_Chan_NumWriters1000_InputSize7000000-8 481kB ± 7%
_ZenQ_NumWriters1000_InputSize7000000-8 90.5kB ± 6%
_Chan_Million_Blocking_Writers-8 553MB ± 0%
_ZenQ_Million_Blocking_Writers-8 123MB ± 4%
name allocs/op
_Chan_NumWriters1_InputSize600-8 0.00
_ZenQ_NumWriters1_InputSize600-8 0.00
_Chan_NumWriters3_InputSize60000-8 0.00
_ZenQ_NumWriters3_InputSize60000-8 0.00
_Chan_NumWriters8_InputSize6000000-8 3.43 ±162%
_ZenQ_NumWriters8_InputSize6000000-8 5.23 ±53%
_Chan_NumWriters100_InputSize6000000-8 158 ±20%
_ZenQ_NumWriters100_InputSize6000000-8 26.3 ±29%
_Chan_NumWriters1000_InputSize7000000-8 1.76k ± 2%
_ZenQ_NumWriters1000_InputSize7000000-8 48.3 ±28%
_Chan_Million_Blocking_Writers-8 2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8 1.00M ± 0%
Here are a few more benchmarks for different environments/systems
1. windows, amd64, 16 cores
goos: windows
goarch: amd64
cpu: AMD Ryzen 7 5800H with Radeon Graphics
name                                        time/op
_Chan_NumWriters1_InputSize600-16           24.5µs ± 5%
_ZenQ_NumWriters1_InputSize600-16           17.7µs ± 2%
_Chan_NumWriters3_InputSize60000-16         4.75ms ± 3%
_ZenQ_NumWriters3_InputSize60000-16         1.86ms ± 1%
_Chan_NumWriters8_InputSize6000000-16       800ms ± 5%
_ZenQ_NumWriters8_InputSize6000000-16       150ms ± 1%
_Chan_NumWriters100_InputSize6000000-16     1.66s ± 1%
_ZenQ_NumWriters100_InputSize6000000-16     160ms ± 1%
_Chan_NumWriters1000_InputSize7000000-16    1.95s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-16    269ms ± 1%
_Chan_Million_Blocking_Writers-16           5.79s ± 2%
_ZenQ_Million_Blocking_Writers-16           1.37s ± 6%
name                                        alloc/op
_Chan_NumWriters1_InputSize600-16           0.00B
_ZenQ_NumWriters1_InputSize600-16           0.00B
_Chan_NumWriters3_InputSize60000-16         150B ±57%
_ZenQ_NumWriters3_InputSize60000-16         20.6B ±201%
_Chan_NumWriters8_InputSize6000000-16       472B ±283%
_ZenQ_NumWriters8_InputSize6000000-16       1.05kB ±58%
_Chan_NumWriters100_InputSize6000000-16     43.7kB ±38%
_ZenQ_NumWriters100_InputSize6000000-16     29.7kB ±17%
_Chan_NumWriters1000_InputSize7000000-16    484kB ± 7%
_ZenQ_NumWriters1000_InputSize7000000-16    120kB ±14%
_Chan_Million_Blocking_Writers-16           553MB ± 0%
_ZenQ_Million_Blocking_Writers-16           128MB ± 4%
name                                        allocs/op
_Chan_NumWriters1_InputSize600-16           0.00
_ZenQ_NumWriters1_InputSize600-16           0.00
_Chan_NumWriters3_InputSize60000-16         0.00
_ZenQ_NumWriters3_InputSize60000-16         0.00
_Chan_NumWriters8_InputSize6000000-16       2.00 ±150%
_ZenQ_NumWriters8_InputSize6000000-16       3.90 ±28%
_Chan_NumWriters100_InputSize6000000-16     148 ±34%
_ZenQ_NumWriters100_InputSize6000000-16     68.3 ±24%
_Chan_NumWriters1000_InputSize7000000-16    1.79k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-16    62.3 ±36%
_Chan_Million_Blocking_Writers-16           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-16           1.00M ± 0%
2. ubuntu, amd64, 16 cores
goos: linux
goarch: amd64
cpu: AMD Ryzen 7 5800H with Radeon Graphics
name                                        time/op
_Chan_NumWriters1_InputSize600-16           23.4µs ± 4%
_ZenQ_NumWriters1_InputSize600-16           33.1µs ± 4%
_Chan_NumWriters3_InputSize60000-16         2.59ms ± 3%
_ZenQ_NumWriters3_InputSize60000-16         1.79ms ± 1%
_Chan_NumWriters8_InputSize6000000-16       334ms ± 6%
_ZenQ_NumWriters8_InputSize6000000-16       162ms ± 4%
_Chan_NumWriters100_InputSize6000000-16     515ms ± 6%
_ZenQ_NumWriters100_InputSize6000000-16     170ms ± 3%
_Chan_NumWriters1000_InputSize7000000-16    1.76s ± 3%
_ZenQ_NumWriters1000_InputSize7000000-16    273ms ± 2%
_Chan_Million_Blocking_Writers-16           4.52s ± 5%
_ZenQ_Million_Blocking_Writers-16           1.27s ±14%
name                                        alloc/op
_Chan_NumWriters1_InputSize600-16           0.00B
_ZenQ_NumWriters1_InputSize600-16           0.00B
_Chan_NumWriters3_InputSize60000-16         91.7B ±51%
_ZenQ_NumWriters3_InputSize60000-16         4.00B ± 0%
_Chan_NumWriters8_InputSize6000000-16       487B ±275%
_ZenQ_NumWriters8_InputSize6000000-16       879B ±111%
_Chan_NumWriters100_InputSize6000000-16     30.0kB ±47%
_ZenQ_NumWriters100_InputSize6000000-16     23.2kB ±54%
_Chan_NumWriters1000_InputSize7000000-16    463kB ±11%
_ZenQ_NumWriters1000_InputSize7000000-16    129kB ±10%
_Chan_Million_Blocking_Writers-16           553MB ± 0%
_ZenQ_Million_Blocking_Writers-16           124MB ± 3%
name                                        allocs/op
_Chan_NumWriters1_InputSize600-16           0.00
_ZenQ_NumWriters1_InputSize600-16           0.00
_Chan_NumWriters3_InputSize60000-16         0.00
_ZenQ_NumWriters3_InputSize60000-16         0.00
_Chan_NumWriters8_InputSize6000000-16       1.57 ±219%
_ZenQ_NumWriters8_InputSize6000000-16       3.48 ±44%
_Chan_NumWriters100_InputSize6000000-16     87.8 ±40%
_ZenQ_NumWriters100_InputSize6000000-16     54.3 ±54%
_Chan_NumWriters1000_InputSize7000000-16    1.67k ± 9%
_ZenQ_NumWriters1000_InputSize7000000-16    63.5 ±10%
_Chan_Million_Blocking_Writers-16           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-16           1.00M ± 0%
I would be interested to see these benchmark results on other architectures, and on systems with fewer resources, as a Ryzen 5800H is quite beefy compared to a lot of the systems this could be running on. Your implementation scales much better on the larger end of inputs/writers, but I'd be interested in how the initialization overhead scales down on more constrained systems.
Cool, I have been collecting benchmarks for all kinds of systems and will definitely benchmark this on low-end systems like the Raspberry Pi.
I am gathering all benchmarks here
@NyaaaWhatsUpDoc apologies for the delay, here are the benchmarks for a low-end device (32-bit Raspberry Pi):
Device Info
alphadose@neverwinter
OS: Raspbian GNU/Linux 11 (bullseye) armv7l
Host: Raspberry Pi 4 Model B Rev 1.5
Kernel: 5.15.32-v7l+
Uptime: 1 hour, 58 mins
Packages: 569 (dpkg)
Shell: bash 5.1.4
Terminal: /dev/pts/0
CPU: BCM2711 (4) @ 1.800GHz
Memory: 68MiB / 3838MiB
Benchstat of 20 cases
goos: linux
goarch: arm
name time/op
_Chan_NumWriters1_InputSize600-4 230µs ± 4%
_ZenQ_NumWriters1_InputSize600-4 186µs ± 5%
_Chan_NumWriters3_InputSize60000-4 28.2ms ± 3%
_ZenQ_NumWriters3_InputSize60000-4 12.8ms ± 0%
_Chan_NumWriters8_InputSize6000000-4 4.14s ±10%
_ZenQ_NumWriters8_InputSize6000000-4 1.32s ± 1%
_Chan_NumWriters100_InputSize6000000-4 5.97s ± 5%
_ZenQ_NumWriters100_InputSize6000000-4 1.48s ± 5%
_Chan_NumWriters1000_InputSize7000000-4 7.23s ± 6%
_ZenQ_NumWriters1000_InputSize7000000-4 2.09s ± 4%
_Chan_Million_Blocking_Writers-4 20.3s ± 2%
_ZenQ_Million_Blocking_Writers-4 6.96s ± 4%
name alloc/op
_Chan_NumWriters1_InputSize600-4 0.00B
_ZenQ_NumWriters1_InputSize600-4 0.00B
_Chan_NumWriters3_InputSize60000-4 227B ±27%
_ZenQ_NumWriters3_InputSize60000-4 77.9B ±91%
_Chan_NumWriters8_InputSize6000000-4 499B ±189%
_ZenQ_NumWriters8_InputSize6000000-4 1.49kB ± 4%
_Chan_NumWriters100_InputSize6000000-4 27.5kB ±19%
_ZenQ_NumWriters100_InputSize6000000-4 27.7kB ±42%
_Chan_NumWriters1000_InputSize7000000-4 290kB ± 5%
_ZenQ_NumWriters1000_InputSize7000000-4 135kB ± 8%
_Chan_Million_Blocking_Writers-4 325MB ± 0%
_ZenQ_Million_Blocking_Writers-4 76.2MB ± 3%
name allocs/op
_Chan_NumWriters1_InputSize600-4 0.00
_ZenQ_NumWriters1_InputSize600-4 0.00
_Chan_NumWriters3_InputSize60000-4 1.00 ± 0%
_ZenQ_NumWriters3_InputSize60000-4 0.00
_Chan_NumWriters8_InputSize6000000-4 4.30 ±109%
_ZenQ_NumWriters8_InputSize6000000-4 19.2 ± 9%
_Chan_NumWriters100_InputSize6000000-4 171 ±13%
_ZenQ_NumWriters100_InputSize6000000-4 194 ±25%
_Chan_NumWriters1000_InputSize7000000-4 1.84k ± 3%
_ZenQ_NumWriters1000_InputSize7000000-4 1.09k ± 4%
_Chan_Million_Blocking_Writers-4 2.00M ± 0%
_ZenQ_Million_Blocking_Writers-4 1.00M ± 0%
We can see that ZenQ scales well on both low-end and high-end devices.
ZenQ's select{}
performance has also improved since last time:
❯ go run benchmarks/selector/main.go
With Input Batch Size: 60 and Num Concurrent Writers: 4
Chan Select Runner completed transfer in: 52.084µs
ZenQ Select Runner completed transfer in: 60.584µs
====================================================================
With Input Batch Size: 600 and Num Concurrent Writers: 4
Chan Select Runner completed transfer in: 160.708µs
ZenQ Select Runner completed transfer in: 349.792µs
====================================================================
With Input Batch Size: 6000 and Num Concurrent Writers: 4
Chan Select Runner completed transfer in: 2.030959ms
ZenQ Select Runner completed transfer in: 3.499ms
====================================================================
With Input Batch Size: 600000 and Num Concurrent Writers: 4
Chan Select Runner completed transfer in: 174.827458ms
ZenQ Select Runner completed transfer in: 226.949375ms
====================================================================
For selection, ZenQ uses Go interfaces, which internally work via dynamic dispatch; this cost adds up and accounts for the slower performance compared to native channels.
The current implementation of channels could use some improvements in terms of
ns/op
, B/op
and allocs/op
I have made a POC thread-safe queue that performs better on all three of the above metrics: https://github.com/alphadose/ZenQ
Here are the benchmarks vs native channels for a wide range of use cases: https://github.com/alphadose/ZenQ#benchmarks
My proposal is to make the underlying implementation of native channels more efficient by changing the algorithm and data structure to the lock-free one used in ZenQ, or to some other suitable implementation (debatable).