petermattis opened 7 years ago
I haven't looked at the code. Is this all calls from Go into C? Are there any callbacks from C to Go?
You say that the "Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB". Combining that with the number of operations being done per second, would it be accurate to say that the C code itself, after called from Go, takes more than 20 microseconds to run?
In Go 1.8, when a goroutine calls into C, it is still holding a GOMAXPROCS slot and blocking other goroutines from running. Only if it is running in C for more than 20 microseconds or so will the system monitor thread decide that it is blocked in C code and activate another goroutine. The fact that your system performs better when you increase GOMAXPROCS makes me suspect that there is something to improve in that area of the code.
Is this all calls from Go into C? Are there any callbacks from C to Go?
The code is all calls from Go into C. There are no callbacks from C to Go. This is true of both the reproduction code and CockroachDB.
You say that the "Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB". Combining that with the number of operations being done per second, would it be accurate to say that the C code itself, after called from Go, takes more than 20 microseconds to run?
Yes. The actual RocksDB write batch operation involves a disk write, sometimes followed by an fsync. batchtest simulates this with a 5 millisecond sleep.
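For a rough idea of the shape of the workload, here is a minimal sketch (not the actual batchtest source): many workers each repeatedly make one cgo call that blocks for ~5ms in C, standing in for the WAL write and occasional fsync.
package main

// #include <unistd.h>
// static void commitBatch(void) { usleep(5000); } /* ~5ms in C, simulating the WAL write + occasional fsync */
import "C"

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

func main() {
    const workers = 100 // batchtest defaults to 100 concurrent workers
    var ops int64
    var wg sync.WaitGroup
    start := time.Now()
    deadline := start.Add(10 * time.Second)
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for time.Now().Before(deadline) {
                C.commitBatch() // one cgo call per "commit", blocking ~5ms in C
                atomic.AddInt64(&ops, 1)
            }
        }()
    }
    wg.Wait()
    fmt.Printf("%.1f ops/sec\n", float64(atomic.LoadInt64(&ops))/time.Since(start).Seconds())
}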
Only if it is running in C for more than 20 microseconds or so will the system monitor thread decide that it is blocked in C code and activate another goroutine.
I've been perusing the Go scheduler code. The 20us time is the minimum time sysmon will sleep for. The maximum delay is 10ms, and a small bit of instrumentation (measuring the time around the usleep call) showed an actual maximum delay of 17ms, though I'm not sure how often that occurs.
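For reference, the sleep logic in sysmon has roughly this shape (paraphrased from runtime/proc.go, not the verbatim source; details vary between releases):
// Paraphrased shape of the sysmon polling loop in runtime/proc.go.
delay := uint32(0)
idle := 0 // consecutive iterations with nothing to do
for {
    if idle == 0 {
        delay = 20 // start with a 20µs sleep
    } else if idle > 50 {
        delay *= 2 // after ~1ms of idling, start doubling the sleep
    }
    if delay > 10*1000 {
        delay = 10 * 1000 // cap at 10ms -- the maximum delay mentioned above
    }
    usleep(delay)
    // ... retake Ps blocked in syscalls/cgo, poll the network, etc. ...
}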
Changing the max sysmon delay to 1ms results in a significant improvement to batchtest:
~/Development/go/src/github.com/petermattis/batchtest master ./batchtest -t cgo
_elapsed____ops/sec
1s 7493.6
2s 7279.6
3s 7217.8
4s 7498.1
5s 7898.1
6s 7899.2
7s 7799.3
8s 7798.7
9s 7698.1
10s 7395.3
Unfortunately, this didn't result in any improvement to the cockroach performance numbers, so perhaps batchtest isn't completely capturing whatever is going on.
Let me chime in. (I came from https://github.com/golang/go/issues/21827#issuecomment-366332843 and the symptoms are similar to what is described here, so I decided to continue this thread instead of creating a new issue. Please correct me if I am wrong.)
First some context:
Similarly to Cockroach, which uses CGo wrappers for RocksDB, I observed performance anomalies while using CGo wrappers for SQLite, where the presence of other goroutines combined with CGo calls on the "main" one shows big overhead: github.com/mattn/go-sqlite3 uses CGo and performs several CGo calls inside one Query or Exec. There was also an interrupt goroutine spawned for every Query or Exec to call sqlite3_interrupt if the context is canceled.
With Go 1.10, avoiding spawning that interrupt goroutine when we know the context cannot be canceled brings a ~1.5x speedup to Exec (https://github.com/mattn/go-sqlite3/pull/530).
However, Query was made faster by only 5%, because after 2b283ce database/sql itself always spawns a goroutine to close Rows on context or transaction cancel. Avoiding spawning the Rows.awaitDone goroutine brings a ~1.5x speedup to Query too (https://github.com/golang/go/issues/23879).
The theme here is that CGo calls show much higher overhead when performed not under a simple only-one-goroutine scenario but in the presence of other goroutines. So let's do some analysis:
First, on my machine I can observe that under a simple 1-goroutine-only scenario the cost of making 1 trivial CGo call is ~60ns; it scales linearly with the number of calls, it is independent of whether GOMAXPROCS is 1 or not, and it is ~35x slower than making a trivial Go call:
CallGo1 1.71ns ± 0%
CallGo1-4 1.71ns ± 0%
CallGo10 16.0ns ± 0%
CallGo10-4 16.1ns ± 0%
CallCGo1 58.3ns ± 1%
CallCGo1-4 58.6ns ± 1%
CallCGo10 626ns ± 0%
CallCGo10-4 611ns ± 5%
Then we can check what happens in the presence of other goroutines by simulating client-server requests over a channel, with the client and the server each running in its own goroutine and the server handling requests sequentially (so 2 goroutines in total), the handler making 1 trivial Go or CGo call:
RTTSeqServerGo1 396ns ± 0%
RTTSeqServerGo1-4 458ns ± 0%
RTTSeqServerCGo1 454ns ± 0%
RTTSeqServerCGo1-4 536ns ± 1%
What can be seen here is that for GOMAXPROCS=1, δ(Go1, CGo1) ≈ t(CGo call); both the Go and CGo RTT times are bigger for GOMAXPROCS!=1, with δ(Go1-4, CGo1-4) also growing to ~80ns.
It is somewhat understandable that processing time increases with GOMAXPROCS!=1 even in the Go-only case, because there can now sometimes be inter-OS-thread communication, so in the context of this issue there is probably no problem here (though those 2 goroutines run almost without overlapping each other, so there should not be a need to put them on different OS threads). However, the fact that δ(Go, CGo) grows compared to the GOMAXPROCS=1 case suggests there might be something else going on. 20ns is small, though, and might be all noise, so let's recheck what happens when the server performs 10 calls in the handler:
RTTSeqServerGo10 419ns ± 1%
RTTSeqServerGo10-4 486ns ± 1%
RTTSeqServerCGo10 1.00µs ± 0%
RTTSeqServerCGo10-4 1.97µs ± 4%
Go10, Go10-4 and CGo10 show a time increase proportional to ~9·t(corresponding call). However, CGo10-4 (i.e. the CGo case with GOMAXPROCS!=1) is now 2x slower than CGo10 (the same case with GOMAXPROCS=1), by ~1µs.
The difference between CGo10 and CGo10-4 can be attributed to inter-OS-thread communication: the strace for the CGo10 case is almost empty and contains only 10ms sleeps from sysmon during the active phase:
1403 14:41:40.757900 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
1403 14:41:40.767974 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
1403 14:41:40.778049 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
...
However, CGo10-4 shows a different picture under strace: during the active phase there is now a constant high-rate churn of futex wait and wake between the 2 OS threads, with 1-10 millisecond sleeps from sysmon also showing up occasionally:
1447 14:42:41.733626 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733631 <... futex resumed> ) = 0
1447 14:42:41.733636 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733645 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447 14:42:41.733664 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733669 <... futex resumed> ) = 0
1447 14:42:41.733674 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733697 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447 14:42:41.733703 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733708 <... futex resumed> ) = 0
1447 14:42:41.733713 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733722 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447 14:42:41.733728 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733733 <... futex resumed> ) = 0
1447 14:42:41.733738 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733747 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447 14:42:41.733753 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733758 <... futex resumed> ) = 0
1447 14:42:41.733763 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733772 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
...
# seldom
1446 14:42:41.225481 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
1446 14:42:41.235566 <... pselect6 resumed> ) = 0 (Timeout)
1446 14:42:41.235594 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
1446 14:42:41.245656 <... pselect6 resumed> ) = 0 (Timeout)
1446 14:42:41.245670 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
...
This strace suggests that for CGo10-4 the scheduler somehow decided to keep one goroutine on one OS thread and the other goroutine on another OS thread, and the channel communication between them then turns into futex operations, which bring the overhead.
The draft notesleep spinning patch from https://github.com/golang/go/issues/21827#issue-256508518 can mitigate the CGo10-4 problem somewhat:
Go1.10 Go1.10+notesleep-spin
name old time/op new time/op delta
RTTSeqServerGo1 397ns ± 0% 405ns ± 0% +2.01% (p=0.000 n=9+10)
RTTSeqServerGo1-4 461ns ± 1% 462ns ± 0% ~ (p=0.640 n=10+9)
RTTSeqServerCGo1 456ns ± 0% 466ns ± 0% +2.17% (p=0.000 n=9+9)
RTTSeqServerCGo1-4 532ns ± 1% 532ns ± 1% ~ (p=0.751 n=10+10)
RTTSeqServerGo10 419ns ± 1% 424ns ± 0% +1.17% (p=0.000 n=10+10)
RTTSeqServerGo10-4 486ns ± 1% 484ns ± 1% -0.51% (p=0.018 n=10+10)
RTTSeqServerCGo10 1.00µs ± 0% 1.02µs ± 0% +1.44% (p=0.000 n=9+10)
RTTSeqServerCGo10-4 1.97µs ± 4% 1.13µs ± 1% -42.92% (p=0.000 n=10+10)
However, it is not clear why making 10 CGo calls in sequence causes the scheduler to make this decision: the individual CGo call time is well under 20µs - the time, if I understood correctly, after which sysmon decides that something has been stuck in a CGo call or syscall for too long and removes the current OS thread from its GOMAXPROCS slot. The fact that sysmon is sleeping for >1ms in the strace also confirms that sysmon does not intervene here.
I do not know the Go scheduler internals, but clearly making several CGo calls in sequence shakes it enough to bring some disorder and slowdown. For reference, when I was seeing the problem with SQLite, my query time was ~5-6µs when using the Go driver directly, and with the standard
done := make(chan struct{})
go func() {
    select {
    case <-ctx.Done():
        // call some cancel (e.g. sqlite3_interrupt via cgo)
    case <-done:
        // work finished ok
    }
}()
// perform query
close(done)
spawned around the query, with ctx never being canceled, the time increased to ~10µs. In other words, the additional penalty can be not just 1µs but higher.
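A minimal sketch of the workaround from the go-sqlite3 and database/sql changes mentioned above: only spawn the watcher goroutine when the context is actually cancelable. doQuery and interrupt are illustrative placeholders, not real driver functions, and the snippet assumes import "context".
// queryWithCancel only spawns the cancel-watcher goroutine when it is needed.
func queryWithCancel(ctx context.Context, doQuery func() error, interrupt func()) error {
    if ctx.Done() == nil {
        // context.Background()/TODO() can never be canceled, so there is
        // nothing to watch: skip the extra goroutine and the scheduler
        // churn it causes around the cgo calls inside doQuery.
        return doQuery()
    }
    done := make(chan struct{})
    defer close(done)
    go func() {
        select {
        case <-ctx.Done():
            interrupt() // e.g. a cgo call to sqlite3_interrupt
        case <-done:
            // query finished normally
        }
    }()
    return doQuery()
}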
In summary, as @petermattis said, it looks like a sequence of CGo calls triggers some badness in the Go scheduler that would be good to understand and hopefully fix.
Thanks beforehand, Kirill
/cc @ianlancetaylor, @dvyukov, @rsc, @aclements, @bcmills, @mattn
(cgo.go)
package cgotest
// int argadjust(int arg) { return 3 + arg; }
import "C"
// XXX here because cannot use C in tests directly
func cargadjust(arg int) int {
return int(C.argadjust(C.int(arg)))
}
(cgo_test.go)
package cgotest
import "testing"
//go:noinline
func goargadjust(i int) int {
return i + 3
}
// BenchmarkCallGo1 measures overhead of 1 Go call.
func BenchmarkCallGo1(b *testing.B) {
for i := 0; i < b.N; i++ {
x := goargadjust(i)
if x != i + 3 {
b.Fatalf("goargadjust -> %d ; want %d", x, i + 3)
}
}
}
// BenchmarkCallGo10 measures overhead of 10 Go calls performed sequentially.
func BenchmarkCallGo10(b *testing.B) {
for i := 0; i < b.N; i++ {
for j := 0; j < 10; j++ {
x := goargadjust(i)
if x != i + 3 {
b.Fatalf("goargadjust -> %d ; want %d", x, i + 3)
}
}
}
}
// BenchmarkCallCGo1 measures overhead of 1 CGo call.
func BenchmarkCallCGo1(b *testing.B) {
for i := 0; i < b.N; i++ {
x := cargadjust(i)
if x != i + 3 {
b.Fatalf("cargadjust -> %d ; want %d", x, i + 3)
}
}
}
// BenchmarkCallCGo10 measures overhead of 10 CGo calls performed sequentially.
func BenchmarkCallCGo10(b *testing.B) {
for i := 0; i < b.N; i++ {
for j := 0; j < 10; j++ {
x := cargadjust(i)
if x != i + 3 {
b.Fatalf("cargadjust -> %d ; want %d", x, i + 3)
}
}
}
}
// ---------
type req struct {
respq chan *resp
arg int
}
type resp struct {
ret int
}
// client performs one request-reply cycle to a server over srvlink.
func client(srvlink chan *req, arg int) int {
rc := make(chan *resp)
srvlink <- &req{
respq: rc,
arg: arg,
}
ret := <-rc
return ret.ret
}
// server receives requests over inq, passes received request to
// handler, and returns result back over response channel.
//
// server runs until it processes all requests from inq.
type server func(inq chan *req, handler func(int) int)
// seqServer sequentially handles requests received over inq.
func seqServer(inq chan *req, handler func(int) int) {
for r := range inq {
r.respq <- &resp{ret: handler(r.arg)}
}
}
// goServer concurrently handles requests received over inq.
func goServer(inq chan *req, handler func(int) int) {
for r := range inq {
r := r
go func() {
r.respq <- &resp{ret: handler(r.arg)}
}()
}
}
// benchmarkRTT measures client-server round-trip latency for a particular
// server implementation.
func benchmarkRTT(b *testing.B, srv func(chan *req)) {
inc := make(chan *req)
go srv(inc)
for i := 0; i < b.N; i++ {
client(inc, i)
}
close(inc)
}
// --------
// mkServerGo1N returns a function that will run srv with N Go calls per request.
func mkServerGo1N(srv server, n int) func(chan *req) {
return func(inq chan *req) {
srv(inq, func(arg int) int {
for i := 0; i < n; i++ {
arg = goargadjust(arg)
}
return arg
})
}
}
// mkServerCGoN returns a function that will run srv with N CGo calls per request.
func mkServerCGoN(srv server, n int) func(chan *req) {
return func(inq chan *req) {
srv(inq, func(arg int) int {
for i := 0; i < n; i++ {
arg = cargadjust(arg)
}
return arg
})
}
}
func BenchmarkRTTSeqServerGo1(b *testing.B) { benchmarkRTT(b, mkServerGo1N(seqServer, 1)) }
//func BenchmarkRTTGoServerGo1(b *testing.B) { benchmarkRTT(b, mkServerGo1N(goServer, 1)) }
func BenchmarkRTTSeqServerCGo1(b *testing.B) { benchmarkRTT(b, mkServerCGoN(seqServer, 1)) }
//func BenchmarkRTTGoServerCGo1(b *testing.B) { benchmarkRTT(b, mkServerCGoN(goServer, 1)) }
func BenchmarkRTTSeqServerGo10(b *testing.B) { benchmarkRTT(b, mkServerGo1N(seqServer, 10)) }
func BenchmarkRTTSeqServerCGo10(b *testing.B) { benchmarkRTT(b, mkServerCGoN(seqServer, 10)) }
//func BenchmarkRTTGoServerCGo10(b *testing.B) { benchmarkRTT(b, mkServerCGoN(goServer, 10)) }
However it is not clear why the fact of making 10 CGo calls in sequence affected the scheduler to make this decision
From the runtime's perspective, both RTTSeqServerCGo1 and RTTSeqServerCGo10 do exactly the same thing -- call a long sequence of cgo calls. From the runtime's perspective the inner loop does not make any difference. So there is something else happening. CPU scaling/turbo/overheating? Have you tried running them for longer? In isolation? With CPU affinity? With CPU scaling turned off?
@dvyukov thanks for feedback.
From run time perspective both RTTSeqServerCGo1 and RTTSeqServerCGo10 do exactly the same -- call a long sequences of cgo calls.
There is a difference - the CGo1 case is like this:
Gclient                    Gserver
srvlink <- req
                           req <- srvlink
                           CGo call x1
                           resp <- ...
resp <- ...
while CGo10 is like:
Gclient                    Gserver
srvlink <- req
                           req <- srvlink
                           CGo call x10
                           resp <- ...
resp <- ...
Every CGo call triggers at least runtime.entersyscall / runtime.exitsyscall, so there is a difference in the runtime call sequences between CGo1 and CGo10: the channel send operations are there in both cases, and the difference is in how many cgo calls are made between the server receiving a request and sending the reply back.
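Roughly, every C.f() call goes through runtime.cgocall, which has approximately this shape (paraphrased, not the verbatim runtime source):
// Paraphrased shape of runtime.cgocall.
func cgocall(fn, arg unsafe.Pointer) int32 {
    entersyscall()               // tell the scheduler this M may block; the P can be retaken
    errno := asmcgocall(fn, arg) // switch to the system stack and run the C function
    exitsyscall()                // reacquire a P; this is where handoffs and futex wakeups can happen
    return errno
}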
I was running the tests with cpupower frequency-set -g performance and monitoring dmesg -w to make sure overheating and CPU throttling were not getting in the way. I also re-verified the timings several times by running each test individually, with a pause (so the CPU can cool down) between different runs.
The only mistake in my previous runs was that turbo was not disabled, so to recheck I now disable it with cpupower frequency-set --min 2.6GHz --max 2.6GHz (2.6GHz is the max non-turbo frequency of my CPU) and show below a complete dump of a relatively long run where the different tests are intermixed with each other, so that if e.g. overheating or something else interferes it should be visible to both CGo1 and CGo10:
$ ./neotest info-local
# Sun, 18 Feb 2018 17:04:37 +0300
# kirr@deco.navytux.spb.ru (2401:5180:0:37::1 192.168.0.2)
# Linux deco 4.14.0-3-amd64 #1 SMP Debian 4.14.13-1 (2018-01-14) x86_64 GNU/Linux
# cpu: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
# cpu[0-3]: freq: intel_pstate/performance [2.60GHz - 2.60GHz]
# cpu[0-3]: idle: intel_idle/menu: POLL(0μs) C1(2μs) C1E(10μs) C3(70μs) C6(85μs) C7s(124μs) C8(200μs) C9(480μs) C10(890μs)
# cpu: WARNING: C-state exit-latency is max 890μs - up to that can add to networked and IPC request-reply latency
# sda: SanDisk X400 M.2 rev 0012 477G
# eth0: Intel Corporation Ethernet Connection I219-LM rev 21
# eth0: features: rx tx sg tso !ufo gso gro !lro rxvlan txvlan !ntuple rxhash ...
# eth0: coalesce: rxc: 3μs/0f/0μs-irq/0f-irq, txc: 0μs/0f/0μs-irq/0f-irq
# eth0: up, speed=1000, mtu=1500, txqlen=1000, gro_flush_timeout=0.000µs
# wlan0: Intel Corporation Wireless 8260 rev 3a
# wlan0: features: !rx !tx sg !tso !ufo gso gro !lro !rxvlan !txvlan !ntuple !rxhash ...
# wlan0: coalesce: rxc: ?, txc: ?
# wlan0: down, speed=?, mtu=1500, txqlen=1000, gro_flush_timeout=0.000µs
# wlan0: WARNING: TSO not enabled - TCP latency with packets > MSS will be poor
# Python 2.7.14
# go version go1.10 linux/amd64
# sqlite 3.22.0 (py mod 2.6.0)
# mysqld Ver 10.1.29-MariaDB-6 for debian-linux-gnu on x86_64 (Debian buildd-unstable)
# neo : v1.8.1-1532-gc0502d41-dirty
# zodb : 5.3.0-11-gde1f24ca0
# zeo : 5.1.1-1-gfdcc2739
# mysqlclient : 1.3.12
# wendelin.core : v0.11-14-gc3cc8a9
$ time bash -c 'for in in `seq 10`; do ./cgo.test -test.bench "RTTSeqServerCGo1.*" -test.benchtime 10s -test.cpu 1,4; done'
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 592 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 640 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1302 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2499 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 592 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 652 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1319 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2387 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 592 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 640 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1303 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2489 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 20000000 604 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 651 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1335 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2464 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 595 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 649 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1329 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2423 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 20000000 601 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 646 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1300 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2392 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 20000000 603 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 640 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1309 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2482 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 20000000 601 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 647 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1313 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2439 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 597 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 647 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1306 ns/op
BenchmarkRTTSeqServerCGo10-4 10000000 2451 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 597 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 656 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1312 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2424 ns/op
PASS
real 10m1,226s
user 11m14,922s
sys 0m42,169s
i.e. the timings are stable, there are no outliers, and the summary is:
RTTSeqServerCGo1 597ns ± 1%
RTTSeqServerCGo1-4 647ns ± 1%
RTTSeqServerCGo10 1.31µs ± 2%
RTTSeqServerCGo10-4 2.44µs ± 2%
This covers "scaling/turbo/overheat" and "running for longer".
Now let's check setting affinity. First, just setting GOMAXPROCS=2 does not make much difference compared to the default GOMAXPROCS=4:
$ time bash -c 'for in in `seq 3`; do ./cgo.test -test.bench "RTTSeqServerCGo1.*" -test.benchtime 10s -test.cpu 2; done'
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 647 ns/op
BenchmarkRTTSeqServerCGo10-2 5000000 2682 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 655 ns/op
BenchmarkRTTSeqServerCGo10-2 5000000 2615 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 655 ns/op
BenchmarkRTTSeqServerCGo10-2 5000000 2551 ns/op
PASS
real 1m28,294s
user 1m47,062s
sys 0m14,043s
but the CGo10-2 overhead goes away if I set the test process affinity to only the first two CPUs:
$ time taskset -c 0,1 bash -c 'for in in `seq 3`; do ./cgo.test -test.bench "RTTSeqServerCGo1.*" -test.benchtime 10s; done'
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 608 ns/op
BenchmarkRTTSeqServerCGo10-2 10000000 1364 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 619 ns/op
BenchmarkRTTSeqServerCGo10-2 10000000 1369 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 604 ns/op
BenchmarkRTTSeqServerCGo10-2 10000000 1376 ns/op
PASS
real 1m23,806s
user 1m28,831s
sys 0m1,393s
probably because setting affinity removes many context switches:
$ perf stat ./cgo.test -test.bench "RTTSeqServerCGo10" -test.benchtime 10s
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo10-4 10000000 1741 ns/op
PASS
Performance counter stats for './cgo.test -test.bench RTTSeqServerCGo10 -test.benchtime 10s':
24547,864730 task-clock (msec) # 1,280 CPUs utilized
1 968 952 context-switches # 0,080 M/sec
745 cpu-migrations # 0,030 K/sec
1 824 page-faults # 0,074 K/sec
59 997 399 275 cycles # 2,444 GHz
95 671 377 235 instructions # 1,59 insn per cycle
19 118 977 436 branches # 778,845 M/sec
76 635 472 branch-misses # 0,40% of all branches
19,172156094 seconds time elapsed
$ taskset -c 0,1 perf stat ./cgo.test -test.bench "RTTSeqServerCGo10" -test.benchtime 10s
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo10-2 10000000 1459 ns/op
PASS
Performance counter stats for './cgo.test -test.bench RTTSeqServerCGo10 -test.benchtime 10s':
18226,710279 task-clock (msec) # 1,136 CPUs utilized
865 349 context-switches # 0,047 M/sec
1 729 cpu-migrations # 0,095 K/sec
1 797 page-faults # 0,099 K/sec
45 725 017 537 cycles # 2,509 GHz
85 013 647 314 instructions # 1,86 insn per cycle
17 127 907 786 branches # 939,715 M/sec
72 766 706 branch-misses # 0,42% of all branches
16,045203680 seconds time elapsed
and it was already discussed in the second half of https://github.com/golang/go/issues/21827#issue-256508518 how futex_wake can change the CPU a woken process runs on.
HOWEVER
Setting affinity cannot work in practice: if we imagine a server which spawns a separate goroutine for every request, and the CGo calls have to be made from that request-handling goroutine, we now have N(goroutines) > Ncpu. Since we cannot set the affinity of a goroutine - at least not at present with the public API - the only choice is to let the whole process use all the CPU resources dedicated to the service, and oops -> context switches can be happening all the time.
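For completeness, the closest approximation with today's public API is to lock a goroutine to its OS thread and pin that thread (Linux sketch below, assuming golang.org/x/sys/unix); but as argued above this only helps when you can afford a dedicated goroutine per CPU, so it is not a general answer.
// pinToCPU locks the calling goroutine to its OS thread and restricts that
// thread to a single CPU. Sketch only; call runtime.UnlockOSThread when done.
// Assumes: import "runtime" and "golang.org/x/sys/unix".
func pinToCPU(cpu int) error {
    runtime.LockOSThread() // the goroutine now owns this OS thread
    var set unix.CPUSet
    set.Zero()
    set.Set(cpu)
    return unix.SchedSetaffinity(0, &set) // pid 0 means the calling thread
}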
There is a difference - CGo1 case is like this:
Ah, sorry. I misread the test. I thought it's just cgo calls.
np, and thanks for feedback. Looking forward to what might be done about it.
There have been several projects that were rewritten in C or C++ because of cgo overhead. One example: https://github.com/canonical/dqlite/blob/master/doc/faq.md#why-c When I tried to introduce Go to my team (in 2018), we hit the same problem.
Can somebody try to work on this?
Thank you.
Is there any update on this? I'm running into this problem as well.
For my use case, I'm using https://github.com/cisco/openh264, which is written in C++, to encode every frame that I get from my camera through https://en.wikipedia.org/wiki/Video4Linux.
Following is the output from pprof
There are no updates to this issue. If there are updates, they will be reported here.
Note that your profile is not showing that runtime.cgocall itself is very slow; it's showing that your C code overall is taking up 77% of your program's execution time. The Go profiler won't, by default, tell you anything about C code. If it would be useful, you can try importing github.com/ianlancetaylor/cgosymbolizer to get C code profiling; that is completely unsupported but it may help. It requires that your C code be compiled with the -g option.
Ah I see, I'm sorry for the misunderstanding. I'll try to import github.com/ianlancetaylor/cgosymbolizer. Thanks.
FYI I ran the test in the OP on 1.17 & cgo performance roughly matches go performance (and may be slightly better?)
I also ran navytux's test, and the cgo overhead does still roughly double when GOMAXPROCS>1 and there are Go operations in flight for his RTT tests, but there is no evidence of anything with as large an impact as the OP.
I do think the overhead, such as it is, is likely related to #21827. Goroutines that interact with cgo are likely to end up, via Brownian motion, in the same position as an OS-locked thread: a single G on a single M/P. When that G parks to collect its response, its M is parked in turn. We don't end up in this situation when GOMAXPROCS=1, because cgo can't subtly move the G onto its own M/P; it has to keep squishing the cgo G back into the single P when it comes back from the syscall, so the subsequent G park when it reads off the channel doesn't cause problems.
I think broadly there's an issue where parking an M is intended to happen when there is clearly not enough work around to sustain an entire M, but there are a few cases (OS-locked threads, threads that talk to cgo a lot) where there's good reason for a G to be hanging out on its own. It would probably be better if OS-locked threads never parked their M; then we could OS-lock threads if we are talking to cgo a lot and have a problem with this behavior.
EDIT: Alternatively, Go could just be much more amenable to cramming Gs coming back from a syscall into an existing P, but the performance impact of that seems hard to predict.
Here is a new scenario that makes me quite concerned about performance: Conn Read/Write on Windows, which calls the WSARecv/WSASend APIs and is accounted to runtime.cgocall. This is very basic functionality for web servers, proxies, and other network data-transfer tools.
The low performance shows up when there is a large amount of data, with high CPU usage (10%+ in Task Manager), which can be 10+ times that of similar apps developed in Rust/C.
An alternative to Reading/Writing directly is to use a big enough buffer: by piling up packet data and then doing the Read/Write in one go, it definitely reduces the context-switching costs. But it has disadvantages and does not work 100% of the time.
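A sketch of that buffering workaround for a net.Conn: wrap the connection in large bufio buffers so that many small application-level reads and writes collapse into far fewer WSARecv/WSASend transitions. The buffer size is an assumption to tune, and the writer must be flushed explicitly.
// wrapConn returns buffered reader/writer views of c so that small Read/Write
// calls are batched into fewer syscalls. Sketch only; assumes import "bufio" and "net".
func wrapConn(c net.Conn) (*bufio.Reader, *bufio.Writer) {
    const bufSize = 1 << 20 // 1 MiB; tune for the traffic pattern
    return bufio.NewReaderSize(c, bufSize), bufio.NewWriterSize(c, bufSize)
}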
It is really a pity to see that this is unplanned! Many people have noticed it and researched it. I hope it can draw more attention and that more improvements can be achieved, sooner.
It would probably help if you put together a self-contained example like the one in the OP and opened a new issue. The example in this one no longer produces the results linked in the OP, so I would not expect this issue to be addressed, but your new one might be.
@CannibalVox Thank you for your advice! I have created https://github.com/golang/go/issues/58336.
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.8 darwin/amd64, though I've also verified this happens on Linux.
What operating system and processor architecture are you using (go env)?
Background
CockroachDB internally uses RocksDB (a fork of LevelDB) for data storage. We access RocksDB via a small Cgo wrapper. All writes to RocksDB are performed using WriteBatches, which are first written to a write-ahead log and then inserted into the database. Internally, RocksDB "groups" multiple concurrent write batches together into a single append to the write-ahead log. While investigating performance, we noticed that the size of the groups was smaller than expected.
What did you do?
TL;DR? If write batch grouping is performed in Go, performance is good. If it is performed in C++ (either by code that is nearly identical to the Go grouping, or by RocksDB), performance is bad.
https://github.com/cockroachdb/cockroach/pull/14138 started as an experiment to replace the grouping of write batches in RocksDB with our own grouping in Go, to try to understand the smaller-than-expected group sizes. In addition to fixing the group sizes, it improved performance on one experiment by 100% while reducing latencies. This was unexpected, as the Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB. To point the finger at RocksDB, we reimplemented the grouping of write batches in our Cgo RocksDB wrapper. Performance was equivalent to RocksDB performing the grouping.
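For readers unfamiliar with the pattern, here is an illustrative sketch of Go-side write-batch grouping (not the CockroachDB code): concurrent committers queue their batches, one of them becomes the leader, drains the queue, and commits the whole group with a single cgo call, amortizing the per-call cgo and scheduler overhead. commitGroupCgo is a placeholder for that single cgo call, and the snippet assumes import "sync".
type pendingBatch struct {
    repr []byte
    done chan error
}

type batchGrouper struct {
    mu      sync.Mutex
    pending []*pendingBatch
    leading bool
}

func (g *batchGrouper) commit(repr []byte) error {
    b := &pendingBatch{repr: repr, done: make(chan error, 1)}

    g.mu.Lock()
    g.pending = append(g.pending, b)
    if g.leading {
        // A leader is already committing; it will pick up our batch.
        g.mu.Unlock()
        return <-b.done
    }
    g.leading = true
    g.mu.Unlock()

    for {
        g.mu.Lock()
        group := g.pending
        g.pending = nil
        if len(group) == 0 {
            g.leading = false
            g.mu.Unlock()
            return <-b.done // our own result was delivered in an earlier round
        }
        g.mu.Unlock()

        err := commitGroupCgo(group) // one cgo call commits the whole group
        for _, pb := range group {
            pb.done <- err
        }
    }
}

// commitGroupCgo stands in for the single cgo call (e.g. a RocksDB write of the
// concatenated WriteBatch reprs) that commits every batch in the group.
func commitGroupCgo(group []*pendingBatch) error {
    return nil
}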
I've provided a stripped down reproduction scenario at https://github.com/petermattis/batchtest. Running on my Mac laptop I see:
By default, batchtest uses 100 concurrent worker threads writing "batches". My suspicion is that batchtest and CockroachDB are tickling some badness in the Go scheduler. If I set GOMAXPROCS to the number of concurrent workers, the cgo performance gets much closer to the Go performance:
https://github.com/cockroachdb/cockroach/pull/14138 is an acceptable workaround for committing batches, but it would be great to understand the performance discrepancy here. We're concerned about other Cgo operations in CockroachDB that don't have such easy workarounds.