petermattis opened 7 years ago
I haven't looked at the code. Is this all calls from Go into C? Are there any callbacks from C to Go?
You say that the "Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB". Combining that with the number of operations being done per second, would it be accurate to say that the C code itself, after called from Go, takes more than 20 microseconds to run?
In Go 1.8, when a goroutine calls into C, it is still holding a GOMAXPROCS slot and blocking other goroutines from running. Only if it is running in C for more than 20 microseconds or so will the system monitor thread decide that it is blocked in C code and activate another goroutine. The fact that your system performs better when you increase GOMAXPROCS makes me suspect that there is something to improve in that area of the code.
Is this all calls from Go into C? Are there any callbacks from C to Go?
The code is all calls from Go into C. There are no callbacks from C to Go. This is true of both the reproduction code and CockroachDB.
You say that the "Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB". Combining that with the number of operations being done per second, would it be accurate to say that the C code itself, after called from Go, takes more than 20 microseconds to run?
Yes. The actual RocksDB write batch operation involves a disk write, sometimes followed by an fsync. batchtest simulates this with a 5 millisecond sleep.
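For a rough idea of the shape of the workload, here is a minimal sketch (not the actual batchtest source): many workers each repeatedly make one cgo call that blocks for ~5ms in C, standing in for the WAL write and occasional fsync.
package main

// #include <unistd.h>
// static void commitBatch(void) { usleep(5000); } /* ~5ms in C, simulating the WAL write + occasional fsync */
import "C"

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

func main() {
    const workers = 100 // batchtest defaults to 100 concurrent workers
    var ops int64
    var wg sync.WaitGroup
    start := time.Now()
    deadline := start.Add(10 * time.Second)
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for time.Now().Before(deadline) {
                C.commitBatch() // one cgo call per "commit", blocking ~5ms in C
                atomic.AddInt64(&ops, 1)
            }
        }()
    }
    wg.Wait()
    fmt.Printf("%.1f ops/sec\n", float64(atomic.LoadInt64(&ops))/time.Since(start).Seconds())
}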
Only if it is running in C for more than 20 microseconds or so will the system monitor thread decide that it is blocked in C code and activate another goroutine.
I've been perusing the Go scheduler code. The 20us time is the minimum time sysmon will sleep for. The maximum delay is 10ms, and a small bit of instrumentation (measuring the time around the usleep call) showed an actual maximum delay of 17ms, though I'm not sure how often that occurs.
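For reference, the sleep logic in sysmon has roughly this shape (paraphrased from runtime/proc.go, not the verbatim source; details vary between releases):
// Paraphrased shape of the sysmon polling loop in runtime/proc.go.
delay := uint32(0)
idle := 0 // consecutive iterations with nothing to do
for {
    if idle == 0 {
        delay = 20 // start with a 20µs sleep
    } else if idle > 50 {
        delay *= 2 // after ~1ms of idling, start doubling the sleep
    }
    if delay > 10*1000 {
        delay = 10 * 1000 // cap at 10ms -- the maximum delay mentioned above
    }
    usleep(delay)
    // ... retake Ps blocked in syscalls/cgo, poll the network, etc. ...
}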
Changing the max sysmon delay to 1ms results in a significant improvement to batchtest:
~/Development/go/src/github.com/petermattis/batchtest master ./batchtest -t cgo
_elapsed____ops/sec
1s 7493.6
2s 7279.6
3s 7217.8
4s 7498.1
5s 7898.1
6s 7899.2
7s 7799.3
8s 7798.7
9s 7698.1
10s 7395.3
Unfortunately, this didn't result in any improvement to the cockroach performance numbers, so perhaps batchtest isn't completely capturing whatever is going on.
Let me chime in. (I came from https://github.com/golang/go/issues/21827#issuecomment-366332843 and the symptoms are similar to what is described here, so I decided to continue this thread instead of creating a new issue. Please correct me if I am wrong.)
First some context:
Similarly to Cockroach, which uses CGo wrappers for RocksDB, I observed performance anomalies while using CGo wrappers for SQLite, where the presence of other goroutines combined with CGo calls on the "main" one shows big overhead: github.com/mattn/go-sqlite3 uses CGo and performs several CGo calls inside one Query or Exec. There was also an interrupt goroutine spawned for every Query or Exec to call sqlite3_interrupt if the context is canceled.
With Go 1.10, avoiding spawning that interrupt goroutine when we know the context cannot be canceled brings a ~1.5x speedup to Exec (https://github.com/mattn/go-sqlite3/pull/530).
However, Query was made faster by only 5%, because after 2b283ce database/sql itself always spawns a goroutine to close Rows on context or transaction cancel. Avoiding spawning the Rows.awaitDone goroutine brings a ~1.5x speedup to Query too (https://github.com/golang/go/issues/23879).
The theme here is that CGo calls show much higher overhead when performed not under a simple only-one-goroutine scenario but in the presence of other goroutines. So let's do some analysis:
First, on my machine I can observe that under a simple 1-goroutine-only scenario the cost of making 1 trivial CGo call is ~60ns; it scales linearly with the number of calls, it is independent of whether GOMAXPROCS is 1 or not, and it is ~35x slower than making a trivial Go call:
CallGo1 1.71ns ± 0%
CallGo1-4 1.71ns ± 0%
CallGo10 16.0ns ± 0%
CallGo10-4 16.1ns ± 0%
CallCGo1 58.3ns ± 1%
CallCGo1-4 58.6ns ± 1%
CallCGo10 626ns ± 0%
CallCGo10-4 611ns ± 5%
Then we can check what happens in the presence of other goroutines by simulating client-server requests over a channel, with the client and the server each running in its own goroutine and the server handling requests sequentially (so 2 goroutines in total), the handler making 1 trivial Go or CGo call:
RTTSeqServerGo1 396ns ± 0%
RTTSeqServerGo1-4 458ns ± 0%
RTTSeqServerCGo1 454ns ± 0%
RTTSeqServerCGo1-4 536ns ± 1%
What can be seen here is that for GOMAXPROCS=1, δ(Go1, CGo1) ≈ t(CGo call); both the Go and CGo RTT times are bigger for GOMAXPROCS!=1, with δ(Go1-4, CGo1-4) also growing to ~80ns.
It is somewhat understandable that processing time increases with GOMAXPROCS!=1 even in the Go-only case, because there can now sometimes be inter-OS-thread communication, so in the context of this issue there is probably no problem here (though those 2 goroutines run almost without overlapping each other, so there should not be a need to put them on different OS threads). However, the fact that δ(Go, CGo) grows compared to the GOMAXPROCS=1 case suggests there might be something else going on. 20ns is small, though, and might be all noise, so let's recheck what happens when the server performs 10 calls in the handler:
RTTSeqServerGo10 419ns ± 1%
RTTSeqServerGo10-4 486ns ± 1%
RTTSeqServerCGo10 1.00µs ± 0%
RTTSeqServerCGo10-4 1.97µs ± 4%
Go10, Go10-4 and CGo10 show a time increase proportional to ~9·t(corresponding call). However, CGo10-4 (i.e. the CGo case with GOMAXPROCS!=1) is now 2x slower than CGo10 (the same case with GOMAXPROCS=1), by ~1µs.
The difference between CGo10 and CGo10-4 can be attributed to inter-OS-thread communication: the strace for the CGo10 case is almost empty and contains only 10ms sleeps from sysmon during the active phase:
1403 14:41:40.757900 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
1403 14:41:40.767974 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
1403 14:41:40.778049 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
...
However, CGo10-4 shows a different picture under strace: during the active phase there is now a constant high-rate churn of futex wait and wake between the 2 OS threads, with 1-10 millisecond sleeps from sysmon also showing up occasionally:
1447 14:42:41.733626 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733631 <... futex resumed> ) = 0
1447 14:42:41.733636 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733645 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447 14:42:41.733664 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733669 <... futex resumed> ) = 0
1447 14:42:41.733674 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733697 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447 14:42:41.733703 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733708 <... futex resumed> ) = 0
1447 14:42:41.733713 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733722 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447 14:42:41.733728 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733733 <... futex resumed> ) = 0
1447 14:42:41.733738 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733747 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447 14:42:41.733753 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452 14:42:41.733758 <... futex resumed> ) = 0
1447 14:42:41.733763 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452 14:42:41.733772 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
...
# seldom
1446 14:42:41.225481 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
1446 14:42:41.235566 <... pselect6 resumed> ) = 0 (Timeout)
1446 14:42:41.235594 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
1446 14:42:41.245656 <... pselect6 resumed> ) = 0 (Timeout)
1446 14:42:41.245670 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
...
This strace suggests that for CGo10-4 the scheduler somehow decided to keep one goroutine on one OS thread and the other goroutine on another OS thread, and the channel communication between them then turns into futex operations, which bring the overhead.
The draft notesleep spinning patch from https://github.com/golang/go/issues/21827#issue-256508518 can mitigate the CGo10-4 problem somewhat:
Go1.10 Go1.10+notesleep-spin
name old time/op new time/op delta
RTTSeqServerGo1 397ns ± 0% 405ns ± 0% +2.01% (p=0.000 n=9+10)
RTTSeqServerGo1-4 461ns ± 1% 462ns ± 0% ~ (p=0.640 n=10+9)
RTTSeqServerCGo1 456ns ± 0% 466ns ± 0% +2.17% (p=0.000 n=9+9)
RTTSeqServerCGo1-4 532ns ± 1% 532ns ± 1% ~ (p=0.751 n=10+10)
RTTSeqServerGo10 419ns ± 1% 424ns ± 0% +1.17% (p=0.000 n=10+10)
RTTSeqServerGo10-4 486ns ± 1% 484ns ± 1% -0.51% (p=0.018 n=10+10)
RTTSeqServerCGo10 1.00µs ± 0% 1.02µs ± 0% +1.44% (p=0.000 n=9+10)
RTTSeqServerCGo10-4 1.97µs ± 4% 1.13µs ± 1% -42.92% (p=0.000 n=10+10)
However, it is not clear why making 10 CGo calls in sequence causes the scheduler to make this decision: the individual CGo call time is well under 20µs - the time, if I understood correctly, after which sysmon decides that something has been stuck in a CGo call or syscall for too long and removes the current OS thread from its GOMAXPROCS slot. The fact that sysmon is sleeping for >1ms in the strace also confirms that sysmon does not intervene here.
I do not know the Go scheduler internals, but clearly making several CGo calls in sequence shakes it enough to bring some disorder and slowdown. For reference, when I was seeing the problem with SQLite, my query time was ~5-6µs when using the Go driver directly, and with the standard
done := make(chan struct{})
go func() {
    select {
    case <-ctx.Done():
        // call some cancel (e.g. sqlite3_interrupt via cgo)
    case <-done:
        // work finished ok
    }
}()
// perform query
close(done)
spawned around the query, with ctx never being canceled, the time increased to ~10µs. In other words, the additional penalty can be not just 1µs but higher.
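A minimal sketch of the workaround from the go-sqlite3 and database/sql changes mentioned above: only spawn the watcher goroutine when the context is actually cancelable. doQuery and interrupt are illustrative placeholders, not real driver functions, and the snippet assumes import "context".
// queryWithCancel only spawns the cancel-watcher goroutine when it is needed.
func queryWithCancel(ctx context.Context, doQuery func() error, interrupt func()) error {
    if ctx.Done() == nil {
        // context.Background()/TODO() can never be canceled, so there is
        // nothing to watch: skip the extra goroutine and the scheduler
        // churn it causes around the cgo calls inside doQuery.
        return doQuery()
    }
    done := make(chan struct{})
    defer close(done)
    go func() {
        select {
        case <-ctx.Done():
            interrupt() // e.g. a cgo call to sqlite3_interrupt
        case <-done:
            // query finished normally
        }
    }()
    return doQuery()
}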
In summary, as @petermattis said, it looks like a sequence of CGo calls triggers some badness in the Go scheduler that would be good to understand and hopefully fix.
Thanks beforehand, Kirill
/cc @ianlancetaylor, @dvyukov, @rsc, @aclements, @bcmills, @mattn
(cgo.go)
package cgotest
// int argadjust(int arg) { return 3 + arg; }
import "C"
// XXX here because cannot use C in tests directly
func cargadjust(arg int) int {
return int(C.argadjust(C.int(arg)))
}
(cgo_test.go)
package cgotest
import "testing"
//go:noinline
func goargadjust(i int) int {
return i + 3
}
// BenchmarkCallGo1 measures overhead of 1 Go call.
func BenchmarkCallGo1(b *testing.B) {
for i := 0; i < b.N; i++ {
x := goargadjust(i)
if x != i + 3 {
b.Fatalf("goargadjust -> %d ; want %d", x, i + 3)
}
}
}
// BenchmarkCallGo10 measures overhead of 10 Go calls performed sequentially.
func BenchmarkCallGo10(b *testing.B) {
for i := 0; i < b.N; i++ {
for j := 0; j < 10; j++ {
x := goargadjust(i)
if x != i + 3 {
b.Fatalf("goargadjust -> %d ; want %d", x, i + 3)
}
}
}
}
// BenchmarkCallCGo1 measures overhead of 1 CGo call.
func BenchmarkCallCGo1(b *testing.B) {
for i := 0; i < b.N; i++ {
x := cargadjust(i)
if x != i + 3 {
b.Fatalf("cargadjust -> %d ; want %d", x, i + 3)
}
}
}
// BenchmarkCallCGo10 measures overhead of 10 CGo calls performed sequentially.
func BenchmarkCallCGo10(b *testing.B) {
for i := 0; i < b.N; i++ {
for j := 0; j < 10; j++ {
x := cargadjust(i)
if x != i + 3 {
b.Fatalf("cargadjust -> %d ; want %d", x, i + 3)
}
}
}
}
// ---------
type req struct {
respq chan *resp
arg int
}
type resp struct {
ret int
}
// client performs one request-reply cycle to a server over srvlink.
func client(srvlink chan *req, arg int) int {
rc := make(chan *resp)
srvlink <- &req{
respq: rc,
arg: arg,
}
ret := <-rc
return ret.ret
}
// server receives requests over inq, passes received request to
// handler, and returns result back over response channel.
//
// server runs until it processes all requests from inq.
type server func(inq chan *req, handler func(int) int)
// seqServer sequentially handles requests received over inq.
func seqServer(inq chan *req, handler func(int) int) {
for r := range inq {
r.respq <- &resp{ret: handler(r.arg)}
}
}
// goServer concurrently handles requests received over inq.
func goServer(inq chan *req, handler func(int) int) {
for r := range inq {
r := r
go func() {
r.respq <- &resp{ret: handler(r.arg)}
}()
}
}
// benchmarkRTT measures client-server round-trip latency for a particular
// server implementation.
func benchmarkRTT(b *testing.B, srv func(chan *req)) {
inc := make(chan *req)
go srv(inc)
for i := 0; i < b.N; i++ {
client(inc, i)
}
close(inc)
}
// --------
// mkServerGo1N returns a function that will run srv with N Go calls per request.
func mkServerGo1N(srv server, n int) func(chan *req) {
return func(inq chan *req) {
srv(inq, func(arg int) int {
for i := 0; i < n; i++ {
arg = goargadjust(arg)
}
return arg
})
}
}
// mkServerCGoN returns a function that will run srv with N CGo calls per request.
func mkServerCGoN(srv server, n int) func(chan *req) {
return func(inq chan *req) {
srv(inq, func(arg int) int {
for i := 0; i < n; i++ {
arg = cargadjust(arg)
}
return arg
})
}
}
func BenchmarkRTTSeqServerGo1(b *testing.B) { benchmarkRTT(b, mkServerGo1N(seqServer, 1)) }
//func BenchmarkRTTGoServerGo1(b *testing.B) { benchmarkRTT(b, mkServerGo1N(goServer, 1)) }
func BenchmarkRTTSeqServerCGo1(b *testing.B) { benchmarkRTT(b, mkServerCGoN(seqServer, 1)) }
//func BenchmarkRTTGoServerCGo1(b *testing.B) { benchmarkRTT(b, mkServerCGoN(goServer, 1)) }
func BenchmarkRTTSeqServerGo10(b *testing.B) { benchmarkRTT(b, mkServerGo1N(seqServer, 10)) }
func BenchmarkRTTSeqServerCGo10(b *testing.B) { benchmarkRTT(b, mkServerCGoN(seqServer, 10)) }
//func BenchmarkRTTGoServerCGo10(b *testing.B) { benchmarkRTT(b, mkServerCGoN(goServer, 10)) }
However it is not clear why the fact of making 10 CGo calls in sequence affected the scheduler to make this decision
From the runtime's perspective, both RTTSeqServerCGo1 and RTTSeqServerCGo10 do exactly the same thing -- call a long sequence of cgo calls. From the runtime's perspective the inner loop does not make any difference. So there is something else happening. CPU scaling/turbo/overheating? Have you tried running them for longer? In isolation? With CPU affinity? With CPU scaling turned off?
@dvyukov thanks for feedback.
From run time perspective both RTTSeqServerCGo1 and RTTSeqServerCGo10 do exactly the same -- call a long sequences of cgo calls.
There is a difference - the CGo1 case is like this:
Gclient                    Gserver
srvlink <- req
                           req <- srvlink
                           CGo call x1
                           resp <- ...
resp <- ...
while CGo10 is like:
Gclient                    Gserver
srvlink <- req
                           req <- srvlink
                           CGo call x10
                           resp <- ...
resp <- ...
Every CGo call triggers at least runtime.entersyscall / runtime.exitsyscall, so there is a difference in the runtime call sequences between CGo1 and CGo10: the channel send operations are there in both cases, and the difference is in how many cgo calls are made between the server receiving a request and sending the reply back.
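Roughly, every C.f() call goes through runtime.cgocall, which has approximately this shape (paraphrased, not the verbatim runtime source):
// Paraphrased shape of runtime.cgocall.
func cgocall(fn, arg unsafe.Pointer) int32 {
    entersyscall()               // tell the scheduler this M may block; the P can be retaken
    errno := asmcgocall(fn, arg) // switch to the system stack and run the C function
    exitsyscall()                // reacquire a P; this is where handoffs and futex wakeups can happen
    return errno
}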
I was running the tests with cpupower frequency-set -g performance and monitoring dmesg -w to make sure overheating and CPU throttling were not getting in the way. I also re-verified the timings several times by running each test individually, with a pause (so the CPU can cool down) between different runs.
The only mistake in my previous runs was that turbo was not disabled, so to recheck I now disable it with cpupower frequency-set --min 2.6GHz --max 2.6GHz (2.6GHz is the max non-turbo frequency of my CPU) and show below a complete dump of a relatively long run where the different tests are intermixed with each other, so that if e.g. overheating or something else interferes it should be visible to both CGo1 and CGo10:
$ ./neotest info-local
# Sun, 18 Feb 2018 17:04:37 +0300
# kirr@deco.navytux.spb.ru (2401:5180:0:37::1 192.168.0.2)
# Linux deco 4.14.0-3-amd64 #1 SMP Debian 4.14.13-1 (2018-01-14) x86_64 GNU/Linux
# cpu: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
# cpu[0-3]: freq: intel_pstate/performance [2.60GHz - 2.60GHz]
# cpu[0-3]: idle: intel_idle/menu: POLL(0μs) C1(2μs) C1E(10μs) C3(70μs) C6(85μs) C7s(124μs) C8(200μs) C9(480μs) C10(890μs)
# cpu: WARNING: C-state exit-latency is max 890μs - up to that can add to networked and IPC request-reply latency
# sda: SanDisk X400 M.2 rev 0012 477G
# eth0: Intel Corporation Ethernet Connection I219-LM rev 21
# eth0: features: rx tx sg tso !ufo gso gro !lro rxvlan txvlan !ntuple rxhash ...
# eth0: coalesce: rxc: 3μs/0f/0μs-irq/0f-irq, txc: 0μs/0f/0μs-irq/0f-irq
# eth0: up, speed=1000, mtu=1500, txqlen=1000, gro_flush_timeout=0.000µs
# wlan0: Intel Corporation Wireless 8260 rev 3a
# wlan0: features: !rx !tx sg !tso !ufo gso gro !lro !rxvlan !txvlan !ntuple !rxhash ...
# wlan0: coalesce: rxc: ?, txc: ?
# wlan0: down, speed=?, mtu=1500, txqlen=1000, gro_flush_timeout=0.000µs
# wlan0: WARNING: TSO not enabled - TCP latency with packets > MSS will be poor
# Python 2.7.14
# go version go1.10 linux/amd64
# sqlite 3.22.0 (py mod 2.6.0)
# mysqld Ver 10.1.29-MariaDB-6 for debian-linux-gnu on x86_64 (Debian buildd-unstable)
# neo : v1.8.1-1532-gc0502d41-dirty
# zodb : 5.3.0-11-gde1f24ca0
# zeo : 5.1.1-1-gfdcc2739
# mysqlclient : 1.3.12
# wendelin.core : v0.11-14-gc3cc8a9
$ time bash -c 'for in in `seq 10`; do ./cgo.test -test.bench "RTTSeqServerCGo1.*" -test.benchtime 10s -test.cpu 1,4; done'
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 592 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 640 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1302 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2499 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 592 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 652 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1319 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2387 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 592 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 640 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1303 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2489 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 20000000 604 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 651 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1335 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2464 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 595 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 649 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1329 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2423 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 20000000 601 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 646 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1300 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2392 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 20000000 603 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 640 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1309 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2482 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 20000000 601 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 647 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1313 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2439 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 597 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 647 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1306 ns/op
BenchmarkRTTSeqServerCGo10-4 10000000 2451 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1 30000000 597 ns/op
BenchmarkRTTSeqServerCGo1-4 20000000 656 ns/op
BenchmarkRTTSeqServerCGo10 10000000 1312 ns/op
BenchmarkRTTSeqServerCGo10-4 5000000 2424 ns/op
PASS
real 10m1,226s
user 11m14,922s
sys 0m42,169s
i.e. the timings are stable, there are no outliers, and the summary is:
RTTSeqServerCGo1 597ns ± 1%
RTTSeqServerCGo1-4 647ns ± 1%
RTTSeqServerCGo10 1.31µs ± 2%
RTTSeqServerCGo10-4 2.44µs ± 2%
This covers "scaling/turbo/overheat" and "running for longer".
Now let's check setting affinity. First, just setting GOMAXPROCS=2 does not make much difference compared to the default GOMAXPROCS=4:
$ time bash -c 'for in in `seq 3`; do ./cgo.test -test.bench "RTTSeqServerCGo1.*" -test.benchtime 10s -test.cpu 2; done'
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 647 ns/op
BenchmarkRTTSeqServerCGo10-2 5000000 2682 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 655 ns/op
BenchmarkRTTSeqServerCGo10-2 5000000 2615 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 655 ns/op
BenchmarkRTTSeqServerCGo10-2 5000000 2551 ns/op
PASS
real 1m28,294s
user 1m47,062s
sys 0m14,043s
but the CGo10-2 overhead goes away if I set the test process affinity to only the first two CPUs:
$ time taskset -c 0,1 bash -c 'for in in `seq 3`; do ./cgo.test -test.bench "RTTSeqServerCGo1.*" -test.benchtime 10s; done'
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 608 ns/op
BenchmarkRTTSeqServerCGo10-2 10000000 1364 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 619 ns/op
BenchmarkRTTSeqServerCGo10-2 10000000 1369 ns/op
PASS
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo1-2 20000000 604 ns/op
BenchmarkRTTSeqServerCGo10-2 10000000 1376 ns/op
PASS
real 1m23,806s
user 1m28,831s
sys 0m1,393s
probably because setting affinity removes many context switches:
$ perf stat ./cgo.test -test.bench "RTTSeqServerCGo10" -test.benchtime 10s
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo10-4 10000000 1741 ns/op
PASS
Performance counter stats for './cgo.test -test.bench RTTSeqServerCGo10 -test.benchtime 10s':
24547,864730 task-clock (msec) # 1,280 CPUs utilized
1 968 952 context-switches # 0,080 M/sec
745 cpu-migrations # 0,030 K/sec
1 824 page-faults # 0,074 K/sec
59 997 399 275 cycles # 2,444 GHz
95 671 377 235 instructions # 1,59 insn per cycle
19 118 977 436 branches # 778,845 M/sec
76 635 472 branch-misses # 0,40% of all branches
19,172156094 seconds time elapsed
$ taskset -c 0,1 perf stat ./cgo.test -test.bench "RTTSeqServerCGo10" -test.benchtime 10s
goos: linux
goarch: amd64
BenchmarkRTTSeqServerCGo10-2 10000000 1459 ns/op
PASS
Performance counter stats for './cgo.test -test.bench RTTSeqServerCGo10 -test.benchtime 10s':
18226,710279 task-clock (msec) # 1,136 CPUs utilized
865 349 context-switches # 0,047 M/sec
1 729 cpu-migrations # 0,095 K/sec
1 797 page-faults # 0,099 K/sec
45 725 017 537 cycles # 2,509 GHz
85 013 647 314 instructions # 1,86 insn per cycle
17 127 907 786 branches # 939,715 M/sec
72 766 706 branch-misses # 0,42% of all branches
16,045203680 seconds time elapsed
and it was already discussed in the second half of https://github.com/golang/go/issues/21827#issue-256508518 how futex_wake can change the CPU a woken process runs on.
HOWEVER
Setting affinity cannot work in practice: if we imagine a server which spawns a separate goroutine for every request, and the CGo calls have to be made from that request-handling goroutine, we now have N(goroutines) > Ncpu. Since we cannot set the affinity of a goroutine - at least not at present with the public API - the only choice is to let the whole process use all the CPU resources dedicated to the service, and oops -> context switches can be happening all the time.
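For completeness, the closest approximation with today's public API is to lock a goroutine to its OS thread and pin that thread (Linux sketch below, assuming golang.org/x/sys/unix); but as argued above this only helps when you can afford a dedicated goroutine per CPU, so it is not a general answer.
// pinToCPU locks the calling goroutine to its OS thread and restricts that
// thread to a single CPU. Sketch only; call runtime.UnlockOSThread when done.
// Assumes: import "runtime" and "golang.org/x/sys/unix".
func pinToCPU(cpu int) error {
    runtime.LockOSThread() // the goroutine now owns this OS thread
    var set unix.CPUSet
    set.Zero()
    set.Set(cpu)
    return unix.SchedSetaffinity(0, &set) // pid 0 means the calling thread
}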
There is a difference - CGo1 case is like this:
Ah, sorry. I misread the test. I thought it's just cgo calls.
np, and thanks for feedback. Looking forward to what might be done about it.
There have been several projects that were rewritten in C or C++ because of cgo overhead. One example: https://github.com/canonical/dqlite/blob/master/doc/faq.md#why-c When I tried to introduce Go to my team (in 2018), we hit the same problem.
Can somebody try to work on this?
Thank you.
Is there any update on this? I'm running into this problem as well.
For my use case, I'm using https://github.com/cisco/openh264, which is written in C++, to encode every frame that I get from my camera through https://en.wikipedia.org/wiki/Video4Linux.
Following is the output from pprof
There are no updates to this issue. If there are updates, they will be reported here.
Note that your profile is not showing that runtime.cgocall itself is very slow; it's showing that your C code overall is taking up 77% of your program's execution time. The Go profiler won't, by default, tell you anything about C code. If it would be useful, you can try importing github.com/ianlancetaylor/cgosymbolizer to get C code profiling; that is completely unsupported but it may help. It requires that your C code be compiled with the -g option.
Ah I see, I'm sorry for the misunderstanding. I'll try to import github.com/ianlancetaylor/cgosymbolizer. Thanks.
FYI I ran the test in the OP on 1.17 & cgo performance roughly matches go performance (and may be slightly better?)
I also ran navytux's test, and the cgo overhead does still roughly double when GOMAXPROCS>1 and there are Go operations in flight for his RTT tests, but there is no evidence of anything with as large an impact as the OP.
I do think the overhead, such as it is, is likely related to #21827. Goroutines that interact with cgo are likely to end up, via Brownian motion, in the same position as an OS-locked thread: a single G on a single M/P. When that G parks to collect its response, its M is parked in turn. We don't end up in this situation when GOMAXPROCS=1, because cgo can't subtly move the G onto its own M/P; it has to keep squishing the cgo G back into the single P when it comes back from the syscall, so the subsequent G park when it reads off the channel doesn't cause problems.
I think broadly there's an issue where parking an M is intended to happen when there is clearly not enough work around to sustain an entire M, but there are a few cases (OS-locked threads, threads that talk to cgo a lot) where there's good reason for a G to be hanging out on its own. It would probably be better if OS-locked threads never parked their M; then we could OS-lock threads if we are talking to cgo a lot and have a problem with this behavior.
EDIT: Alternatively, Go could just be much more amenable to cramming Gs coming back from a syscall into an existing P, but the performance impact of that seems hard to predict.
Here is a new scenario that makes me quite concerned about performance: Conn Read/Write on Windows, which calls the WSARecv/WSASend APIs and is accounted to runtime.cgocall. This is very basic functionality for web servers, proxies, and other network data-transfer tools.
The low performance shows up when there is a large amount of data, with high CPU usage (10%+ in Task Manager), which can be 10+ times that of similar apps developed in Rust/C.
An alternative to Reading/Writing directly is to use a big enough buffer: by piling up packet data and then doing the Read/Write in one go, it definitely reduces the context-switching costs. But it has disadvantages and does not work 100% of the time.
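A sketch of that buffering workaround for a net.Conn: wrap the connection in large bufio buffers so that many small application-level reads and writes collapse into far fewer WSARecv/WSASend transitions. The buffer size is an assumption to tune, and the writer must be flushed explicitly.
// wrapConn returns buffered reader/writer views of c so that small Read/Write
// calls are batched into fewer syscalls. Sketch only; assumes import "bufio" and "net".
func wrapConn(c net.Conn) (*bufio.Reader, *bufio.Writer) {
    const bufSize = 1 << 20 // 1 MiB; tune for the traffic pattern
    return bufio.NewReaderSize(c, bufSize), bufio.NewWriterSize(c, bufSize)
}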
It is really a pity to see that this is unplanned! Many people have noticed it and researched it. I hope it can draw more attention and that more improvements can be achieved, sooner.
It would probably help if you put together a self-contained example like the one in the OP and opened a new issue. The example in this one no longer produces the results linked in the OP, so I would not expect this issue to be addressed, but your new one might be.
@CannibalVox Thank you for your advice! I have created https://github.com/golang/go/issues/58336.
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.8 darwin/amd64, though I've also verified this happens on Linux.
What operating system and processor architecture are you using (go env)?
Background
CockroachDB internally uses RocksDB (a fork of LevelDB) for data storage. We access RocksDB via a small Cgo wrapper. All writes to RocksDB are performed using WriteBatches, which are first written to a write-ahead log and then inserted into the database. Internally, RocksDB "groups" multiple concurrent write batches together into a single append to the write-ahead log. While investigating performance, we noticed that the size of the groups was smaller than expected.
What did you do?
TL;DR? If write batch grouping is performed in Go, performance is good. If it is performed in C++ (either by code that is nearly identical to the Go grouping, or by RocksDB), performance is bad.
https://github.com/cockroachdb/cockroach/pull/14138 started as an experiment to replace the grouping of write batches in RocksDB with our own grouping in Go, to try to understand the smaller-than-expected group sizes. In addition to fixing the group sizes, it improved performance on one experiment by 100% while reducing latencies. This was unexpected, as the Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB. To point the finger at RocksDB, we reimplemented the grouping of write batches in our Cgo RocksDB wrapper. Performance was equivalent to RocksDB performing the grouping.
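For readers unfamiliar with the pattern, here is an illustrative sketch of Go-side write-batch grouping (not the CockroachDB code): concurrent committers queue their batches, one of them becomes the leader, drains the queue, and commits the whole group with a single cgo call, amortizing the per-call cgo and scheduler overhead. commitGroupCgo is a placeholder for that single cgo call, and the snippet assumes import "sync".
type pendingBatch struct {
    repr []byte
    done chan error
}

type batchGrouper struct {
    mu      sync.Mutex
    pending []*pendingBatch
    leading bool
}

func (g *batchGrouper) commit(repr []byte) error {
    b := &pendingBatch{repr: repr, done: make(chan error, 1)}

    g.mu.Lock()
    g.pending = append(g.pending, b)
    if g.leading {
        // A leader is already committing; it will pick up our batch.
        g.mu.Unlock()
        return <-b.done
    }
    g.leading = true
    g.mu.Unlock()

    for {
        g.mu.Lock()
        group := g.pending
        g.pending = nil
        if len(group) == 0 {
            g.leading = false
            g.mu.Unlock()
            return <-b.done // our own result was delivered in an earlier round
        }
        g.mu.Unlock()

        err := commitGroupCgo(group) // one cgo call commits the whole group
        for _, pb := range group {
            pb.done <- err
        }
    }
}

// commitGroupCgo stands in for the single cgo call (e.g. a RocksDB write of the
// concatenated WriteBatch reprs) that commits every batch in the group.
func commitGroupCgo(group []*pendingBatch) error {
    return nil
}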
I've provided a stripped down reproduction scenario at https://github.com/petermattis/batchtest. Running on my Mac laptop I see:
By default, batchtest uses 100 concurrent worker threads writing "batches". My suspicion is that batchtest and CockroachDB are tickling some badness in the Go scheduler. If I set GOMAXPROCS to the number of concurrent workers, the cgo performance gets much closer to the Go performance:
https://github.com/cockroachdb/cockroach/pull/14138 is an acceptable workaround for committing batches, but it would be great to understand the performance discrepancy here. We're concerned about other Cgo operations in CockroachDB that don't have such easy workarounds.