golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: `cgocall` is low performance for Conn on Windows #58336

Closed lifenjoiner closed 6 months ago

lifenjoiner commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.20 windows/386

Does this issue reproduce with the latest release?

Yes. It is long-standing.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=auto
set GOARCH=386
set GOBIN=D:\build\go\GOPATH\bin
set GOCACHE=D:\build\go\GOCACHE
set GOENV=C:\Users\HYX\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=386
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=D:\build\go\GOPATH\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=D:\build\go\GOPATH
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=D:\build\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=D:\build\go\pkg\tool\windows_386
set GOVCS=
set GOVERSION=go1.20
set GCCGO=gccgo
set GO386=sse2
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=0
set GOMOD=
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m32 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=C:\Users\HYX\AppData\Local\Temp\go-build3874135637=/tmp/go-build -gno-record-gcc-switches

What did you do?

I think this is a new case of https://github.com/golang/go/issues/19574, taking effect on the higher-level Conn and specifically on Windows. It was suggested that I file it as a new issue.

This is a long-term user observation.

It is traced into cgocall by pprof (high percentage): on Windows, Conn Read/Write ultimately calls the WSARecv/WSASend API via runtime.cgocall.

It is a very basic functionality for web servers, proxies, and other network data transferring tools.

Low performance shows up when there is a large amount of data, with high CPU usage (for example 10%+ in Task Manager, at the OS level), which can be 10+ times that of similar apps developed in Rust/C.

Since the scenario is network data transfer, I don't have a simple/short example that can reproduce the issue. Anyone who can create one is appreciated.

Proxies are the most typical case (both Read/Write). My example is below in a reply.

Test existing programs if you are interested

~~Maybe we can get started with some high-rated open source proxies for now.~~ Here I used each client as a local proxy without any upstream proxies, and [wget](https://github.com/lifenjoiner/wget-for-windows/releases) (single thread) to download via it. **Attention**: downloading self-hosted resources on the same machine doesn't help reproduce the difference. Loopback MSS = 65495 is adopted.

go: [xray](https://github.com/XTLS/Xray-core/releases), [v2ray](https://github.com/v2fly/v2ray-core/releases), and [glider](https://github.com/nadoo/glider/releases).

```
xray.exe -config direct.json
```

```
v2ray.exe run -config direct.json
```

```
wget.exe -c -O nul -e https_proxy=127.0.0.1:5080
```

Config file: [direct.json.zip](https://github.com/golang/go/files/10610846/direct.json.zip).

```
glider.exe -listen=http://127.0.0.1:6080
```

rust: [shadowsocks-rust](https://github.com/shadowsocks/shadowsocks-rust/releases).

```
sslocal.exe -b 127.0.0.1:7080 --protocol=http
```

Mitigation

An alternative to calling Read/Write directly is to use a big enough buffer. Piling up packet data and then doing Read/Write in one call definitely reduces the context-switching cost. But it has disadvantages and does not work 100% of the time:

My experimental local proxy with big buffer.

I have tried ReadFrom, and observed that 4k-length packets are transferred on the loopback interface, triggering high CPU usage.
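To see why buffer size matters, here is a small self-contained sketch (not from the issue; `countingReader`, `nopWriter`, and `copyWithBuf` are illustrative names) that counts how many Read calls a copy needs. On a real Conn, each of those Reads would end in one WSARecv via cgocall:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// countingReader counts Read calls; on a real net.Conn each Read
// would end in one WSARecv syscall on Windows.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// nopWriter deliberately does not implement io.ReaderFrom, so that
// io.CopyBuffer really uses the buffer we hand it.
type nopWriter struct{}

func (nopWriter) Write(p []byte) (int, error) { return len(p), nil }

// copyWithBuf copies a 1 MiB payload with the given buffer size and
// returns how many Read calls it took (including the final EOF read).
func copyWithBuf(size int) int {
	src := &countingReader{r: bytes.NewReader(make([]byte, 1<<20))}
	if _, err := io.CopyBuffer(nopWriter{}, src, make([]byte, size)); err != nil {
		panic(err)
	}
	return src.calls
}

func main() {
	fmt.Println("8 KiB buffer:", copyWithBuf(8<<10), "Read calls")     // 129
	fmt.Println("512 KiB buffer:", copyWithBuf(512<<10), "Read calls") // 3
}
```

The ~64x drop in call count is the whole mitigation; it trades memory and latency for fewer syscalls.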

What did you expect to see?

Lower CPU usage at the OS level.

What did you see instead?

High CPU usage.

CannibalVox commented 1 year ago

There's not really a whole lot of information here. You said you'd previously looked at it in pprof: what did you run with pprof, and can you link the output? I'd also consider changing the package in the title to net, since for all we know cgo could be working correctly but net is calling it too many times.

CannibalVox commented 1 year ago

Does this reproduce when downloading large files from http?

lifenjoiner commented 1 year ago

I have constructed a short simple proxy, that can help reproduce the issue.

tunnel.go source code:

```go
package main

import (
	"flag"
	"log"
	"net"
	"sync"
	"time"
)

const GlobalTimeout = 15 * time.Second

type Config struct {
	BuffSize   int
	LocalAddr  string
	RemoteAddr string
}

func parseConfig() *Config {
	conf := &Config{}
	flag.IntVar(&conf.BuffSize, "buffsize", 8192, "Specify connection buffer size.")
	flag.StringVar(&conf.LocalAddr, "localaddr", "127.0.0.1:6699", "Local address to listen to.")
	flag.StringVar(&conf.RemoteAddr, "remoteaddr", "", "Remote address to connect to.")
	flag.Parse()
	return conf
}

func ListenAndServe(cfg *Config) {
	l, err := net.Listen("tcp", cfg.LocalAddr)
	if err != nil {
		log.Printf("[tcp] failed to listen on %s: %v\n", cfg.LocalAddr, err)
		return
	}
	defer l.Close()
	log.Printf("[tcp] listening on %s\n", cfg.LocalAddr)
	for {
		c, err := l.Accept()
		if err != nil {
			log.Printf("[tcp] failed to accept: %v\n", err)
			continue
		}
		go Serve(c, cfg)
	}
}

func Serve(c net.Conn, cfg *Config) {
	var wg sync.WaitGroup
	var LrErr, LwErr, RrErr, RwErr error
	defer c.Close()
	LeftTimeout := 2 * GlobalTimeout
	RightTimeout := GlobalTimeout
	cr, err := net.DialTimeout("tcp", cfg.RemoteAddr, GlobalTimeout)
	if err != nil {
		return
	}
	defer cr.Close()
	wg.Add(1)
	go func() {
		defer wg.Done()
		var n int
		LeftBuf := make([]byte, cfg.BuffSize)
		for {
			c.SetDeadline(time.Now().Add(LeftTimeout))
			n, LrErr = c.Read(LeftBuf)
			if LrErr == nil {
				cr.SetDeadline(time.Now().Add(RightTimeout))
				_, RwErr = cr.Write(LeftBuf[0:n])
			}
			if LrErr != nil || LwErr != nil || RrErr != nil || RwErr != nil {
				c.SetDeadline(time.Now())
				cr.SetDeadline(time.Now())
				break
			}
		}
	}()
	RightBuf := make([]byte, cfg.BuffSize)
	for {
		var n int
		cr.SetDeadline(time.Now().Add(RightTimeout))
		n, RrErr = cr.Read(RightBuf)
		if RrErr == nil {
			c.SetDeadline(time.Now().Add(LeftTimeout))
			_, LwErr = c.Write(RightBuf[0:n])
		}
		if LrErr != nil || LwErr != nil || RrErr != nil || RwErr != nil {
			c.SetDeadline(time.Now())
			cr.SetDeadline(time.Now())
			break
		}
	}
	wg.Wait()
	if RrErr != nil {
		log.Printf("Remote error: %v\n", RrErr)
	} else if RwErr != nil {
		log.Printf("Remote error: %v\n", RwErr)
	}
}

func main() {
	cfg := parseConfig()
	if cfg.RemoteAddr == "" {
		log.Printf("Invalid remoteaddr!")
		return
	}
	ListenAndServe(cfg)
}
```

If you want to pprof, drop pprof.go in the same folder, and compile with -tags=pprof.

Because it is a very simple proxy, its functionality is limited too.

Usage of tunnel.exe:
  -buffsize int
        Specify connection buffer size. (default 8192)
  -localaddr string
        Local address to listen to. (default "127.0.0.1:6699")
  -remoteaddr string
        Remote address to connect to.

Here is one example for testing:

tunnel.exe -localaddr=127.0.0.1:443 -remoteaddr=dl.google.com:443 -buffsize=512000
curl.exe -o nul --resolve dl.google.com:443:127.0.0.1 https://dl.google.com/go/go1.20.windows-386.zip

You may specify whatever buffsize you prefer. A bigger buffer can reduce the number of cgocall invocations, but does not eliminate the problem.

Results for the same high-speed resource on my side:

graph-sample-8kBuffer-CPUHigh.png
![graph-sample-8kBuffer-CPUHigh](https://user-images.githubusercontent.com/19585474/216949348-fea23fdc-1b99-4fe7-b97c-16443fb883aa.png)
graph-sample-500kBuffer-CPULow.png
![graph-sample-500kBuffer-CPULow](https://user-images.githubusercontent.com/19585474/216949500-ebe0e257-aa7d-4577-9b22-f9b851db25ba.png)


@CannibalVox

net is calling it too many times

I think that would be necessary, as messages need responses (in time).

Does this reproduce when downloading large files from http?

Of course. It shouldn't be related to the contents. I also confirmed it.


The key is the context-switching cost of cgo calls. It seems to be a bit too high, so high-speed Read/Write costs more resources.

ianlancetaylor commented 1 year ago

When the profile tells you that a lot of time is spent in cgocall what it is really telling you is that your program is spending most of its time in functions that are not written in Go. The profiler is not able to get a stack trace from functions that are not written in Go, so what you see is the backtrace in Go code that leads up to the non-Go code. That backtrace ends in cgocall, because that is the name of the function that calls the non-Go function.

So it's not clear that this is a bug as such. Of course it would be nice to get more complete information. And it is possible that your program is really not spending much time in non-Go code, in which case there may be a bug somewhere.

lifenjoiner commented 1 year ago

High cgocall CPU usage can be measured at two levels:

  1. program level, by pprof,
  2. OS level, by task manager.

The problem here is at the OS level, compared to programs developed in other languages, for example shadowsocks-rust, which I mentioned above. They are tested by downloading the same file at the same speed. Using Wireshark, I can see the max packet data length from shadowsocks-rust is 8192, so there should be the same level of API/cgo calling. Yet with the same buffer size set in the above example, high OS-level CPU usage is observed.

@ianlancetaylor

When the profile tells you that a lot of time is spent in cgocall what it is really telling you is that your program is spending most of its time in functions that are not written in Go.

This is not the answer here, I think. Programs in all languages call the OS network APIs. The profiles don't really count here, though some people may still want a glance, as they tell something.


My description may not cover your concerns; please have a try yourself.

CannibalVox commented 1 year ago

The key is the contexts switching cost of cgo calling. That seems to be a bit too much, resulting high speed Read/Write cost more resources.

It seems like it's possible that WSARecv is making blocking calls which puts the syscall over the 1us timeout to rejoin the P without penalty, which will create a fair amount of context switching when called repeatedly. I believe linux & mac make non-blocking calls for socket read/write and get results via netpoll?

I don't know how I would check that this is the case, though. Obviously net.Conn.Read is blocking, but that doesn't say much about syscall.WSARecv.

thepudds commented 1 year ago

Hi @lifenjoiner, do you have numeric CPU usage numbers reported by the OS you could share for your sample program vs maybe shadowsocks-rust or one of the other programs?

CannibalVox commented 1 year ago

I will say that if WSARecv was non-blocking, I would expect the vast majority of this program to be spent in netpoll rather than in syscall, right? The program is network, the vast majority of its runtime should be waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

lifenjoiner commented 1 year ago

Measurements by typeperf:

Base info

Both buffer sizes are (should be) 8k.

tunnel.exe

curl.exe -o nul -x http://127.0.0.1:7080 <URL>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0   948k      0  0:06:58  0:06:58 --:--:--  948k

sslocal.exe

curl.exe -o nul --resolve <domain>:443:127.0.0.1 <URL>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0   949k      0  0:06:57  0:06:57 --:--:--  949k

Results

% User Time is likely to be what I saw in Process Explorer.

tunnel.exe tunnel

sslocal.exe sslocal

Raw data: typeperf-process-counters.tunnel-vs-sslocal.zip

CannibalVox commented 1 year ago

Do you have a tool to measure context switching in there?

lifenjoiner commented 1 year ago

I will say that if WSARecv was non-blocking, I would expect the vast majority of this program to be spent in netpoll rather than in syscall, right? The program is network, the vast majority of its runtime should be waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

Does waiting consume lots of CPU resource?

If you go to View -> Source in the http UI mode of pprof, you will see most of the time is spent at: https://github.com/golang/go/blob/02d8ebda8398342516a24a3e8435d527ed156cab/src/runtime/cgocall.go#L157

Do you have a tool to measure context switching in there?

Currently, no. I can see Properties -> Threads -> Cycles Delta in Process Explorer:

| Process | Threads | Cycles Delta |
| --- | --- | --- |
| tunnel.exe | 2+ | 50+ M |
| sslocal.exe | 1 | 20 ~ 30 M |

lifenjoiner commented 1 year ago

The context-switching rate should be high, referring to:

You can determine whether context switching is excessive by comparing it with the value of Processor\% Privileged Time. If this counter is at 40 percent or more and the context-switching rate is high, then you can investigate the cause for high rates of context switches.

It says \Processor(_Total)\% Privileged Time; here I show \Process(tunnel)\% Privileged Time and \Process(sslocal)\% Privileged Time, which should tell something.

tunnel.exe tunnel-2

sslocal.exe sslocal-2

All data are included in the attachment above.

CannibalVox commented 1 year ago

I will say that if WSARecv was non-blocking, I would expect the vast majority of this program to be spent in netpoll rather than in syscall, right? The program is network, the vast majority of its runtime should be waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

Does waiting consume lots of CPU resource?

No. Rather, when Go goes into the syscall state, it takes the OS thread with it. My understanding is that it has 1us to return before Go spins up another OS thread to handle the goroutines that were previously being handled by the OS thread taken for the syscall. The thing is, if there's not enough work to sustain an additional OS thread beyond what it was previously using (which is most cases, since Go is generally using the correct number of threads), Go will keep spinning up and shutting down OS threads, causing an obscene amount of context switches, which seems to be what's happening here.

I would say that Go's handling of goroutines returning from syscalls is generally poor when it can't put everything back exactly where it found it. Getting to the point where this code maintained two OS threads, given the usage patterns in play, would probably be ideal, even though making these calls non-blocking would be a quick fix.

CannibalVox commented 1 year ago

btw there's another case like this (poor decisions about when to spin down os threads cause insane context switches) I'm aware of that isn't cgo related: #21827

In that case, parking an os locked thread with sleep or a blocking channel read will cause the os thread to be spun down since it isn't doing any work, the goroutine is moved to a thread where it can't run, and then the locked thread has to be awoken when the goroutine unparks, causing two context switches.

prattmic commented 1 year ago

@lifenjoiner Could you collect a runtime trace (https://pkg.go.dev/runtime/trace#Start) from this example? Just collect for a few seconds, otherwise you'll probably get a huge trace file.

lifenjoiner commented 1 year ago

@prattmic Many thanks for diving in!

Here is a trace file for 30s while downloading is stable. trace-30s.zip

Method: https://pkg.go.dev/net/http/pprof#hdr-Usage_examples

curl -o trace-30s.out http://localhost:6060/debug/pprof/trace?seconds=30

Append: One more includes download startup: trace-30s-includes-download-startup.zip

CannibalVox commented 1 year ago

Hey @prattmic, I saw that you were assigned #44343 and it is scheduled for 1.21. If the goal is to deal with tasks that block Windows users from upgrading Go to the latest version, this is a big one. While it raises some interesting high-level questions about the way the scheduler handles things, the task as reported may not be that complicated. Network calls on Windows appear to be synchronous, and I believe they need to be made asynchronous to stop this behavior. I don't see how it's possible to run a web service on Windows in Go 1.21 without this being fixed.

qmuntal commented 1 year ago

@lifenjoiner could you run your benchmarks using go tip? Some weeks ago I submitted an improvement that asyncifies some UDP/TCP operations: CL 463839.

lifenjoiner commented 1 year ago

@qmuntal Thanks for your work! Following are my benchmarks.

Base Info

Results

  1. --limit-rate 1M, 2.753 / 0.293 = 9.4. Previous: 6.367 / 0.543 = 11.7.

tunnel.exe

$>curl.exe --limit-rate 1M -o nul --resolve dldir1.qq.com:443:127.0.0.1 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  1036k      0  0:06:22  0:06:22 --:--:-- 1272k

tunnel-d1mps-u-20230418

sslocal.exe

$>curl.exe --limit-rate 1M -o nul -x http://127.0.0.1:7080 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  1033k      0  0:06:23  0:06:23 --:--:-- 1184k

sslocal-d1mps-u-20230418

  2. --limit-rate 3M, 8.541 / 0.729 = 11.7.

tunnel.exe

$>curl.exe --limit-rate 3M -o nul --resolve dldir1.qq.com:443:127.0.0.1 <url>/WeCom_4.1.0.6007.exe
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  3000k      0  0:02:12  0:02:12 --:--:-- 1005k

tunnel-d3mps-u-20230418

sslocal.exe

$>curl.exe --limit-rate 3M -o nul -x http://127.0.0.1:7080 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  3203k      0  0:02:03  0:02:03 --:--:-- 4164k

sslocal-d3mps-u-20230418

Conclusion

It seems there are still more improvements that can be done.

qmuntal commented 1 year ago

Conclusion: It seems there are still more improvements that can be done.

I'm having a hard time interpreting those results. Has there been an improvement on the metric you are using to benchmark this issue? I want to know if CL 463839 goes in the right direction or is something unrelated.

lifenjoiner commented 1 year ago

The screenshots are reports generated by typeperf. I've selected the valid duration and % User Time. It shows the average under the scroll bar.

I don't know if CL 463839 is the right direction. Here are my new benchmarks on the same PC and network. CL463839 Vs NoPatch. Both with 10 samples.

My network is not very stable. Maybe you can remeasure them yourself.

Prerequisite

  1. benchmark.bat. The script that I used.
@echo off
setlocal ENABLEDELAYEDEXPANSION
pushd %~dp0

::-code-::

for /L %%i in (0, 1, 9) do (
    title %%i %date% %time%
    start /min typeperf -f csv -si 00:01 -cf typeperf-process-counters.txt -o typeperf-process-counters-%%i.csv
    sleepy 2s
    curl.exe --limit-rate 3M -o nul --resolve dldir1.qq.com:443:127.0.0.1 https://dldir1.qq.com/wework/work_weixin/WeCom_4.1.3.6008.exe
    sleepy 2s
    taskkill /f /im typeperf.exe
)

::-code-::

:Exit
popd
endlocal
exit /b

You may choose a URL at your convenience.

  2. typeperf-process-counters.txt
\Process(tunnel)\% Privileged Time
\Process(tunnel)\% Processor Time
\Process(tunnel)\% User Time
  3. CL463839.exe and NoPatch.exe

tunnel.go

tunnel.exe -localaddr=127.0.0.1:443 -remoteaddr=dldir1.qq.com:443

Compile them by yourself. All other files and logs are in the attached zip file.

Data Files

CL463839-Vs-NoPatch_10-samples.zip benchmark.xlsx

Analysis

CL463839-Vs-NoPatch

qmuntal commented 1 year ago

I will say that if WSARecv was non-blocking, I would expect the vast majority of this program to be spent in netpoll rather than in syscall, right? The program is network, the vast majority of its runtime should be waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

@CannibalVox WSARecv is non-blocking; I have verified that using @lifenjoiner's tunnel.go.

@lifenjoiner your initial guess seems to be right: there are a lot of context switches due to the high number of cgo calls. I haven't seen any low-hanging fruit in net or internal/poll that could mitigate that, so I'm inclined to say that this is a duplicate of #19574.

CannibalVox commented 1 year ago

I don't understand, I thought sub-1us cgo calls don't cause context switches

CannibalVox commented 1 year ago

Also #19574 didn't repro when I tried it a few months ago

lifenjoiner commented 1 year ago

@qmuntal Thanks for your efforts!

Hoping someone will make more progress.

gopherbot commented 8 months ago

Change https://go.dev/cl/562915 mentions this issue: runtime: don't call lockOSThread for every syscall call on Windows

qmuntal commented 7 months ago

@lifenjoiner CL 562915 might fix your issue. It removes the LockOSThread call from syscall.Syscall, which is a cause of context switches. Could you give it a try?

To build a Go application using a specific commit, follow these instructions:

  1. run `go install golang.org/dl/gotip@latest`
  2. run `gotip download 562915`
  3. run `gotip build .`

lifenjoiner commented 7 months ago

CL562915 CL562915.xlsx

Seems better. Great work! @qmuntal

CannibalVox commented 7 months ago

!!! This seems like a really big deal, @qmuntal, thank you very much!

qmuntal commented 6 months ago

Seems better. Great work! @qmuntal

If I read your charts correctly, CL 562915 reduced User Time from ~6% to ~2% and Privileged Time from ~50% to ~6%. Given that the Privileged Time is now way less than the high-context-switch threshold (40% according to https://github.com/golang/go/issues/58336#issuecomment-1420216511), IMO we can close this issue.