There's not really a whole lot of information here: you said you'd previously looked at it in pprof, so what did you run with pprof, and can you link the output? I'd also consider changing the package in the title to net; for all we know, cgo could be working correctly but net is calling it too many times.

Does this reproduce when downloading large files over HTTP?
I have constructed a short, simple proxy that can help reproduce the issue: tunnel.go. If you want to `pprof` it, drop pprof.go in the same folder and compile with `-tags=pprof`.
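The attached source is not reproduced here; a minimal sketch of such a tunnel, assuming flags matching the usage text below (error handling trimmed), could look like:

```go
package main

import (
	"flag"
	"log"
	"net"
)

// pump copies src to dst through a fixed-size buffer. An explicit loop
// is used so the buffer size is honored; io.CopyBuffer would bypass it
// because *net.TCPConn implements io.ReaderFrom.
func pump(dst, src net.Conn, size int) {
	buf := make([]byte, size)
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return
			}
		}
		if err != nil {
			return
		}
	}
}

func main() {
	localAddr := flag.String("localaddr", "127.0.0.1:6699", "Local address to listen to.")
	remoteAddr := flag.String("remoteaddr", "", "Remote address to connect to.")
	buffSize := flag.Int("buffsize", 8192, "Specify connection buffer size.")
	flag.Parse()

	ln, err := net.Listen("tcp", *localAddr)
	if err != nil {
		log.Fatal(err)
	}
	for {
		local, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func() {
			defer local.Close()
			remote, err := net.Dial("tcp", *remoteAddr)
			if err != nil {
				log.Print(err)
				return
			}
			defer remote.Close()
			// On Windows, every Read/Write below ends up in
			// WSARecv/WSASend via runtime.cgocall, which is what
			// the profiles discussed in this thread show.
			go pump(remote, local, *buffSize)
			pump(local, remote, *buffSize)
		}()
	}
}
```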
Because it is a very simple proxy, its functionality is limited too.

```
Usage of tunnel.exe:
  -buffsize int
        Specify connection buffer size. (default 8192)
  -localaddr string
        Local address to listen to. (default "127.0.0.1:6699")
  -remoteaddr string
        Remote address to connect to.
```
Here is one example for testing:

```
tunnel.exe -localaddr=127.0.0.1:443 -remoteaddr=dl.google.com:443 -buffsize=512000
curl.exe -o nul --resolve dl.google.com:443:127.0.0.1 https://dl.google.com/go/go1.20.windows-386.zip
```
You may specify whatever buffsize you prefer. A bigger buffer can reduce the number of cgocall invocations, but it does not eliminate them entirely.

Results for the same high-speed resource on my side:
@CannibalVox

> net is calling it too many times

I think that is necessary, as messages need responses (in time).

> Does this reproduce when downloading large files over HTTP?

Of course; I confirmed it. It shouldn't be related to the contents.
The key is the context-switching cost of cgo calls. That cost seems to be a bit too high, so high-speed Read/Write consumes more resources.
When the profile tells you that a lot of time is spent in `cgocall`, what it is really telling you is that your program is spending most of its time in functions that are not written in Go. The profiler is not able to get a stack trace from functions that are not written in Go, so what you see is the backtrace in Go code that leads up to the non-Go code. That backtrace ends in `cgocall`, because that is the name of the function that calls the non-Go function.
So it's not clear that this is a bug as such. Of course it would be nice to get more complete information. And it is possible that your program is really not spending much time in non-Go code, in which case there may be a bug somewhere.
High CPU usage attributed to `cgocall` can be observed in two scopes: inside the Go program, via `pprof`, and at the OS level. The problem here is at the OS level, compared with programs developed in other languages, for example shadowsocks-rust, which I mentioned above.

They are tested by downloading the same file at the same speed. Using Wireshark, I can see the max packet-data length from shadowsocks-rust is 8192, so there should be the same level of API/cgo calling. Yet with the same buffer size set in the above example, high OS-level CPU usage is observed.
@ianlancetaylor

> When the profile tells you that a lot of time is spent in cgocall what it is really telling you is that your program is spending most of its time in functions that are not written in Go.

I don't think that is the answer here: programs in every language call the OS network APIs. The profiles don't count for much here, though some people may still want to have a glance at them, as they do tell something.

My descriptions may not cover what you are concerned about; please have a try yourself.
> The key is the context-switching cost of cgo calls. That cost seems to be a bit too high, so high-speed Read/Write consumes more resources.
It seems possible that WSARecv is making blocking calls, which puts the syscall over the 1µs timeout for rejoining the P without penalty, and that will create a fair amount of context switching when called repeatedly. I believe Linux and macOS make non-blocking calls for socket read/write and get results via netpoll?

I don't know how I would check that this is the case, though. Obviously net.Conn.Read is blocking, but that doesn't say much about syscall.WSARecv.
Hi @lifenjoiner, do you have numeric CPU usage figures reported by the OS that you could share for your sample program vs. shadowsocks-rust or one of the other programs?
I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound: the vast majority of its runtime should be spent waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.
Measurements by typeperf:

Base info

Both buffer sizes are (should be) 8k.
tunnel.exe

```
curl.exe -o nul -x http://127.0.0.1:7080 <URL>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0   948k      0  0:06:58  0:06:58 --:--:--  948k
```
sslocal.exe

```
curl.exe -o nul --resolve <domain>:443:127.0.0.1 <URL>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0   949k      0  0:06:57  0:06:57 --:--:--  949k
```
Results

% User Time is likely what I saw in Process Explorer.

tunnel.exe

sslocal.exe
Do you have a tool to measure context switching in there?
> I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound: the vast majority of its runtime should be spent waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

Does waiting consume lots of CPU resources?
If you go to View -> Source in the http UI mode of `pprof`, you will see most of the time is spent at:

https://github.com/golang/go/blob/02d8ebda8398342516a24a3e8435d527ed156cab/src/runtime/cgocall.go#L157
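For reference, that view can be reached by loading a CPU profile into the pprof web UI; the profile file name here is illustrative:

```
go tool pprof -http=:8080 cpu.pprof
```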
> Do you have a tool to measure context switching in there?

Currently, no.
I can see Properties -> Threads -> Cycles Delta in Process Explorer:

| Process | Threads | Cycles Delta |
|---|---|---|
| tunnel.exe | 2+ | 50+ M |
| sslocal.exe | 1 | 20 ~ 30 M |
The context-switching rate should be high, per the reference: it discusses \Processor(_Total)\% Privileged Time, while here I show \Process(tunnel)\% Privileged Time and \Process(sslocal)\% Privileged Time, which should tell something.
tunnel.exe
sslocal.exe
All data are included in the attachment above.
> I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound: the vast majority of its runtime should be spent waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

Does waiting consume lots of CPU resources?
No. Rather, when Go enters the syscall state, it takes the OS thread with it. My understanding is that it has 1µs to return before Go spins up another OS thread to handle the goroutines that were previously being handled by the OS thread taken for the syscall. The thing is, if there isn't enough work to sustain an additional OS thread beyond what it was already using (which is most cases, since Go is generally using the correct number of threads), Go will keep spinning OS threads up and shutting them down, causing an obscene number of context switches, which seems to be what's happening here.

I would say that Go's handling of goroutines returning from syscalls is generally poor when it can't put everything back exactly where it found it. Getting to the point where this code maintained two OS threads, given the usage patterns in play, would probably be ideal, even though making these calls non-blocking would be a quick fix.
By the way, there's another case like this (poor decisions about when to spin down OS threads causing insane context switches) that I'm aware of which isn't cgo-related: #21827.

In that case, parking an OS-locked thread with a sleep or a blocking channel read causes the OS thread to be spun down since it isn't doing any work; the goroutine is moved to a thread where it can't run, and then the locked thread has to be awoken when the goroutine unparks, causing two context switches.
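A hypothetical repro sketch of that parking pattern (names and timings are illustrative, not the exact #21827 reproducer):

```go
package main

import (
	"runtime"
	"time"
)

func main() {
	ch := make(chan struct{})
	go func() {
		runtime.LockOSThread() // this goroutine now owns one OS thread
		for range ch {
			// Each receive parks the locked goroutine, so its dedicated
			// thread goes idle and may be spun down by the scheduler.
		}
	}()
	for {
		ch <- struct{}{} // unparking forces a handoff back to the locked
		// goroutine's own thread: extra context switches per message.
		time.Sleep(time.Millisecond)
	}
}
```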
@lifenjoiner Could you collect a runtime trace (https://pkg.go.dev/runtime/trace#Start) from this example? Just collect for a few seconds, otherwise you'll probably get a huge trace file.
@prattmic Many thanks for diving in!

Here is a trace file for 30s while the download speed is stable: trace-30s.zip

Method: https://pkg.go.dev/net/http/pprof#hdr-Usage_examples

```
curl -o trace-30s.out http://localhost:6060/debug/pprof/trace?seconds=30
```

Append: one more that includes the download startup: trace-30s-includes-download-startup.zip
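For reference, the same kind of trace can also be collected in-process with the runtime/trace API that prattmic linked; a minimal sketch (duration and file name are illustrative):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace-30s.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// The real workload (e.g. the tunnel proxy) would run concurrently;
	// this sketch simply traces for 30 seconds and then stops.
	time.Sleep(30 * time.Second)
	trace.Stop()
}
```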
Hey @prattmic, I saw that you were assigned #44343 and that it is scheduled for 1.21. If the goal is to deal with tasks that block Windows users from upgrading Go to the latest version, this is a big one, and while it raises some interesting high-level questions about the way the scheduler handles things, the task as reported may not be that complicated. Network calls on Windows appear to be synchronous, and I believe they need to be made asynchronous in order to stop this behavior. I don't see how it's possible to run a web service on Windows in Go 1.21 without this being fixed.
@lifenjoiner could you run your benchmarks using go tip? Some weeks ago I submitted an improvement that asyncifies some UDP/TCP operations: CL 463839.
@qmuntal Thanks for your work! Following are my benchmarks.

Base Info

- Patched src/internal/poll/fd_windows.go. Confirmed the patch was actually built: introducing a syntax mistake in the file did trigger a compile error.
- set GOARCH=amd64 to build PE64 files, to keep the comparison on the same page.
- --limit-rate is added to my previous download commands to make the tests similar. But the download is NOT at a uniform speed as before: I saw peaks and pauses in the download speed and CPU usage in Process Explorer. So another group of tests with a higher download speed was made.

Results
--limit-rate 1M: 2.753 / 0.293 = 9.4. Previous: 6.367 / 0.543 = 11.7.

tunnel.exe

```
$>curl.exe --limit-rate 1M -o nul --resolve dldir1.qq.com:443:127.0.0.1 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  1036k      0  0:06:22  0:06:22 --:--:-- 1272k
```
sslocal.exe

```
$>curl.exe --limit-rate 1M -o nul -x http://127.0.0.1:7080 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  1033k      0  0:06:23  0:06:23 --:--:-- 1184k
```
--limit-rate 3M: 8.541 / 0.729 = 11.7.

tunnel.exe

```
$>curl.exe --limit-rate 3M -o nul --resolve dldir1.qq.com:443:127.0.0.1 <url>/WeCom_4.1.0.6007.exe
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  3000k      0  0:02:12  0:02:12 --:--:-- 1005k
```
sslocal.exe

```
$>curl.exe --limit-rate 3M -o nul -x http://127.0.0.1:7080 <url>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  387M  100  387M    0     0  3203k      0  0:02:03  0:02:03 --:--:-- 4164k
```
Conclusion

Seems there are still more improvements that can be done.
> Conclusion: Seems there are still more improvements that can be done.
I'm having a hard time interpreting those results. Has there been an improvement on the metric you are using to benchmark this issue? I want to know if CL 463839 goes in the right direction or is something unrelated.
The screenshots are reports generated by typeperf. I've selected the valid duration and % User Time; the average is shown under the scroll bar.

I don't know whether CL 463839 is the right direction. Here are my new benchmarks on the same PC and network: CL463839 vs. NoPatch, both with 10 samples.

My network is not very stable. Maybe you can re-measure them yourself.
Prerequisite

benchmark.bat, the script that I used:

```
@echo off
setlocal ENABLEDELAYEDEXPANSION
pushd %~dp0
::-code-::
for /L %%i in (0, 1, 9) do (
    title %%i %date% %time%
    start /min typeperf -f csv -si 00:01 -cf typeperf-process-counters.txt -o typeperf-process-counters-%%i.csv
    sleepy 2s
    curl.exe --limit-rate 3M -o nul --resolve dldir1.qq.com:443:127.0.0.1 https://dldir1.qq.com/wework/work_weixin/WeCom_4.1.3.6008.exe
    sleepy 2s
    taskkill /f /im typeperf.exe
)
::-code-::
:Exit
popd
endlocal
exit /b
```
You may choose a URL at your own convenience.
typeperf-process-counters.txt, the counters file:

```
\Process(tunnel)\% Privileged Time
\Process(tunnel)\% Processor Time
\Process(tunnel)\% User Time
```

CL463839.exe and NoPatch.exe, each run as:

```
tunnel.exe -localaddr=127.0.0.1:443 -remoteaddr=dldir1.qq.com:443
```
Compile them by yourself. All other files and logs are in the attached zip file.
Data Files
CL463839-Vs-NoPatch_10-samples.zip benchmark.xlsx
Analysis
> I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound: the vast majority of its runtime should be spent waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.

@CannibalVox WSARecv is non-blocking; I have verified that using @lifenjoiner's tunnel.go.
@lifenjoiner your initial guess seems to be right: there are a lot of context switches due to the high number of cgo calls. I haven't seen any low-hanging fruit in net nor internal/poll that could mitigate that, so I'm inclined to say this is a duplicate of #19574.
I don't understand; I thought sub-1µs cgo calls don't cause context switches.

Also, #19574 didn't repro when I tried it a few months ago.
@qmuntal Thanks for your efforts!
Hoping someone will make more progress.
Change https://go.dev/cl/562915 mentions this issue: runtime: don't call lockOSThread for every syscall call on Windows
@lifenjoiner CL 562915 might fix your issue. It removes the LockOSThread call from syscall.Syscall, which is a cause of context switches. Could you give it a try?

To build a Go application using a specific CL, follow these instructions:

```
go install golang.org/dl/gotip@latest
gotip download 562915
gotip build .
```
!!! This seems like a really big deal, @qmuntal, thank you very much!
Seems better. Great work! @qmuntal
If I read your charts correctly, CL 562915 reduced User Time from ~6% to ~2% and Privileged Time from ~50% to ~6%. Given that the Privileged Time is now way less than the high-context-switch threshold (40% according to https://github.com/golang/go/issues/58336#issuecomment-1420216511), IMO we can close this issue.
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes. It is long-standing.

What operating system and processor architecture are you using (go env)?

go env output (collapsed)

What did you do?

I think this is a new case of https://github.com/golang/go/issues/19574, taking effect on the higher-level Conn and the specific platform Windows. It was suggested to file it as a new issue. This is a long-term user observation.
It is traced into cgocall by pprof (high percentage): on Windows, Conn Read/Write finally calls the APIs WSARecv/WSASend through runtime.cgocall.

It is a very basic functionality for web servers, proxies, and other network data-transferring tools. Low performance pops up when there is a large amount of data, with high CPU usage (for example 10%+, in Task Manager, at the OS level), which can be 10+ times that of similar apps developed in Rust/C.
Since the scenario is network data transferring, I don't have a simple/short example that can reproduce the issue. Anyone who can produce one is appreciated. Proxies are the most typical case (both Read and Write). My example is below in the reply.

Test existing programs if you are interested

~~Maybe we can get started with some high-rated open-source proxies for now.~~ Here I used the client as a local proxy without any upstream proxies, and [wget](https://github.com/lifenjoiner/wget-for-windows/releases) (single thread) to download via it. **Attention**: downloading self-hosted resources on the same machine doesn't help reproduce the difference, as loopback MSS = 65495 is adopted.

go: [xray](https://github.com/XTLS/Xray-core/releases), [v2ray](https://github.com/v2fly/v2ray-core/releases), and [glider](https://github.com/nadoo/glider/releases).

```
xray.exe -config direct.json
```

```
v2ray.exe run -config direct.json
```

```
wget.exe -c -O nul -e https_proxy=127.0.0.1:5080
```

Mitigation

An alternative to calling Read/Write directly is using a big enough buffer. By piling up packet data and then doing Read/Write in one go, it definitely reduces the context-switching cost. But it has disadvantages and does not work 100%:
My experimental local proxy with a big buffer.
I have tried ReadFrom, and observed that 4k-length packets are transferred on the loopback interface, triggering high CPU usage.
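A minimal sketch of what that attempt might have looked like (the helper name is hypothetical; imports as in the tunnel sketch above): io.Copy delegates to dst.ReadFrom(src) whenever dst implements io.ReaderFrom, as *net.TCPConn does, so no user-supplied buffer is involved on that path.

```go
// relay copies src to dst; with a *net.TCPConn destination, io.Copy
// takes the ReadFrom path instead of a caller-controlled buffer.
func relay(dst, src net.Conn) error {
	_, err := io.Copy(dst, src)
	return err
}
```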
What did you expect to see?

Lower CPU usage, at the OS level.
What did you see instead?
High CPU usage.