This proposal makes no mention (at least directly) of RawSyscall{,6}:
https://golang.org/pkg/syscall/#RawSyscall
So this means we'd end up with three pairs of functions to make a system call, right? raw, normal, and block.
Part of this bug would then be to document the three ways. Currently there are zero docs on the existing two pairs.
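For reference, here is a Linux-specific sketch of how the two existing pairs are used today (this reflects common understanding of the undocumented contract, not official documentation):

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// syscall.Syscall calls runtime.entersyscall, telling the
	// scheduler this thread *may* block, so sysmon can eventually
	// retake its P if it does.
	pid, _, _ := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)

	// syscall.RawSyscall skips entersyscall/exitsyscall entirely.
	// It is only appropriate for calls known not to block (or to
	// block only very briefly), because the runtime cannot
	// reschedule this P while the call is in progress.
	tid, _, _ := syscall.RawSyscall(syscall.SYS_GETTID, 0, 0, 0)

	fmt.Println("pid:", pid, "tid:", tid)
}
```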
> unlike the occasionally seen suggestion that we have a way to make a system call that calls neither runtime.entersyscall nor runtime.entersyscallblock
Don't we have that, though? Isn't that RawSyscall? Sure looks like it. Or am I misreading your comment?
Gah, you're right about RawSyscall. Perhaps nobody uses it as it is not documented. I will delete that paragraph.
I would say the whole syscall package needs more documentation.
Since I forgot about RawSyscall, we could also call this proposed function BlockingSyscall.
@dvyukov, do you have any bright ideas about how we could avoid exposing these two different kinds of syscall? That is, is there anything we could do to make the regular 'entersyscall' faster for gVisor? That would be better all around than exposing yet another kind of syscall API.
If the thread blocks for only a very short time, entersyscall is better, because the work of blocking (handing off the P) is avoided entirely. If the thread blocks for a long time, that can't be happening very often, so the overhead is irrelevant. So this optimization of using entersyscallblock instead of entersyscall only applies in a limited window of how long the syscall takes: not too short, not too long. I'm skeptical that users will in general be able to predict whether their particular use case falls in that window. Overuse of entersyscallblock will make some programs slower.
In this case, we test first using RawSyscall to perform a non-blocking read. If the read returns EAGAIN, we have more information than the runtime: we know that there is a very high probability that poll will block. The standard Syscall function is optimized for the case where the caller does not know if the call will block. We could use Syscall with a blocking read call, but here too we have more information: we know that the vast majority of read calls will succeed.
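A hedged sketch of that pattern (the function name and error handling are illustrative, not netstack's actual code; it assumes fd is non-blocking and buf is non-empty):

```go
package example

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// readOrPoll first attempts the read via RawSyscall, skipping
// entersyscall because the read is expected to return immediately.
// Only on EAGAIN does it fall back to poll, which is now very
// likely to block.
func readOrPoll(fd int, buf []byte) (int, error) {
	for {
		n, _, errno := unix.RawSyscall(unix.SYS_READ,
			uintptr(fd), uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)))
		if errno == 0 {
			return int(n), nil
		}
		if errno != unix.EAGAIN {
			return 0, errno
		}
		// At this point we know poll will almost certainly block,
		// which is exactly where an "expected to block" syscall
		// variant (the proposed SyscallBlock) would pay off.
		pfd := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
		if _, err := unix.Poll(pfd, -1); err != nil && err != unix.EINTR {
			return 0, err
		}
	}
}
```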
We already have the RawSyscall function, which provides a hint to the runtime that the call won't block, or at least won't block for long. A SyscallBlock or BlockingSyscall function would round out the trio.
I don't have any magical solutions. I think the long term solution is to use OS-specific user-mode-scheduling features to implement an efficient entersyscall without sysmon polling. If we do that, we will need just one syscall version that does the right thing, and all blocking and long syscalls will become faster. Blocking calls will release the P immediately; long calls will, on the other hand, hold the P for longer (the CPU is busy anyway). sysmon always looked like a hack; it just turned out to work reasonably well.

Windows has had a UMS subsystem since Vista, I think. But we still have not open-sourced our fibers/switchto patches for Linux. I was wondering whether perf tracing of sched events could be used for this. perf can trace context switches: https://elixir.bootlin.com/linux/v5.0-rc2/source/include/uapi/linux/perf_event.h#L372 and if tracing is done only for the current process, it should not require any privileges. So we could read out sched_out events for our process, and if the blocking thread happens to be in a syscall/cgo, we retake its P. We would still need to retake Ps after some time, since we promise that GOMAXPROCS is the number of threads running Go code, but this backup retaking can happen at a significantly lower rate (100ms or something).

The kernel also says whether the task was actually blocked or just preempted: https://elixir.bootlin.com/linux/v5.0-rc2/source/kernel/events/core.c#L7605 which is nice because we don't need to retake the P on preemption. But it seems that the kernel will provide the prev tid only in whole-CPU profiling mode: https://elixir.bootlin.com/linux/v5.0-rc2/source/kernel/events/core.c#L7558 and that should require some privileges (besides dumping tons of uninteresting events). It looks like just an unimplemented part, because the kernel should be able to provide the pid/tid of the current process even in process-profiling mode (i.e. not disclose the tids of other processes, but always provide the tid for the current process). And without the tid we know that some thread blocked, but we don't know which one...
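A minimal, untested sketch of that idea using golang.org/x/sys/unix (the attribute choices are assumptions, and real use would also need the mmap ring-buffer plumbing to read the records):

```go
package main

import (
	"fmt"
	"os"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	// A dummy software event: it produces no samples of its own; we
	// only want the PERF_RECORD_SWITCH side-band records that the
	// context_switch bit enables.
	attr := &unix.PerfEventAttr{
		Type:        unix.PERF_TYPE_SOFTWARE,
		Config:      unix.PERF_COUNT_SW_DUMMY,
		Size:        uint32(unsafe.Sizeof(unix.PerfEventAttr{})),
		Bits:        unix.PerfBitContextSwitch,
		Sample_type: unix.PERF_SAMPLE_TID | unix.PERF_SAMPLE_TIME,
		Wakeup:      1, // wake up the reader on every record
	}
	// pid=0, cpu=-1: trace the calling thread on any CPU, which
	// should not require extra privileges. A real implementation
	// would presumably need one event per thread (M).
	fd, err := unix.PerfEventOpen(attr, 0, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		fmt.Fprintln(os.Stderr, "perf_event_open:", err)
		return
	}
	defer unix.Close(fd)

	// PERF_RECORD_SWITCH records would then be read from a ring
	// buffer mmap'ed on fd; that plumbing is omitted here.
	fmt.Println("tracing context switches on fd", fd,
		"record type:", unix.PERF_RECORD_SWITCH)
}
```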
There are also a few other incremental improvements, like checking the thread state in sysmon. I.e., sysmon could retake a P the first time it sees a thread in a syscall if the thread's status is blocked. This may cut the latency in half. Another potential improvement is making sysmon not a dedicated thread, but rather integrating it into the scheduler/netpoll, so that when it retakes a P it immediately starts running rather than waking up another thread. But it is hard to say whether these intermediate solutions are worth the additional complexity.
There were also several proposals to export a more flexible network polling API that allows not using a goroutine per read and not preallocating read buffers. That's what netstack tries to achieve, right?
netstack is aiming for high TCP performance with low CPU usage. We have gone to great lengths to improve both of these. If you look through the repo, you will find other hacks (e.g. netstack's sleep package).
I'm still uncomfortable committing to long-term API made available to everyone, just for gVisor's very unique needs. gVisor already reaches into the runtime internals for some things, and that seems preferable, even if it is a little more work for gVisor to keep up.
Stepping back a bit, why does gVisor avoid the runtime's poller? If there's something the runtime poller is doing badly, why don't we fix that for everyone instead of giving out tons of low-level knobs?
Other people who decided that the runtime poller is not suitable for them mentioned the memory overhead of a goroutine per connection. I am not sure about preallocation of read buffers. (Windows allows issuing 0-sized reads, but I am not sure we support this; it's also possible to issue a 1-byte read first and then read out the rest, but this becomes messy and won't work for 1-byte packets; it also won't work with any standard packages, as they tend to allocate buffers eagerly; but maybe standard packages are not used in these contexts anyway.) Overall I think people just want raw readiness notifications delivered to a specified goroutine/channel.
The current implementation of netpoll has some performance problems. (See golang.org/cl/78915 for more information.) For this reason, we don't allow use of netpoll in gVisor.
To go into more detail, there are a couple reasons we avoid netpoll:
1. We have a small, fixed set of FDs we are polling on (I think it is typically three). Since there are so few, having one system thread per FD is fine, and receiving the notifications directly slightly improves latency.
2. As Ian alludes to, using netpoll makes runtime.findrunnable more expensive. It adds a bit more latency to context switching, and adds significantly to background CPU usage, which we are sensitive to (though less so than latency). This overhead is particularly notable in workloads with lots of context switching (in our case, mostly goroutines blocking and unblocking each other in quick succession).
In addition, gVisor often runs on platforms (e.g., KVM) where system calls can be 2-3x more expensive than in native Linux processes. This can exacerbate overhead in places like findrunnable, which behave OK natively but degrade with more expensive syscalls.
I'll see if I can dig up the numbers we found for (1) and (2).
I unfortunately can't share the complete details, but for (2), we had a real user application running in gVisor where netpoll consumed 10% of cycles. Skipping the netpoll check (http://golang.org/cl/78915) saved those cycles and improved the application's wall-clock runtime by 5%*.
*Along with http://golang.org/cl/78538, which reduced the usleep from 8% of cycles to 2%.
cc @nixprime
Interesting, so it's not memory consumption, it's latency and CPU usage. netpoll is now integrated into the scheduler and it generally avoids context switches for new network work, i.e. a thread receives a notification from epoll and the next moment switches to running the unblocked goroutine. It's interesting that you still see a latency reduction after switching to manual polling. The usleep time comes from runqgrab, not from netpoll, so it's also interesting that you eliminated that time. I am assuming the time you saved in usleep did not pop up in another place :) I guess there are more things to tune in netpoll. I don't see fundamental reasons why it should be slower than your manual solution. Now that we have the execution tracer, it would be interesting to see these sources of latency in the trace.
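For reference, capturing such a trace is just a few lines with the standard runtime/trace package; runWorkload below is a placeholder for the code under investigation:

```go
package main

import (
	"os"
	"runtime/trace"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Record scheduler, syscall, and netpoll events; inspect the
	// result with `go tool trace trace.out`.
	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	runWorkload()
}

// runWorkload stands in for the code whose latency is being studied.
func runWorkload() {}
```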
Dmitry says "I don't see fundamental reasons why it should be slower than your manual solution."
That seems like a pretty clear signal that we should not complicate the syscall API and instead should be looking at making the poller work well enough that gVisor doesn't need to reimplement it.
Let's focus on that instead, please.
In https://github.com/google/netstack/blob/master/tcpip/link/rawfile/blockingpoll_amd64.s the authors chose to write their own copy of Syscall so that they could call runtime.entersyscallblock rather than runtime.entersyscall. This was reportedly done to improve TCP round trip time when using gVisor (https://github.com/google/gvisor) from 182us to 112us. Presumably this is because using entersyscallblock, which expects the system call to block, releases the blocking P to immediately pick up some other waiting task, rather than waiting for the system monitor thread to notice that the thread has blocked.

That code is currently making this performance gain in an unsafe way, by using a go:linkname comment to reach into an undocumented and unsupported part of the runtime package.
I propose that we provide a safe mechanism for this, by adding syscall.SyscallBlock (and syscall.Syscall6Block), which would provide a way to make a system call that is expected to block. Using syscall.Syscall for a syscall that blocks remains safe; the P will be blocked until the system monitor notices. Using syscall.SyscallBlock for a system call that does not block will be safe; the thread will just immediately wait for the next available P.

After adding this call to the syscall package, we would then of course add it to the golang.org/x/sys packages.
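For concreteness, the proposed signatures would presumably mirror the existing pair (a signature sketch only; these functions do not exist today):

```go
// SyscallBlock is like Syscall, but hints to the runtime that the
// call is expected to block, so the P can be handed off immediately
// (via runtime.entersyscallblock) instead of waiting for sysmon.
func SyscallBlock(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno)

// Syscall6Block is the six-argument variant, mirroring Syscall6.
func Syscall6Block(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2 uintptr, err Errno)
```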
We could also make this change only in the golang.org/x/sys packages. The advantage of doing so is that the syscall package is frozen (http://golang.org/s/go1.4-syscall). The disadvantage is that it would require the x/sys package to reach into the runtime package in a way that will not be visible during a build of the standard library. That choice is workable but seems to me to be slightly less good.
CC @iangudger