golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: syscall: add SyscallBlock #29734

Closed ianlancetaylor closed 5 years ago

ianlancetaylor commented 5 years ago

In https://github.com/google/netstack/blob/master/tcpip/link/rawfile/blockingpoll_amd64.s the authors chose to write their own copy of the Syscall so that they could call runtime.entersyscallblock rather than runtime.entersyscall. This was reportedly done to improve TCP round trip time when using gVisor (https://github.com/google/gvisor) from 182us to 112us. Presumably this is because using entersyscallblock, which expects the system call to block, releases the blocking P to immediately pick up some other waiting task, rather than waiting for the system monitor thread to notice that the thread has blocked.

That code is currently making this performance gain in an unsafe way, by using a go:linkname comment to reach into an undocumented and unsupported part of the runtime package.

I propose that we provide a safe mechanism for this, by adding syscall.SyscallBlock (and syscall.Syscall6Block) which would provide a way to make a system call that is expected to block. Using syscall.Syscall for a syscall that blocks remains safe; the P will be blocked until the system monitor notices. Using syscall.SyscallBlock for a system call that does not block will be safe; the thread will just immediately wait for the next available P.

After adding this call to the syscall package, we would then of course add it to the golang.org/x/sys packages.

We could also make this change only in the golang.org/x/sys packages. The advantage of doing so is that the syscall package is frozen (http://golang.org/s/go1.4-syscall). The disadvantage is that it would require the x/sys package to reach into the runtime package in a way that will not be visible during a build of the standard library. That choice is workable but seems to me to be slightly less good.

CC @iangudger

bradfitz commented 5 years ago

This proposal makes no mention (at least directly) of RawSyscall{,6}:

https://golang.org/pkg/syscall/#RawSyscall

So this means we'd end up with three pairs of functions to make a system call, right? raw, normal, and block.

Part of this bug would then be to document the three ways. Currently there are zero docs on the existing two pairs.

bradfitz commented 5 years ago

unlike the occasionally seen suggestion that we have a way to make a system call that calls neither runtime.entersyscall nor runtime.entersyscallblock

Don't we have that, though? Isn't that RawSyscall? Sure looks like it. Or am I misreading your comment?

ianlancetaylor commented 5 years ago

Gah, you're right about RawSyscall. Perhaps nobody uses it as it is not documented. I will delete that paragraph.

beoran commented 5 years ago

I would say the whole syscall package needs more documentation.

ianlancetaylor commented 5 years ago

Since I forgot about RawSyscall, we could also call this proposed function BlockingSyscall.

rsc commented 5 years ago

@dvyukov, do you have any bright ideas about how we could avoid exposing these two different kinds of syscall? That is, is there anything we could do to make the regular 'entersyscall' faster for gVisor? That would be better all around than exposing yet another kind of syscall API.

If the thread is blocking for a very tiny amount of time, entersyscall is better, because the blocking work is avoided entirely. If the thread is blocking for a long amount of time, that can't be happening very often so the overhead is irrelevant. So this optimization of using entersyscallblock instead of entersyscall only applies in some limited window of how long the syscall takes: not too short, not too long. I'm skeptical that users will in general be able to predict whether their particular use case falls in that window. Overuse of entersyscallblock will make some programs slower.

iangudger commented 5 years ago

In this case, we test first using RawSyscall to perform a non-blocking read. If the read returns EAGAIN, we have more information than the runtime: we know that there is a very high probability that poll will block. The standard Syscall function is optimized for the case where the caller does not know if the call will block. We could use Syscall with a blocking read call, but here too we have more information. We know that the vast majority of read calls will succeed.

We already have the RawSyscall function which provides a hint to the runtime that the call won't block or at least won't block long. A SyscallBlock or BlockingSyscall function would round out the trio.

dvyukov commented 5 years ago

I don't have any magical solutions. I think the long-term solution is to use OS-specific user-mode-scheduling features to implement an efficient entersyscall without sysmon polling. If we do that, we will need just one syscall version that does the right thing, and all blocking and long syscalls will become faster: a blocking syscall will release its P immediately, while a long syscall will, on the other hand, keep its P for longer (the CPU is busy anyway). sysmon always looked like a hack; it just turned out to work reasonably well. Windows has had a UMS subsystem since Vista, I think, but we still have not open-sourced our fibers/switchto patches for Linux.

I was tinkering with whether perf tracing of sched events could be used for this. perf can trace context switches: https://elixir.bootlin.com/linux/v5.0-rc2/source/include/uapi/linux/perf_event.h#L372 and if tracing is done only for the current process, it should not require any privileges. So we could read out sched_out events for our process, and if the blocking thread happens to be in a syscall/cgo, we retake its P. We would still need to retake Ps after some time, since we promise that GOMAXPROCS is the number of threads running Go code, but this backup retaking could happen at a significantly lower rate (100ms or so). The kernel also says whether the task was actually blocked or just preempted: https://elixir.bootlin.com/linux/v5.0-rc2/source/kernel/events/core.c#L7605 which is nice because we don't need to retake the P on preemption. But it seems the kernel will provide the prev tid only in whole-CPU profiling mode: https://elixir.bootlin.com/linux/v5.0-rc2/source/kernel/events/core.c#L7558 and that should require some privileges (besides dumping tons of uninteresting events). It looks like just an unimplemented part, because the kernel should be able to provide the pid/tid of the current process even in process-profiling mode (i.e. not disclose the tids of other processes, but always provide the tid for the current process).
And without the tid we know that some thread blocked, but we don't know which one...

There are also a few other incremental improvements, like checking thread state in sysmon. That is, sysmon could retake a P the first time it sees a thread in a syscall if the thread's status is blocked. This could cut the latency in half. Another potential improvement is making sysmon not a dedicated thread, but rather integrating it into the scheduler/netpoll, so that when it retakes a P it immediately starts running rather than waking up another thread. But it's hard to say whether these intermediate solutions are worth the additional complexity.

There have also been several proposals to export a more flexible network polling API that would allow not using a goroutine per read and not preallocating read buffers. That's what netstack tries to achieve, right?

iangudger commented 5 years ago

netstack is aiming for high TCP performance with low CPU usage. We have gone to great lengths to improve both of these. If you look through the repo, you will find other hacks (e.g. netstack's sleep package).

rsc commented 5 years ago

I'm still uncomfortable committing to long-term API made available to everyone, just for gVisor's unique needs. gVisor already reaches into the runtime internals for some things, and that seems preferable, even if it is a little more work for gVisor to keep up.

Stepping back a bit, why does gVisor avoid the runtime's poller? If there's something the runtime poller is doing badly, why don't we fix that for everyone instead of giving out tons of low-level knobs?

dvyukov commented 5 years ago

Other people who decided that the runtime poller is not suitable for them mentioned the memory overhead of a goroutine per connection. I am not sure about preallocation of read buffers (Windows allows issuing 0-sized reads, but I am not sure we support this; it's also possible to issue a 1-byte read first and then read out the rest, but this becomes messy and won't work for 1-byte packets; it also won't work with any standard packages, as they tend to allocate buffers eagerly, though maybe standard packages are not used in these contexts anyway). Overall I think people just want raw readiness notifications delivered to a specified goroutine/channel.

iangudger commented 5 years ago

The current implementation of netpoll has some performance problems. (See golang.org/cl/78915 for more information.) For this reason, we don't allow use of netpoll in gVisor.

prattmic commented 5 years ago

To go into more detail, there are a couple reasons we avoid netpoll:

  1. We have a small, fixed set of FDs we are polling on (I think it is typically three). Since there are so few, having one system thread per FD is fine, and receiving the notifications directly slightly improves latency.

  2. As Ian alludes to, using netpoll makes runtime.findrunnable more expensive. It adds a bit more latency to context switching, and adds significantly to background CPU usage, which we are sensitive to (though less so than latency). This overhead is particularly notable in workloads with lots of context switching (in our case, mostly goroutines blocking and unblocking each other in quick succession).

In addition, gVisor often runs on platforms (e.g., KVM) where system calls can be 2-3x more expensive than in native Linux processes. This can exacerbate overhead in places like findrunnable, which behave OK natively, but degrade with more expensive syscalls.

I'll see if I can dig up the numbers we found for (1) and (2).

prattmic commented 5 years ago

I unfortunately can't share the complete details, but for (2), we had a real user application running in gVisor where netpoll consumed 10% of cycles. Skipping the netpoll check (http://golang.org/cl/78915) saved those cycles and improved the application's wall-clock runtime by 5%*.

*Along with http://golang.org/cl/78538, which reduced the usleep from 8% of cycles to 2%.

[image: netpoll pprof profile]

cc @nixprime

dvyukov commented 5 years ago

Interesting, so it's not memory consumption, it's latency and CPU usage. netpoll is now integrated into the scheduler, and it generally avoids context switches for new network work: a thread receives a notification from epoll and the next moment switches to running the unblocked goroutine. It's interesting that you still see a latency reduction after switching to manual polling. The usleep time comes from runqgrab, not from netpoll, so it's also interesting that you eliminated that time. I am assuming the time you saved in usleep did not pop up in another place :) I guess there are more things to tune in netpoll; I don't see fundamental reasons why it should be slower than your manual solution. Now that we have the execution tracer, it would be interesting to see these sources of latency in a trace.

rsc commented 5 years ago

Dmitry says "I don't see fundamental reasons why it should be slower than your manual solution."

That seems like a pretty clear signal that we should not complicate the syscall API and instead should be looking at making the poller work well enough that gVisor doesn't need to reimplement it.

Let's focus on that instead, please.