runtime: 40us pause in file-backed serve loops

dbentley commented 8 years ago

Summary: An idiomatic serve loop that reads from a file (as opposed to a socket) pauses for 40us before the request goroutine starts. We isolated this from a FUSE filesystem into a small benchmark.

The idiomatic serve loop (Cf. src/net/http/server.go Serve and bazil.org/fuse/fs/serve.go Serve) processes a request on a new goroutine:

for {
  req := accept()
  go process(req)
}

We compared it against a no-concurrency serial server:

for { 
  req := accept()
  process(req)
}

We measured the overhead (from the end of accept() to beginning of process()). Median of 100k runs for each strategy: Serial: 441ns Goroutine: 38us

For context, we've found the rest of our logic can run in ~10us, so the go scheduling overhead would be 400%

We tried a channel:

go func() {
  for {
    process(<-ch)
  }
}
for {
  req := accept()
  ch <- req
}

Which is comparable to a Goroutine: Serial: 441ns Goroutine: 38us Channel: 40us

Most servers read/accept from a socket, which the Go runtime implements using polling via the netpoller. FUSE reads from a file, which uses a blocking syscall. AIUI, the Go runtime will let the thread make the syscall, and then another thread of the runtime will notice it's blocked after ~20 us, and then reschedule work. These numbers seem similar which makes me think they're related.

Our workaround: service one process (both accept and process) on one goroutine and hand off the next request to a new goroutine. Strategy "handoff" looks like:

func serve() {
  req := accept()
  go serve()
  process(req)
}

Serial: 441ns Goroutine: 38us Channel: 40us Handoff: 671ns

I'm seeing this on Go version 1.6 darwin/amd64

Is this a known issue? Does it affect other platforms? Is our rewriting of the serve loop a known workaround?

Our full benchmark: http://play.golang.org/p/wc6hPnN778

Build it into a binary fileping. To run: rm a b; mkfifo a; mkfifo b; ./bin/fileping -print -strategy handoff < b > a& ./bin/fileping -strategy serial < a > b& echo 000000000000000 > a

(Modify the -strategy argument in the first command to try different strategies)

bradfitz commented 8 years ago

/cc @aclements

minux commented 8 years ago

The measurement ignores time the connection spent in the kernel queue before Go picks them up.

If you measure request latency from the client, I doubt the handoff based solution helps much (it just moves the latency from measured time to unmeasured time so the measured latency might appear to be less, but the latency is actually still there.)

Optimizing the delay before a new goroutine starts running will have a negative impact on system throughput, so we must find a balance here (as you've discovered, the obvious solution is for the runtime to immediately preempt the current goroutine and execute the newly created one, but then the accept goroutine must migrate to another thread, and that will hurt future connections.)

dbentley commented 8 years ago

The benchmark that I showed has two servers/clients talking to each other. With strategy goroutine, the program takes 19s. With strategy handoff it takes 1.9s, so there is an improvement in the RTT.

Would it help if I made a benchmark that captured the latency from the client side?

golang / go

runtime: 40us pause in file-backed serve loops #15110