m-lab / ndt-server

docker native ndt5 and ndt7 server with prometheus integration
https://www.measurementlab.net/
Apache License 2.0

Periodic 2 ms pauses in output stream compromise BBR accuracy #132

Open mattmathis opened 5 years ago

mattmathis commented 5 years ago

ndt-server pauses the output stream for approximately 2 ms every 135 ms (every 150 kB of data) for the path in the attached trace.

All three parameters fluctuate (by perhaps 10%) from event to event.

These pauses impair BBR's control accuracy because RTT and rate samples that include "application limited" intervals are not trusted, and are blocked from being used by some of the control algorithms.

I see the pauses in both legacy ndt (using CUBIC) and ndt7 (using BBR) modes.

This trace also exhibits a related BBR bug, which requires a separate investigation (probably fixed in later versions). [Incorrect image removed]

mattmathis commented 5 years ago

Here is the correct image: [Screenshot from 2019-06-10 13-13-17]

mattmathis commented 5 years ago

This trace was collected from a download test from 2001:1900:2100:2d::88 (ndt.mlab2.lga0t) to a wired connection at my house in Mountain View, between 1559953970 (Sat Jun 8 00:32:50 UTC 2019) and 1559954017 (Sat Jun 8 00:33:37 UTC 2019). Capture files: 06080032-triad.tcpdump and 06080032-tcpdump.

bassosimone commented 5 years ago

As we discussed yesterday with @mattmathis, this smells like a GC-related issue.

bassosimone commented 5 years ago

With @evfirerob we've run several experiments. At this stage, I'd say that, if we're witnessing a GC-related issue, it's very difficult to reveal. These are the experiments we ran with v0.9.0:

  1. Run the server (at v0.9.0) with export GODEBUG=gctrace=1 and perform a web-based test from my home connection (i.e. via Wi-Fi). The output did not show any invocation of the GC during the download; however, the GC is invoked several times during the upload:
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:15:54.766904621Z","message":"downloadOrUpload: upgrading to WebSockets"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:15:54.767141871Z","message":"downloadOrUpload: opening results file"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:15:54.767650274Z","message":"saver: start"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:15:54.767675308Z","message":"saver: start waiting for server and client"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:15:54.767722395Z","message":"receiver: start"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:15:54.767836781Z","message":"measurer: start"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:15:54.767664309Z","message":"sender: start"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:15:54.767988012Z","message":"sender: generating random buffer"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:04.768282616Z","message":"measurer: context done"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:04.76836273Z","message":"measurer: stop"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:04.900804784Z","message":"sender: sending Close message"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:04.900885718Z","message":"sender: stop"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:05.462146013Z","message":"receiver: stop"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:05.462317795Z","message":"saver: stop waiting for server and client"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:05.462341367Z","message":"saver: stop"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:05.462348444Z","message":"saver: start draining zipch"}
{"fields":{},"level":"debug","timestamp":"2019-06-11T10:16:05.462360822Z","message":"saver: stop draining zipch"}
[REDACTED] - - [11/Jun/2019:10:15:54 +0000] "GET /ndt/v7/download?library.name=libndt7.js&library.version=0.7 HTTP/1.1" 200 0

[snip]

gc 1 @14.735s 0%: 0.035+1.0+0.021 ms clock, 0.14+0.31/0.58/0.86+0.084 ms cpu, 4->4->1 MB, 5 MB goal, 4 P

[snip]
  2. I enabled tracing using this diff (a related sketch follows the trace analysis below):
diff --git a/logging/logging.go b/logging/logging.go
index 693cd4d..c718021 100644
--- a/logging/logging.go
+++ b/logging/logging.go
@@ -26,5 +26,5 @@ var Logger = log.Logger{
 // access logs, because access logs are a fairly standard format that
 // has been around for a long time now, so better to follow such standard.
 func MakeAccessLogHandler(handler http.Handler) http.Handler {
-       return handlers.LoggingHandler(os.Stdout, handler)
+       return handlers.LoggingHandler(os.Stderr, handler)
 }
diff --git a/ndt7/handler/handler.go b/ndt7/handler/handler.go
index 5bb8f96..0843910 100644
--- a/ndt7/handler/handler.go
+++ b/ndt7/handler/handler.go
@@ -5,6 +5,8 @@ import (
        "context"
        "fmt"
        "net/http"
+       "os"
+       "runtime/trace"

        "github.com/gorilla/websocket"
        "github.com/m-lab/go/warnonerror"
@@ -77,6 +79,8 @@ func (h Handler) downloadOrUpload(writer http.ResponseWriter, request *http.Requ

 // Download handles the download subtest.
 func (h Handler) Download(writer http.ResponseWriter, request *http.Request) {
+       trace.Start(os.Stdout)
+       defer trace.Stop()
        h.downloadOrUpload(writer, request, spec.SubtestDownload, download.Do)
 }

and obtained this analysis of the life of the sender goroutine, when running a test from my house:

[Screenshot 2019-06-11 at 15 52 09]

where most of the time was spent in the write system call and no GC was invoked.

This is instead from a test run from cloud to cloud:

[Screenshot 2019-06-11 at 15 56 03]

where instead most of the time was spent spinning and again there was no GC involved.

These two data points seem to suggest we're not hunting a GC issue.
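
(As an aside, here is a minimal sketch, not the ndt-server code, of writing the execution trace to a dedicated file rather than os.Stdout, which keeps it separate from other output; the resulting file can then be opened with go tool trace, the standard viewer for per-goroutine timelines like the ones shown above. The file name is arbitrary.)

package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	// Write the execution trace to a file, then inspect it with
	// `go tool trace sender.trace` (web UI with per-goroutine timelines,
	// syscalls, and GC events).
	f, err := os.Create("sender.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... the code under test (e.g. the download handler) would run here ...
}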

While performing several tests, we also noticed that fq_codel was enabled rather than fq. It is my understanding that BBR expects to be used along with fq rather than fq_codel because of packet pacing on Linux < 4.13. Since I'm running Linux 5.0 in GCE, these tests are probably not affected, but IIRC the platform is on 4.9; the CoreOS-based nodes, though, should be immune because they're at 4.19.25-coreos. (So this is actually a non-finding.)

mlab-code-reviews commented 5 years ago

If we get a go tool trace log, we should be able to match the GC cycles with the 2ms gaps. If GC is running at the same interval, that would be telling.

Normally, GC runs less than once per second, so I'm not convinced it is GC. It is still worthwhile to look at the GC stats, though, if only on a local workstation. I would also expect GC to scale with bandwidth and possibly with the number of tests.
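
(A minimal sketch, not something currently in ndt-server, of logging GC pause durations and timestamps so they could be lined up against the 2 ms gaps; it uses only the standard library's runtime/debug package.)

package main

import (
	"log"
	"runtime/debug"
	"time"
)

func main() {
	// Once per second, report the most recent GC pause and when it ended, so
	// the timestamps can be correlated with gaps seen in packet traces.
	var stats debug.GCStats
	for range time.Tick(time.Second) {
		debug.ReadGCStats(&stats)
		if len(stats.Pause) > 0 {
			log.Printf("numGC=%d lastGC=%s lastPause=%s",
				stats.NumGC, stats.LastGC.Format(time.RFC3339Nano), stats.Pause[0])
		}
	}
}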

bassosimone commented 5 years ago

@gfr10598 you may want to understand why your comment was posted on behalf of @mlab-code-reviews (maybe that's intended behaviour but I still found it rather confusing).

> If we get a go tool trace log, we should be able to match the GC cycles with the 2ms gaps. If GC is running at the same interval, that would be telling.

I agree. I can share several trace logs. Maybe we can have a chat later, if you wish?

As I mentioned above and as shown in the screenshot, the trace log does not show any GC activation during the download because there is actually nothing to GC. The main loop keeps sending the same prepared message, which is AFAICT akin to continually sending the same static buffer.

(However, a subsequent test, using much larger outgoing messages showed some activations.)
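
(For concreteness, a minimal sketch of that pattern with gorilla/websocket; this is an illustration of the idea rather than the actual ndt-server sender. One prepared message is built up front and re-sent until the context expires, so the steady-state loop allocates essentially nothing for the GC to collect.)

package sketch

import (
	"context"

	"github.com/gorilla/websocket"
)

// sendPrepared re-sends a single prepared message until ctx expires. The
// message is serialized (and optionally compressed) once by
// NewPreparedMessage, so the loop itself creates almost no garbage.
func sendPrepared(ctx context.Context, conn *websocket.Conn, data []byte) error {
	prepared, err := websocket.NewPreparedMessage(websocket.BinaryMessage, data)
	if err != nil {
		return err
	}
	for ctx.Err() == nil {
		if err := conn.WritePreparedMessage(prepared); err != nil {
			return err
		}
	}
	return ctx.Err()
}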

> Normally, GC runs less than once per second, so I'm not convinced it is GC. It is still worthwhile to look at the GC stats, though, if only on a local workstation. I would also expect GC to scale with bandwidth and possibly with the number of tests.

I am also not convinced it's the GC. I've run another test from my home to a GCE server with GC debugging enabled and with messages of 1<<24 bytes (as opposed to the default size).

Here's the log during the download (trimmed for readability):

{"timestamp":"2019-06-11T14:43:23.56369389Z","message":"receiver: start"}
gc 1 @7.225s 0%: 0.064+1.2+0.019 ms clock, 0.25+0/0.69/0.82+0.077 ms cpu, 19->19->17 MB, 20 MB goal, 4 P
gc 2 @7.894s 0%: 0.020+7.0+0.006 ms clock, 0.081+0.013/0.81/6.5+0.024 ms cpu, 33->33->33 MB, 35 MB goal, 4 P
gc 3 @7.915s 0%: 0.020+4.7+0.016 ms clock, 0.081+0/0.37/5.5+0.066 ms cpu, 49->49->33 MB, 67 MB goal, 4 P
{"timestamp":"2019-06-11T14:43:39.64449454Z","message":"measurer: context done"}

and during the upload:

{"timestamp":"2019-06-11T14:43:45.9586303Z","message":"upload: starting measurement loop"}
gc 4 @31.194s 0%: 0.017+0.96+0.006 ms clock, 0.068+0.34/0.55/1.0+0.025 ms cpu, 47->47->1 MB, 67 MB goal, 4 P
gc 5 @31.446s 0%: 0.009+0.85+0.006 ms clock, 0.036+0.28/0.49/0.73+0.027 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
gc 6 @31.752s 0%: 0.010+0.98+0.007 ms clock, 0.041+0.23/0.52/0.63+0.031 ms cpu, 4->4->1 MB, 5 MB goal, 4 P

[snip]

gc 36 @39.301s 0%: 0.008+1.0+0.010 ms clock, 0.032+0.18/0.55/0.84+0.040 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
gc 37 @39.614s 0%: 0.010+1.0+0.017 ms clock, 0.041+0.32/0.53/0.73+0.068 ms cpu, 4->4->1 MB, 5 MB goal, 4 P
gc 38 @40.093s 0%: 0.007+1.0+0.004 ms clock, 0.029+0.23/0.64/0.96+0.018 ms cpu, 4->4->1 MB, 5 MB goal, 4 P

So, we have 40 invocations of the GC, most of which during the upload.
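
For reference, per the Go runtime documentation for GODEBUG=gctrace=1, each of these lines has the form

gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P

where gc # is the GC number, @#s is seconds since program start, #% is the fraction of time spent in GC so far, the three clock figures are the stop-the-world sweep termination, concurrent mark/scan, and stop-the-world mark termination phases, the three-way split in the cpu figures breaks mark/scan into assist, background, and idle time, #->#-># MB is the heap size at GC start, at GC end, and the live heap, # MB goal is the goal heap size, and # P is the number of processors used. In the lines above, the stop-the-world portions add up to a few tens of microseconds per cycle, well short of 2 ms.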

bassosimone commented 5 years ago

> (However, a subsequent test, using much larger outgoing messages showed some activations.)

I think there's a PEBCAK here. The variable was set outside the container, so I can't tell whether it was ndt7 or Docker (also written in Go) that emitted the GC info. [Edit: it seems it was Docker.]

pboothe commented 5 years ago

The easiest way to determine whether it is the GC is to disable it by running the server with the environment variable GOGC=off and seeing whether the problem still happens. This should be easy to set up in a sandbox branch: add

{
  name: "GOGC",
  value: "off",
},

to https://github.com/m-lab/k8s-support/blob/master/k8s/daemonsets/experiments/ndt.jsonnet#L26 and then push to a branch whose name matches the glob sandbox-*.
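
(For a quick local experiment the same effect can be obtained programmatically; a minimal sketch, not a proposed change to ndt-server, using the standard library's runtime/debug package:)

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// SetGCPercent(-1) disables the collector entirely, mirroring GOGC=off;
	// it returns the previous setting so it can be restored later.
	prev := debug.SetGCPercent(-1)
	fmt.Printf("GC disabled (previous GOGC=%d)\n", prev)

	// ... run the code under test here, then compare traces with GC on/off ...
}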

pboothe commented 5 years ago

The more I read about things, the more I think it might be fq_codel (instead of fq) as our qdisc and not Go's GC, but that opinion is based on docs not data.

e.g. see @evfirerob's issue: https://github.com/m-lab/ops-tracker/issues/733

bassosimone commented 5 years ago

> The more I read about things, the more I think it might be fq_codel (instead of fq) as our qdisc and not Go's GC, but that opinion is based on docs not data.

Yesterday I had a long chat with @mattmathis and he told me this should not be an issue for Linux >= 4.17, because since then a packet pacer is bundled with the kernel and used if needed.

pboothe commented 5 years ago

I would still like to try fq. fq_codel has some smarts that may be inappropriate for our use case. It's normally the right choice, but for our particular case, let's try fq and see if problems go away.

mattmathis commented 5 years ago

In this trace, it is tcpdump that fails to run, but the data transmission continues. TCP requires that there be data (tiny vertical white arrows) for the entire extent of the range (y-axis). Clearly the data proceeds, but tcpdump failed to capture the packets.

[Screenshot from 2019-06-12 14-38-23]

pboothe commented 5 years ago

This smells to me like a locking problem. What happens if we do all of this while tcp-info is turned off? I hope the same bugs occur, but I worry they might not.

bassosimone commented 5 years ago

As we discussed in a recent meeting, this looks less scary than P0, probably P1.

mattmathis commented 5 years ago

kubectl describe node mlab2.lga0t reports in part:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                412m (1%)   470m (1%)
  memory             656Mi (4%)  1276Mi (8%)
  ephemeral-storage  0 (0%)      0 (0%)

This looks like it might be a limit of 0.47 CPUs (averaged over some window).
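
(If that limit is in play, one way to check whether CPU-quota throttling is actually suspending the process is to read the cgroup's cpu.stat; a minimal sketch, assuming the cgroup v1 path as commonly seen from inside a container:)

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// nr_periods, nr_throttled, and throttled_time (ns) show whether the CFS
	// quota implied by the ~0.47 CPU limit is pausing the server. The path
	// below assumes cgroup v1; adjust for the actual environment.
	b, err := os.ReadFile("/sys/fs/cgroup/cpu,cpuacct/cpu.stat")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(b))
}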

bassosimone commented 5 years ago

Since we're starting to suspect that this is a platform issue, I don't think I'm the best person to be assigned to this issue. I'm going to assign it to @pboothe so that there is an assignee, and leave it to him to choose whether he's the right person to investigate or whether someone else should be.

pboothe commented 4 years ago

Part of the "validate the platform" KR

stephen-soltesz commented 4 years ago

Please, @mattmathis, could you check whether you're still seeing this issue?