golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

all: kubernetes slower execution since Go 1.4.3 #14396

Closed. rsc closed this issue 8 years ago.

rsc commented 8 years ago

The Kubernetes team found that updating from Go 1.4.2 to Go 1.4.3 caused a performance regression for them, and also that Go 1.5.3 was bad as well.

The 1.4.3 bug is kubernetes/kubernetes#17524, "fixed" by reverting to Go 1.4.2. The error report at the top of the bug is difficult for me to interpret: to me it says mainly that 1 != 0 and 3 != 0. A comment on the next bug says “... but we can't do what InfluxDB did, because we have some pretty massive issues on 1.4.3. 1.4.2 was the last even vaguely happy point.” Right now, I don't have any way to quantify the 1.4.2 to 1.4.3 problem other than that remark. (The InfluxDB note is referring to #14189, but we don't know that they are necessarily related.)

The 1.5.3 bug is kubernetes/kubernetes#20400, "fixed" by reverting to Go 1.4.2 again. That bug has clearer data: on a particular load test, tail latencies are significantly higher in 1.5.3.

The Kubernetes team created kubernetes/kubernetes#20656 to track further investigation of these regressions. This is the corresponding issue on the Go side.

lavalamp commented 8 years ago

Thanks for filing this and again many apologies that we failed to do so!

wojtek-t commented 8 years ago

@rsc - sorry for the late response

Yes - the reason for reverting Go 1.4.3 was exactly the same (latencies were significantly higher than with Go 1.4.2). And I guess it should be much easier to pin down the cause between those two versions than between Go 1.4.2 and Go 1.5.3.

Regarding running kubemark - yes, it's definitely possible for you to run it. A detailed explanation of how to do it is here: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/kubemark-guide.md
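In short, the flow boils down to roughly the following (a sketch only - the sizes here are just examples, and the exact steps are in the guide):

# build a release, bring up a small 'external' cluster, start the kubemark master, then run the tests
make quick-release
NUM_NODES=3 NODE_SIZE=n1-standard-4 go run hack/e2e.go -v -up
NUM_NODES=100 MASTER_SIZE=n1-standard-4 ./test/kubemark/start-kubemark.sh
test/kubemark/run-e2e-tests.sh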

Please let me know if you need any help with that (feel free to also email me directly at wojtekt@google.com)

bradfitz commented 8 years ago

@wojtek-t, I'm really surprised at any regression from 1.4.2 to 1.4.3. Those two releases are basically identical. There are only four changes. See the commits on Sep 21, 2015: https://github.com/golang/go/commits/release-branch.go1.4

cb65428710d70abdaf101defa9cd7eaddff9d925 seems unlikely. It only affects error paths. 8f3395902d3f57887dd11b60d837785107304df4 blocks multiple HTTP Content-Length headers. I doubt you were depending on that. 8f429671248bbcf956fa8a1b7c3a1072285a3b8b is about HTTP header field names with spaces in them. I doubt you were using those. e938de22be4ee63d65152024f75d62482d3d69b4 fixed #10135 but if you hit that, you would be panicking, not getting slower.
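For reference, the same delta can be listed locally (a sketch; it assumes the release tags in a clone of the main Go repo):

# list everything that changed between the two point releases
git clone https://go.googlesource.com/go
cd go
git log --oneline go1.4.2..go1.4.3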

So, I suspect either your Kubernetes benchmarks are bogus (or at best noisy and statistically useless), or something else changed in between benchmarking Go 1.4.2 and Go 1.4.3: machine types? GCE zone? Kubernetes rev itself?

Were you ever able to reliably reproduce the purported Go 1.4.2 to Go 1.4.3 performance regression with all else being equal, or was this the result of just two data points? And if the latter, how far apart in time were those two data points?

wojtek-t commented 8 years ago

@bradfitz - yes - we are very, very, very reliably able to reproduce it - it wasn't caused by machine types, zone, or anything else - we were using exactly the same code and environment and the only difference was the Go version.

Regarding our benchmarks - those are not really benchmarks, those are load tests on a real cluster etc. And I wouldn't call them bogus or statistically useless - we found a number of regressions in our code and fixed a huge number of real problems thanks to them. So they really work.

It's definitely something around the Go version (maybe we are using Go in some unexpected way in places? - but it has to do with the Go version).

wojtek-t commented 8 years ago

@gmarek - FYI

bradfitz commented 8 years ago

@wojtek-t, thanks. This is the first we'd heard about this, so pardon my skepticism. I had to ask the usual questions. I look forward to seeing how this mystery is resolved.

wojtek-t commented 8 years ago

@bradfitz no worries - we were also initially skeptical about this - but reverting back to Go 1.4.2 solved the problem...

OneOfOne commented 8 years ago

@wojtek-t did you run a git bisect by any chance?

rsc commented 8 years ago

@OneOfOne, the difference was the golang:1.4.2 to golang:1.4.3 docker image. There is no opportunity for bisect.

Everyone, please stop suggesting things for the Kubernetes team to try. This one is on us right now, and I'd rather they spend their time answering questions about how to reproduce the problem than answer questions about random debugging they did or did not do. Thanks.

rsc commented 8 years ago

@wojtek-t, thanks for the details. I got stuck, question below. I am trying to reproduce the problem using the kubemark-guide doc you linked to earlier. I tried kubernetes v1.1.8 since that appeared to be the latest non-alpha, non-beta release, but that git tag seems to have no test/kubemark. So now I am using dff7490, which is the revision that showed the kubemark failure going from 1.4.2 to 1.5.3.

I'm stuck right after 'make quick-release'. From the guide, the next step appears to be

NUM_NODES=1000 MASTER_SIZE=n1-standard-32 ./test/kubemark/start-kubemark.sh

but I get:

$ NUM_NODES=1000 MASTER_SIZE=n1-standard-32 ./test/kubemark/start-kubemark.sh
Please install kubectl before running this script
$ 

Where does 'make quick-release' write its output, and how do I unpack that output to install kubectl? Or am I supposed to be using some system version of kubernetes? Thanks.

bradfitz commented 8 years ago

@rsc, what is the docker image you refer to?

The official Docker golang one I assume?

One thing I note looking at https://imagelayers.io/?images=golang:1.4.2,golang:1.4.3 is that 1.4.3 added the "procps" package to the image,

RUN apt-get update && apt-get install -y --no-install-recommends bzr git mercurial openssh-client subversion procps && rm -rf /var/lib/apt/lists/*

(in the 4th layer)

... but I don't see that reflected in the commit which updates from Go 1.4.2 to Go 1.4.3: https://github.com/docker-library/golang/commit/a4f3927494b48c7bdb6ea6edac8f89818853b45b

So if procps was silently added, what else changed in the FROM buildpack-deps:jessie-scm base image meanwhile?
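One way to compare the two images directly, assuming the old tags can still be pulled (a sketch, not something run here):

docker pull golang:1.4.2
docker pull golang:1.4.3
# compare the layer commands of each image
docker history --no-trunc golang:1.4.2
docker history --no-trunc golang:1.4.3
# and diff the recorded image metadata
docker inspect golang:1.4.2 > golang-1.4.2.json
docker inspect golang:1.4.3 > golang-1.4.3.json
diff golang-1.4.2.json golang-1.4.3.json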

rsc commented 8 years ago

@bradfitz, exactly. I'd like first to be able to reproduce the kubemark delta; then the fact that we're talking about the Docker golang image 1.4.2 -> 1.4.3, and not just the Go distribution 1.4.2 -> 1.4.3, seems like a good thing to check. But so far I can't run kubemark at all.

ixdy commented 8 years ago

@rsc Most of the Kubernetes test scripts automatically try to find kubectl in the build output tree. This one apparently doesn't. We should fix that. (Where exactly it ends up depends a little bit on how it's built, but it'd be somewhere under _output.)
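For example, something like this should surface the built client and put it on PATH (a sketch; the exact location depends on how the release was built, the dockerized path below is just the common case):

# locate the kubectl produced by 'make quick-release'
find _output -type f -name kubectl
# e.g. for the dockerized build:
export PATH="$PWD/_output/dockerized/bin/linux/amd64:$PATH"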

You should also be able to install an "official" version with gcloud components install kubectl. I don't think the version of kubectl should matter for this test, though @wojtek-t or @gmarek can confirm.

gmarek commented 8 years ago

It doesn't. You also probably should reduce the number of Nodes to 100 and the Master size to n1-standard-4 - a 1000 Node cluster needs ~100 cores (~70 for Nodes and 32 for Master).

gmarek commented 8 years ago

Note that you also need to start an 'external' cluster for Kubemark. For a 100 Node one you'd need ~12 cores in Nodes and a 2 core master, so edit cluster/gce/config-test.sh by setting NUM_NODES=3 and NODE_SIZE=n1-standard-4, and start a cluster using hack/e2e.go -v -up

bradfitz commented 8 years ago

procps was added in https://github.com/docker-library/buildpack-deps/commit/1845b3f918f69b4c97912b0d4d68a5658458e84f

That Dockerfile is based on the "curl" micro layer, which has no changes, but the curl layer is based on "debian:jessie", whose history I'm finding hard to track. How do I find the history of a Docker image's tag?

There's this tag-history page: https://github.com/docker-library/docs/blob/master/debian/tag-details.md#debianjessie, which has history:

https://github.com/docker-library/docs/commits/master/debian/tag-details.md

The commits updating debian:jessie can be found by cloning https://github.com/docker-library/docs.git and looking at e.g. git log -L 97,132:debian/tag-details.md
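Spelled out, the archaeology is (the line range just covers the debian:jessie entry in that file):

git clone https://github.com/docker-library/docs.git
cd docs
# follow the history of just the debian:jessie section of the tag list
git log -L 97,132:debian/tag-details.md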

It appears there were updates on:

Mar 27 2015
.... missing history ....
Sep 07 2015
Oct 22 2015
Nov 09 2015
Nov 14 2015
Nov 20 2015
Dec 04 2015
Jan 06 2016
Jan 25 2016
Feb 16 2016

Unfortunately, it looks like the bot updating the tag history didn't begin life until Wed Sep 9 11:31:58 2015 -0700 when Tianon Gravi first ran it, before "Docker Library Bot" took over.

So "golang:1.4.3" (updated on Sep 23, 2015) was built with a known "debian:jessie" hash, but I can't figure out what "golang:1.4.2" was built with. Maybe @jfrazelle can help me do archaeology.

rsc commented 8 years ago

OK, still following the doc posted earlier. I can run start-kubemark.sh, but the output looks maybe not quite right:

$ NUM_NODES=100 MASTER_SIZE=n1-standard-4 ./test/kubemark/start-kubemark.sh
docker build -t gcr.io/kubertest2/kubemark .
Sending build context to Docker daemon 58.54 MB
Step 1 : FROM debian:jessie
 ---> 040bf8e08425
Step 2 : COPY kubemark.sh /kubemark.sh
 ---> Using cache
 ---> 0f42238a0dce
Step 3 : RUN chmod a+x /kubemark.sh
 ---> Using cache
 ---> 4db91651de2f
Step 4 : COPY kubemark /kubemark
 ---> 14b35b4c46fa
Removing intermediate container b6a6e29c624f
Successfully built 14b35b4c46fa
gcloud docker push gcr.io/kubertest2/kubemark
The push refers to a repository [gcr.io/kubertest2/kubemark] (len: 1)
14b35b4c46fa: Pushed 
4db91651de2f: Image already exists 
0f42238a0dce: Image already exists 
040bf8e08425: Image already exists 
73e8d4f6bf84: Image already exists 
latest: digest: sha256:b67b4eab4b1099bc4da74231e18f9fba56e5f66bcffb424c4179021d986e2155 size: 7441
Created [https://www.googleapis.com/compute/v1/projects/kubertest2/zones/us-central1-b/disks/kubernetes-kubemark-master-pd].
NAME                          ZONE          SIZE_GB TYPE   STATUS
kubernetes-kubemark-master-pd us-central1-b 20      pd-ssd READY
Created [https://www.googleapis.com/compute/v1/projects/kubertest2/zones/us-central1-b/instances/kubernetes-kubemark-master].
NAME                       ZONE          MACHINE_TYPE  PREEMPTIBLE INTERNAL_IP EXTERNAL_IP     STATUS
kubernetes-kubemark-master us-central1-b n1-standard-4             10.128.0.2  104.197.228.204 RUNNING
Created [https://www.googleapis.com/compute/v1/projects/kubertest2/global/firewalls/kubernetes-kubemark-master-https].
NAME                             NETWORK SRC_RANGES RULES   SRC_TAGS TARGET_TAGS
kubernetes-kubemark-master-https default 0.0.0.0/0  tcp:443          kubemark-master
Warning: Permanently added '104.197.228.204' (ECDSA) to the list of known hosts.
Warning: Permanently added '104.197.228.204' (ECDSA) to the list of known hosts.
kubernetes-server-linux-amd64.tar.gz                                                                                                      100%  240MB  48.1MB/s   00:05    
start-kubemark-master.sh                                                                                                                  100% 2351     2.3KB/s   00:00    
configure-kubectl.sh                                                                                                                      100% 1003     1.0KB/s   00:00    
Warning: Permanently added '104.197.228.204' (ECDSA) to the list of known hosts.
Unable to find image 'gcr.io/google_containers/etcd:2.2.1' locally
Pulling repository gcr.io/google_containers/etcd
fbea2d67e633: Pulling image (2.2.1) from gcr.io/google_containers/etcd
fbea2d67e633: Pulling image (2.2.1) from gcr.io/google_containers/etcd, endpoint: https://gcr.io/v1/
fbea2d67e633: Pulling dependent layers
bf0f46991aed: Pulling metadata
bf0f46991aed: Pulling fs layer
bf0f46991aed: Download complete
3d5bcd78e074: Pulling metadata
3d5bcd78e074: Pulling fs layer
3d5bcd78e074: Download complete
5673765ccce5: Pulling metadata
5673765ccce5: Pulling fs layer
5673765ccce5: Download complete
d5bf32328e50: Pulling metadata
d5bf32328e50: Pulling fs layer
d5bf32328e50: Download complete
fbea2d67e633: Pulling metadata
fbea2d67e633: Pulling fs layer
fbea2d67e633: Download complete
fbea2d67e633: Download complete
Status: Downloaded newer image for gcr.io/google_containers/etcd:2.2.1
gcr.io/google_containers/etcd: this image was pulled from a legacy registry.  Important: This registry version will not be supported in future versions of docker.
7c805352cdba641901dbfb7bc21d1efe1fd177bff0c0221fe83b2141d2181277
4464daddf21c06e00ca572212edb287132aa1dcb805d895cfa23d4e23b7d9ae0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) couldn't connect to host
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0    987      0 --:--:-- --:--:-- --:--:--  2000
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Is that successful output? Looks like maybe not? I am using git checkout dff7490c57f309 because that was where the 1.5.3 regression was recorded.

Since I don't know if that command succeeded, I tried the next one. It definitely says it failed, but I can't quite infer whether it's because the previous command failed or something else. Full output is below but the main complaint seems to be:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:465
  [Performance] should allow starting 30 pods per node [BeforeEach]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:463

  Expected
      <int>: 0
  not to be zero-valued

I do agree that 0 is in fact zero-valued, but I am not sure what the point is. Full output:

$ test/kubemark/run-e2e-tests.sh
Kubemark master name: kubernetes-kubemark-master
Your active configuration is: [default]

Project: kubertest2
Zone: us-central1-b
Using master: kubernetes-kubemark-master (external IP: 104.197.228.204)
Setting up for KUBERNETES_PROVIDER="kubemark".
Your active configuration is: [default]

Project: kubertest2
Zone: us-central1-b
I0307 14:12:39.417839   52136 e2e.go:237] Starting e2e run "8f35cae5-e498-11e5-92f3-8cdcd443ecd8" on Ginkgo node 1
Running Suite: Kubernetes e2e suite
===================================
Random Seed: 1457377959 - Will randomize all specs
Will run 1 of 221 specs

Mar  7 14:12:39.440: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

Mar  7 14:12:39.445: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Mar  7 14:12:39.651: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Mar  7 14:12:39.651: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
Density [Skipped] 
  [Performance] should allow starting 30 pods per node
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:463
[BeforeEach] Density [Skipped]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:69
STEP: Creating a kubernetes client
Mar  7 14:12:39.651: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

STEP: Building a namespace api object
Mar  7 14:12:39.694: INFO: Skipping waiting for service account
[BeforeEach] Density [Skipped]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:170
[AfterEach] Density [Skipped]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:130
Mar  7 14:12:39.825: INFO: Top latency metric: {Resource:serviceaccounts Verb:PUT Latency:{Perc50:14.491ms Perc90:14.491ms Perc99:14.491ms}}
Mar  7 14:12:39.825: INFO: Top latency metric: {Resource:secrets Verb:POST Latency:{Perc50:8.098ms Perc90:8.098ms Perc99:8.098ms}}
Mar  7 14:12:39.825: INFO: Top latency metric: {Resource:serviceaccounts Verb:POST Latency:{Perc50:1.377ms Perc90:4.631ms Perc99:4.631ms}}
Mar  7 14:12:39.825: INFO: Top latency metric: {Resource:namespaces Verb:POST Latency:{Perc50:1.609ms Perc90:1.609ms Perc99:1.609ms}}
Mar  7 14:12:39.825: INFO: Top latency metric: {Resource:nodes Verb:LIST Latency:{Perc50:650µs Perc90:854µs Perc99:1.012ms}}
Mar  7 14:12:39.825: INFO: API calls latencies: {
  "apicalls": [
    {
      "resource": "serviceaccounts",
      "verb": "PUT",
      "latency": {
        "Perc50": 14491000,
        "Perc90": 14491000,
        "Perc99": 14491000
      }
    },
    {
      "resource": "secrets",
      "verb": "POST",
      "latency": {
        "Perc50": 8098000,
        "Perc90": 8098000,
        "Perc99": 8098000
      }
    },
    {
      "resource": "serviceaccounts",
      "verb": "POST",
      "latency": {
        "Perc50": 1377000,
        "Perc90": 4631000,
        "Perc99": 4631000
      }
    },
    {
      "resource": "namespaces",
      "verb": "POST",
      "latency": {
        "Perc50": 1609000,
        "Perc90": 1609000,
        "Perc99": 1609000
      }
    },
    {
      "resource": "nodes",
      "verb": "LIST",
      "latency": {
        "Perc50": 650000,
        "Perc90": 854000,
        "Perc99": 1012000
      }
    },
    {
      "resource": "pods",
      "verb": "LIST",
      "latency": {
        "Perc50": 168000,
        "Perc90": 860000,
        "Perc99": 860000
      }
    },
    {
      "resource": "horizontalpodautoscalers",
      "verb": "LIST",
      "latency": {
        "Perc50": 515000,
        "Perc90": 661000,
        "Perc99": 661000
      }
    },
    {
      "resource": "serviceaccounts",
      "verb": "GET",
      "latency": {
        "Perc50": 452000,
        "Perc90": 452000,
        "Perc99": 452000
      }
    },
    {
      "resource": "resourcequotas",
      "verb": "LIST",
      "latency": {
        "Perc50": 410000,
        "Perc90": 410000,
        "Perc99": 410000
      }
    },
    {
      "resource": "serviceaccounts",
      "verb": "LIST",
      "latency": {
        "Perc50": 162000,
        "Perc90": 357000,
        "Perc99": 357000
      }
    },
    {
      "resource": "secrets",
      "verb": "LIST",
      "latency": {
        "Perc50": 279000,
        "Perc90": 279000,
        "Perc99": 279000
      }
    },
    {
      "resource": "replicationcontrollers",
      "verb": "LIST",
      "latency": {
        "Perc50": 226000,
        "Perc90": 271000,
        "Perc99": 271000
      }
    },
    {
      "resource": "namespaces",
      "verb": "LIST",
      "latency": {
        "Perc50": 270000,
        "Perc90": 270000,
        "Perc99": 270000
      }
    },
    {
      "resource": "services",
      "verb": "LIST",
      "latency": {
        "Perc50": 208000,
        "Perc90": 232000,
        "Perc99": 232000
      }
    },
    {
      "resource": "jobs",
      "verb": "LIST",
      "latency": {
        "Perc50": 134000,
        "Perc90": 201000,
        "Perc99": 201000
      }
    },
    {
      "resource": "persistentvolumeclaims",
      "verb": "LIST",
      "latency": {
        "Perc50": 170000,
        "Perc90": 170000,
        "Perc99": 170000
      }
    },
    {
      "resource": "persistentvolumes",
      "verb": "LIST",
      "latency": {
        "Perc50": 166000,
        "Perc90": 167000,
        "Perc99": 167000
      }
    }
  ]
}

Mar  7 14:12:40.305: INFO: Scheduling latency: {
  "Scheduling": {
    "Perc50": 0,
    "Perc90": 0,
    "Perc99": 0
  },
  "binding": {
    "Perc50": 0,
    "Perc90": 0,
    "Perc99": 0
  },
  "total": {
    "Perc50": 0,
    "Perc90": 0,
    "Perc99": 0
  }
}

[AfterEach] Density [Skipped]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
STEP: Collecting events from namespace "e2e-tests-density-fors9".
Mar  7 14:12:40.386: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Mar  7 14:12:40.386: INFO: 
Mar  7 14:12:40.424: INFO: Waiting up to 1m0s for all nodes to be ready
STEP: Destroying namespace "e2e-tests-density-fors9" for this suite.

• Failure in Spec Setup (BeforeEach) [5.968 seconds]
Density [Skipped]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:465
  [Performance] should allow starting 30 pods per node [BeforeEach]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:463

  Expected
      <int>: 0
  not to be zero-valued

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:144
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

Summarizing 1 Failure:

[Fail] Density [Skipped] [BeforeEach] [Performance] should allow starting 30 pods per node 
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:144

Ran 1 of 221 Specs in 6.183 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 220 Skipped --- FAIL: TestE2E (6.21s)
FAIL

Ginkgo ran 1 suite in 6.557506861s
Test Suite Failed
!!! Error in test/kubemark/../../cluster/kubemark/../../cluster/gce/../../cluster/../hack/ginkgo-e2e.sh:96
  '"${ginkgo}" "${ginkgo_args[@]:+${ginkgo_args[@]}}" "${e2e_test}" -- "${auth_config[@]:+${auth_config[@]}}" --host="${KUBE_MASTER_URL}" --provider="${KUBERNETES_PROVIDER}" --gce-project="${PROJECT:-}" --gce-zone="${ZONE:-}" --gce-service-account="${GCE_SERVICE_ACCOUNT:-}" --gke-cluster="${CLUSTER_NAME:-}" --kube-master="${KUBE_MASTER:-}" --cluster-tag="${CLUSTER_ID:-}" --repo-root="${KUBE_VERSION_ROOT}" --node-instance-group="${NODE_INSTANCE_GROUP:-}" --num-nodes="${NUM_NODES:-}" --prefix="${KUBE_GCE_INSTANCE_PREFIX:-e2e}" ${E2E_CLEAN_START:+"--clean-start=true"} ${E2E_MIN_STARTUP_PODS:+"--minStartupPods=${E2E_MIN_STARTUP_PODS}"} ${E2E_REPORT_DIR:+"--report-dir=${E2E_REPORT_DIR}"} "${@:-}"' exited with status 1
Call stack:
  1: test/kubemark/../../cluster/kubemark/../../cluster/gce/../../cluster/../hack/ginkgo-e2e.sh:96 main(...)
Exiting with status 1
$

I will try to do the same thing over again and see if I get a different result.

/cc @wojtek-t @gmarek

rsc commented 8 years ago

Doing the same thing over again at the same git revision produced the same result.

Trying kubernetes 492c5d6 instead. Roughly the same. The curl chatter is gone but some machine still can't connect to localhost:8080.

$ NUM_NODES=100 MASTER_SIZE=n1-standard-4 ./test/kubemark/start-kubemark.sh
docker build -t gcr.io/kubertest2/kubemark .
Sending build context to Docker daemon 60.17 MB
Step 1 : FROM debian:jessie
 ---> 040bf8e08425
Step 2 : COPY kubemark.sh /kubemark.sh
 ---> Using cache
 ---> 0f42238a0dce
Step 3 : RUN chmod a+x /kubemark.sh
 ---> Using cache
 ---> 4db91651de2f
Step 4 : COPY kubemark /kubemark
 ---> 5d66b160904e
Removing intermediate container c3da4ebcdcba
Successfully built 5d66b160904e
gcloud docker push gcr.io/kubertest2/kubemark
The push refers to a repository [gcr.io/kubertest2/kubemark] (len: 1)
5d66b160904e: Pushed 
4db91651de2f: Image already exists 
0f42238a0dce: Image already exists 
040bf8e08425: Image already exists 
73e8d4f6bf84: Image already exists 
latest: digest: sha256:9108c95f86a8e4fe48bbc4010f304fd7090bd44ee8a8519f187c9687fa419d02 size: 7441
Created [https://www.googleapis.com/compute/v1/projects/kubertest2/zones/us-central1-b/disks/kubernetes-kubemark-master-pd].
NAME                          ZONE          SIZE_GB TYPE   STATUS
kubernetes-kubemark-master-pd us-central1-b 20      pd-ssd READY
Created [https://www.googleapis.com/compute/v1/projects/kubertest2/zones/us-central1-b/instances/kubernetes-kubemark-master].
NAME                       ZONE          MACHINE_TYPE  PREEMPTIBLE INTERNAL_IP EXTERNAL_IP    STATUS
kubernetes-kubemark-master us-central1-b n1-standard-4             10.128.0.2  23.251.151.145 RUNNING
Created [https://www.googleapis.com/compute/v1/projects/kubertest2/global/firewalls/kubernetes-kubemark-master-https].
NAME                             NETWORK SRC_RANGES RULES   SRC_TAGS TARGET_TAGS
kubernetes-kubemark-master-https default 0.0.0.0/0  tcp:443          kubemark-master
Generating certs for alternate-names: IP:23.251.151.145,IP:10.0.0.1,DNS:kubernetes,DNS:kubernetes.default,DNS:kubernetes.default.svc,DNS:kubernetes.default.svc.cluster.local,DNS:kubernetes-kubemark-master
Warning: Permanently added '23.251.151.145' (ECDSA) to the list of known hosts.
Warning: Permanently added '23.251.151.145' (ECDSA) to the list of known hosts.
kubernetes-server-linux-amd64.tar.gz                                                                                                      100%  239MB  47.8MB/s   00:05    
start-kubemark-master.sh                                                                                                                  100% 2398     2.3KB/s   00:00    
configure-kubectl.sh                                                                                                                      100% 1016     1.0KB/s   00:00    
Warning: Permanently added '23.251.151.145' (ECDSA) to the list of known hosts.
Unable to find image 'gcr.io/google_containers/etcd:2.2.1' locally
Pulling repository gcr.io/google_containers/etcd
fbea2d67e633: Pulling image (2.2.1) from gcr.io/google_containers/etcd
fbea2d67e633: Pulling image (2.2.1) from gcr.io/google_containers/etcd, endpoint: https://gcr.io/v1/
fbea2d67e633: Pulling dependent layers
bf0f46991aed: Pulling metadata
bf0f46991aed: Pulling fs layer
bf0f46991aed: Download complete
3d5bcd78e074: Pulling metadata
3d5bcd78e074: Pulling fs layer
3d5bcd78e074: Download complete
5673765ccce5: Pulling metadata
5673765ccce5: Pulling fs layer
5673765ccce5: Download complete
d5bf32328e50: Pulling metadata
d5bf32328e50: Pulling fs layer
d5bf32328e50: Download complete
fbea2d67e633: Pulling metadata
fbea2d67e633: Pulling fs layer
fbea2d67e633: Download complete
fbea2d67e633: Download complete
Status: Downloaded newer image for gcr.io/google_containers/etcd:2.2.1
gcr.io/google_containers/etcd: this image was pulled from a legacy registry.  Important: This registry version will not be supported in future versions of docker.
8b0521c086fd6f7fbe2c2cb97e1c988a63b7948b92ae2f0eb4f7dae19da36301
bcbd7ee229801370dd0611fe04ff74cff68f696aec515b0d12dced21a8e427cb
The connection to the server localhost:8080 was refused - did you specify the right host or port?
$

and then run-e2e-tests.sh gives a more understandable message:

$ test/kubemark/run-e2e-tests.sh
Kubemark master name: kubernetes-kubemark-master
Your active configuration is: [default]

Project: kubertest2
Zone: us-central1-b
Using master: kubernetes-kubemark-master (external IP: 23.251.151.145)
2016/03/07 14:41:49 e2e.go:194: Running: get status
Your active configuration is: [default]

Project: kubertest2
Zone: us-central1-b
Client Version: version.Info{Major:"1", Minor:"3+", GitVersion:"v1.3.0-alpha.0.194+492c5d68fbedea-dirty", GitCommit:"492c5d68fbedea34daf94b6ee4d18e20286bf722", GitTreeState:"dirty"}
Server Version: version.Info{Major:"1", Minor:"3+", GitVersion:"v1.3.0-alpha.0.194+492c5d68fbedea-dirty", GitCommit:"492c5d68fbedea34daf94b6ee4d18e20286bf722", GitTreeState:"dirty"}
2016/03/07 14:41:50 e2e.go:196: Step 'get status' finished in 792.333976ms
Your active configuration is: [default]

Project: kubertest2
Zone: us-central1-b
2016/03/07 14:41:51 e2e.go:170: Cluster size (0) is too small to run e2e tests.  2 Nodes are required.
exit status 1
$

Help?

wojtek-t commented 8 years ago

@rsc - sorry it's late evening in Poland already - let me answer all questions above tomorrow morning Poland time

ixdy commented 8 years ago

@rsc it's only mentioned in passing in the kubemark doc, but did you start a "regular" Kubernetes cluster first as noted in https://github.com/golang/go/issues/14396#issuecomment-193373127?

ixdy commented 8 years ago

You shouldn't need to edit cluster/gce/config-test.sh though; setting NUM_NODES and NODE_SIZE in the environment should be sufficient.
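In other words, something like this should be enough for the external cluster (a sketch using the same sizes mentioned above):

NUM_NODES=3 NODE_SIZE=n1-standard-4 go run hack/e2e.go -v -up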

rsc commented 8 years ago

@ixdy, thanks, I did not understand that comment the first time. Creating that other cluster does seem to have gotten the e2e tests going.

rsc commented 8 years ago

OK, both at master and at dff7490, I now get further, but I don't think the test is working.

I've gotten the command sequence down to:

git checkout dff7490c57f309
make clean
make quick-release
go run hack/e2e.go -v -down
NUM_NODES=3 NODE_SIZE=n1-standard-4 go run hack/e2e.go -v -up
NUM_NODES=100 MASTER_SIZE=n1-standard-4 ./test/kubemark/start-kubemark.sh
test/kubemark/run-e2e-tests.sh --delete-namespace=false

and then that last command prints:

$ test/kubemark/run-e2e-tests.sh --delete-namespace=false
Kubemark master name: kubernetes-kubemark-master
Your active configuration is: [default]

Project: kubertest2
Zone: us-central1-b
Using master: kubernetes-kubemark-master (external IP: 104.154.20.128)
Setting up for KUBERNETES_PROVIDER="kubemark".
Your active configuration is: [default]

Project: kubertest2
Zone: us-central1-b
I0307 15:58:56.266669   30705 e2e.go:237] Starting e2e run "681e73a2-e4a7-11e5-aeec-8cdcd443ecd8" on Ginkgo node 1
Running Suite: Kubernetes e2e suite
===================================
Random Seed: 1457384335 - Will randomize all specs
Will run 153 of 221 specs

Mar  7 15:58:56.283: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

Mar  7 15:58:56.287: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Mar  7 15:58:56.487: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Mar  7 15:58:56.487: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Kubectl client Kubectl describe 
  should check if kubectl describe prints relevant information for rc and pods [Conformance]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:618
[BeforeEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:69
STEP: Creating a kubernetes client
Mar  7 15:58:56.487: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

STEP: Building a namespace api object
Mar  7 15:58:56.531: INFO: Skipping waiting for service account
[BeforeEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:86
[It] should check if kubectl describe prints relevant information for rc and pods [Conformance]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:618
Mar  7 15:58:56.531: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc create -f /usr/local/google/home/rsc/kubernetes/examples/guestbook-go/redis-master-controller.json --namespace=e2e-tests-kubectl-jw96o'
Mar  7 15:58:56.885: INFO: stdout: "replicationcontroller \"redis-master\" created\n"
Mar  7 15:58:56.885: INFO: stderr: ""
Mar  7 15:58:56.885: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc create -f /usr/local/google/home/rsc/kubernetes/examples/guestbook-go/redis-master-service.json --namespace=e2e-tests-kubectl-jw96o'
Mar  7 15:58:57.243: INFO: stdout: "service \"redis-master\" created\n"
Mar  7 15:58:57.243: INFO: stderr: ""
Mar  7 15:58:57.287: INFO: Waiting up to 5m0s for pod redis-master-ugeq9 status to be running
Mar  7 15:58:57.327: INFO: Waiting for pod redis-master-ugeq9 in namespace 'e2e-tests-kubectl-jw96o' status to be 'running'(found phase: "Pending", readiness: false) (39.138172ms elapsed)
Mar  7 15:58:59.366: INFO: Found pod 'redis-master-ugeq9' on node '10.245.2.28'
Mar  7 15:58:59.366: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc describe pod redis-master-ugeq9 --namespace=e2e-tests-kubectl-jw96o'
Mar  7 15:58:59.755: INFO: stdout: "Name:\t\tredis-master-ugeq9\nNamespace:\te2e-tests-kubectl-jw96o\nImage(s):\tredis\nNode:\t\t10.245.2.28/10.245.2.28\nStart Time:\tMon, 07 Mar 2016 15:58:56 -0500\nLabels:\t\tapp=redis,role=master\nStatus:\t\tRunning\nReason:\t\t\nMessage:\t\nIP:\t\t2.3.4.5\nControllers:\tReplicationController/redis-master\nContainers:\n  redis-master:\n    Container ID:\tdocker:///k8s_redis-master.45360489_redis-master-ugeq9_e2e-tests-kubectl-jw96o_6881c31f-e4a7-11e5-84b2-42010a800002_516c42b0\n    Image:\t\tredis\n    Image ID:\t\tdocker://\n    QoS Tier:\n      cpu:\t\tBestEffort\n      memory:\t\tBestEffort\n    State:\t\tRunning\n      Started:\t\tMon, 07 Mar 2016 15:58:57 -0500\n    Ready:\t\tTrue\n    Restart Count:\t0\n    Environment Variables:\nConditions:\n  Type\t\tStatus\n  Ready \tTrue \nNo volumes.\nEvents:\n  FirstSeen\tLastSeen\tCount\tFrom\t\t\tSubobjectPath\tType\t\tReason\t\tMessage\n  ---------\t--------\t-----\t----\t\t\t-------------\t--------\t------\t\t-------\n  3s\t\t3s\t\t1\t{default-scheduler }\t\t\tNormal\t\tScheduled\tSuccessfully assigned redis-master-ugeq9 to 10.245.2.28\n\n\n"
Mar  7 15:58:59.755: INFO: stderr: ""
Mar  7 15:58:59.756: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc describe rc redis-master --namespace=e2e-tests-kubectl-jw96o'
Mar  7 15:59:00.129: INFO: stdout: "Name:\t\tredis-master\nNamespace:\te2e-tests-kubectl-jw96o\nImage(s):\tredis\nSelector:\tapp=redis,role=master\nLabels:\t\tapp=redis,role=master\nReplicas:\t1 current / 1 desired\nPods Status:\t1 Running / 0 Waiting / 0 Succeeded / 0 Failed\nNo volumes.\nEvents:\n  FirstSeen\tLastSeen\tCount\tFrom\t\t\t\tSubobjectPath\tType\t\tReason\t\t\tMessage\n  ---------\t--------\t-----\t----\t\t\t\t-------------\t--------\t------\t\t\t-------\n  4s\t\t4s\t\t1\t{replication-controller }\t\t\tNormal\t\tSuccessfulCreate\tCreated pod: redis-master-ugeq9\n\n\n"
Mar  7 15:59:00.129: INFO: stderr: ""
Mar  7 15:59:00.129: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc describe service redis-master --namespace=e2e-tests-kubectl-jw96o'
Mar  7 15:59:00.516: INFO: stdout: "Name:\t\t\tredis-master\nNamespace:\t\te2e-tests-kubectl-jw96o\nLabels:\t\t\tapp=redis,role=master\nSelector:\t\tapp=redis,role=master\nType:\t\t\tClusterIP\nIP:\t\t\t10.0.0.212\nPort:\t\t\t<unnamed>\t6379/TCP\nEndpoints:\t\t2.3.4.5:6379\nSession Affinity:\tNone\nNo events.\n\n"
Mar  7 15:59:00.516: INFO: stderr: ""
Mar  7 15:59:00.697: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc describe node 10.245.0.10'
Mar  7 15:59:01.088: INFO: stdout: "Name:\t\t\t10.245.0.10\nLabels:\t\t\tkubernetes.io/hostname=10.245.0.10\nCreationTimestamp:\tMon, 07 Mar 2016 15:57:55 -0500\nPhase:\t\t\t\nConditions:\n  Type\t\tStatus\tLastHeartbeatTime\t\t\tLastTransitionTime\t\t\tReason\t\t\t\tMessage\n  ----\t\t------\t-----------------\t\t\t------------------\t\t\t------\t\t\t\t-------\n  OutOfDisk \tFalse \tMon, 07 Mar 2016 15:58:55 -0500 \tMon, 07 Mar 2016 15:57:55 -0500 \tKubeletHasSufficientDisk \tkubelet has sufficient disk space available\n  Ready \tTrue \tMon, 07 Mar 2016 15:58:55 -0500 \tMon, 07 Mar 2016 15:57:55 -0500 \tKubeletReady \t\t\tkubelet is posting ready status\nAddresses:\t10.245.0.10,10.245.0.10\nCapacity:\n memory:\t0\n pods:\t\t40\n cpu:\t\t0\nSystem Info:\n Machine ID:\t\t\t\n System UUID:\t\t\t\n Boot ID:\t\t\t\n Kernel Version:\t\t\n OS Image:\t\t\t\n Container Runtime Version:\tdocker://1.1.3\n Kubelet Version:\t\tv1.2.0-alpha.6.909+dff7490c57f309\n Kube-Proxy Version:\t\tv1.2.0-alpha.6.909+dff7490c57f309\nExternalID:\t\t\t10.245.0.10\nNon-terminated Pods:\t\t(0 in total)\n  Namespace\t\t\tName\t\tCPU Requests\tCPU Limits\tMemory Requests\tMemory Limits\n  ---------\t\t\t----\t\t------------\t----------\t---------------\t-------------\nAllocated resources:\n  (Total limits may be over 100%, i.e., overcommitted. More info: http://releases.k8s.io/HEAD/docs/user-guide/compute-resources.md)\n  CPU Requests\t\t\tCPU Limits\t\t\tMemory Requests\t\t\tMemory Limits\n  ------------\t\t\t----------\t\t\t---------------\t\t\t-------------\n  0 (-9223372036854775808%)\t0 (-9223372036854775808%)\t0 (-9223372036854775808%)\t0 (-9223372036854775808%)\nEvents:\n  FirstSeen\tLastSeen\tCount\tFrom\t\t\t\tSubobjectPath\tType\t\tReason\t\tMessage\n  ---------\t--------\t-----\t----\t\t\t\t-------------\t--------\t------\t\t-------\n  1m\t\t1m\t\t1\t{kube-proxy 10.245.0.10}\t\t\tNormal\t\tStarting\tStarting kube-proxy.\n  1m\t\t1m\t\t1\t{controllermanager }\t\t\t\tNormal\t\tRegisteredNode\tNode 10.245.0.10 event: Registered Node 10.245.0.10 in NodeController\n\n\n"
Mar  7 15:59:01.088: INFO: stderr: ""
Mar  7 15:59:01.088: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc describe namespace e2e-tests-kubectl-jw96o'
Mar  7 15:59:01.462: INFO: stdout: "Name:\te2e-tests-kubectl-jw96o\nLabels:\te2e-framework=kubectl,e2e-run=681e73a2-e4a7-11e5-aeec-8cdcd443ecd8\nStatus:\tActive\n\nNo resource quota.\n\nNo resource limits.\n\n\n"
Mar  7 15:59:01.462: INFO: stderr: ""
[AfterEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
Mar  7 15:59:01.462: INFO: Waiting up to 1m0s for all nodes to be ready
Mar  7 15:59:01.648: INFO: Found DeleteNamespace=false, skipping namespace deletion!

• [SLOW TEST:5.161 seconds]
Kubectl client
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:995
  Kubectl describe
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:619
    should check if kubectl describe prints relevant information for rc and pods [Conformance]
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:618
------------------------------
Addon update 
  should propagate add-on file changes
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/addon_update.go:322
[BeforeEach] Addon update
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:69
STEP: Creating a kubernetes client
Mar  7 15:59:01.648: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

STEP: Building a namespace api object
Mar  7 15:59:01.695: INFO: Skipping waiting for service account
[BeforeEach] Addon update
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/addon_update.go:214
[It] should propagate add-on file changes
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/addon_update.go:322
[AfterEach] Addon update
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
Mar  7 15:59:01.696: INFO: Waiting up to 1m0s for all nodes to be ready
Mar  7 15:59:01.803: INFO: Found DeleteNamespace=false, skipping namespace deletion!
[AfterEach] Addon update
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/addon_update.go:222

S [SKIPPING] [0.155 seconds]
Addon update
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/addon_update.go:323
  should propagate add-on file changes [It]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/addon_update.go:322

  Mar  7 15:59:01.695: Only supported for providers [gce] (not kubemark)

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/util.go:263
------------------------------
SSS
------------------------------
Kubectl client Kubectl apply 
  should apply a new configuration to an existing RC
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:505
[BeforeEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:69
STEP: Creating a kubernetes client
Mar  7 15:59:01.803: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

STEP: Building a namespace api object
Mar  7 15:59:01.843: INFO: Skipping waiting for service account
[BeforeEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:86
[It] should apply a new configuration to an existing RC
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:505
STEP: creating Redis RC
Mar  7 15:59:01.843: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc create -f /usr/local/google/home/rsc/kubernetes/examples/guestbook-go/redis-master-controller.json --namespace=e2e-tests-kubectl-kjl0q'
Mar  7 15:59:02.190: INFO: stdout: "replicationcontroller \"redis-master\" created\n"
Mar  7 15:59:02.190: INFO: stderr: ""
STEP: applying a modified configuration
Mar  7 15:59:02.193: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc apply -f - --namespace=e2e-tests-kubectl-kjl0q'
Mar  7 15:59:02.582: INFO: stdout: "replicationcontroller \"redis-master\" configured\n"
Mar  7 15:59:02.582: INFO: stderr: ""
STEP: checking the result
[AfterEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
Mar  7 15:59:02.621: INFO: Waiting up to 1m0s for all nodes to be ready
Mar  7 15:59:02.762: INFO: Found DeleteNamespace=false, skipping namespace deletion!
•
------------------------------
Kubectl client Proxy server 
  should support --unix-socket=/path [Conformance]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:993
[BeforeEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:69
STEP: Creating a kubernetes client
Mar  7 15:59:02.763: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

STEP: Building a namespace api object
Mar  7 15:59:02.803: INFO: Skipping waiting for service account
[BeforeEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:86
[It] should support --unix-socket=/path [Conformance]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:993
STEP: Starting the proxy
Mar  7 15:59:02.803: INFO: Asynchronously running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc proxy --unix-socket=/tmp/kubectl-proxy-unix513838643/test'
STEP: retrieving proxy /api/ output
[AfterEach] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
Mar  7 15:59:03.027: INFO: Waiting up to 1m0s for all nodes to be ready
Mar  7 15:59:03.137: INFO: Found DeleteNamespace=false, skipping namespace deletion!
•
------------------------------
MetricsGrabber 
  should grab all metrics from a Kubelet.
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/metrics_grabber_test.go:112
[BeforeEach] MetricsGrabber
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:69
STEP: Creating a kubernetes client
Mar  7 15:59:03.137: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

STEP: Building a namespace api object
Mar  7 15:59:03.177: INFO: Skipping waiting for service account
[BeforeEach] MetricsGrabber
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/metrics_grabber_test.go:90
W0307 15:59:03.251062   30705 metrics_grabber.go:74] Master node is not registered. Grabbing metrics from Scheduler and ControllerManager is disabled.
[It] should grab all metrics from a Kubelet.
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/metrics_grabber_test.go:112
[AfterEach] MetricsGrabber
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
Mar  7 15:59:03.251: INFO: Waiting up to 1m0s for all nodes to be ready
Mar  7 15:59:03.321: INFO: Found DeleteNamespace=false, skipping namespace deletion!
•
------------------------------
Port forwarding With a server that expects no client request 
  should support a client that connects, sends no data, and disconnects [Conformance]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/portforward.go:238
[BeforeEach] Port forwarding
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:69
STEP: Creating a kubernetes client
Mar  7 15:59:03.321: INFO: >>> testContext.KubeConfig: /usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc

STEP: Building a namespace api object
Mar  7 15:59:03.366: INFO: Skipping waiting for service account
[It] should support a client that connects, sends no data, and disconnects [Conformance]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/portforward.go:238
STEP: creating the target pod
Mar  7 15:59:03.431: INFO: Waiting up to 5m0s for pod pfpod status to be running
Mar  7 15:59:03.469: INFO: Waiting for pod pfpod in namespace 'e2e-tests-port-forwarding-ad7sb' status to be 'running'(found phase: "Pending", readiness: false) (38.627068ms elapsed)
Mar  7 15:59:05.508: INFO: Found pod 'pfpod' on node '10.245.2.10'
STEP: Running 'kubectl port-forward'
Mar  7 15:59:05.508: INFO: starting port-forward command and streaming output
Mar  7 15:59:05.508: INFO: Asynchronously running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc port-forward --namespace=e2e-tests-port-forwarding-ad7sb pfpod :80'
Mar  7 15:59:05.510: INFO: reading from `kubectl port-forward` command's stderr
[AfterEach] Port forwarding
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
STEP: Collecting events from namespace "e2e-tests-port-forwarding-ad7sb".
Mar  7 15:59:36.053: INFO: event for pfpod: {default-scheduler } Scheduled: Successfully assigned pfpod to 10.245.2.10
Mar  7 15:59:36.093: INFO: POD                 NODE         PHASE    GRACE  CONDITIONS
Mar  7 15:59:36.094: INFO: redis-master-ugeq9  10.245.2.28  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:58:58 -0500 EST  }]
Mar  7 15:59:36.094: INFO: redis-master-p2zsc  10.245.0.29  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:59:03 -0500 EST  }]
Mar  7 15:59:36.094: INFO: redis-master-tltco  10.245.0.14  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:59:03 -0500 EST  }]
Mar  7 15:59:36.094: INFO: pfpod               10.245.2.10  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:59:04 -0500 EST  }]
Mar  7 15:59:36.094: INFO: 
Mar  7 15:59:36.277: INFO: 
Logging node info for node 10.245.0.10
Mar  7 15:59:36.316: INFO: Node Info: &{{ } {10.245.0.10   /api/v1/nodes/10.245.0.10 4417323c-e4a7-11e5-84b2-42010a800002 1274 0 2016-03-07 15:57:55 -0500 EST <nil> <nil> map[kubernetes.io/hostname:10.245.0.10] map[]} { 10.245.0.10  false} {map[cpu:{0.000 DecimalSI} memory:{0.000 DecimalSI} pods:{40.000 DecimalSI}] map[memory:{0.000 DecimalSI} pods:{40.000 DecimalSI} cpu:{0.000 DecimalSI}]  [{OutOfDisk False 2016-03-07 15:59:35 -0500 EST 2016-03-07 15:57:55 -0500 EST KubeletHasSufficientDisk kubelet has sufficient disk space available} {Ready True 2016-03-07 15:59:35 -0500 EST 2016-03-07 15:57:55 -0500 EST KubeletReady kubelet is posting ready status}] [{LegacyHostIP 10.245.0.10} {InternalIP 10.245.0.10}] {{10250}} {     docker://1.1.3 v1.2.0-alpha.6.909+dff7490c57f309 v1.2.0-alpha.6.909+dff7490c57f309} []}}
Mar  7 15:59:36.316: INFO: 
Logging kubelet events for node 10.245.0.10
Mar  7 15:59:36.354: INFO: 
Logging pods the kubelet thinks is on node 10.245.0.10
Mar  7 16:00:06.393: INFO: Unable to retrieve kubelet pods for node 10.245.0.10
Mar  7 16:00:06.393: INFO: 
Logging node info for node 10.245.0.11
Mar  7 16:00:06.432: INFO: Node Info: &{{ } {10.245.0.11   /api/v1/nodes/10.245.0.11 40fa9f2f-e4a7-11e5-84b2-42010a800002 1525 0 2016-03-07 15:57:50 -0500 EST <nil> <nil> map[kubernetes.io/hostname:10.245.0.11] map[]} { 10.245.0.11  false} {map[pods:{40.000 DecimalSI} cpu:{0.000 DecimalSI} memory:{0.000 DecimalSI}] map[cpu:{0.000 DecimalSI} memory:{0.000 DecimalSI} pods:{40.000 DecimalSI}]  [{OutOfDisk False 2016-03-07 16:00:00 -0500 EST 2016-03-07 15:57:50 -0500 EST KubeletHasSufficientDisk kubelet has sufficient disk space available} {Ready True 2016-03-07 16:00:00 -0500 EST 2016-03-07 15:57:50 -0500 EST KubeletReady kubelet is posting ready status}] [{LegacyHostIP 10.245.0.11} {InternalIP 10.245.0.11}] {{10250}} {     docker://1.1.3 v1.2.0-alpha.6.909+dff7490c57f309 v1.2.0-alpha.6.909+dff7490c57f309} []}}
Mar  7 16:00:06.432: INFO: 
Logging kubelet events for node 10.245.0.11
Mar  7 16:00:06.470: INFO: 
Logging pods the kubelet thinks is on node 10.245.0.11
Mar  7 16:00:36.509: INFO: Unable to retrieve kubelet pods for node 10.245.0.11
...

where each stanza sits for a little while and then says "Unable to retrieve kubelet pods for node XXX". I am guessing this is not good. I let it run for quite a while once and then interrupted it. I can let it run longer next time if this is actually OK. When I interrupt it I get:

Logging pods the kubelet thinks is on node 10.245.0.14
^C
---------------------------------------------------------
Received interrupt.  Running AfterSuite...
^C again to terminate immediately

Ran 152 of 221 Specs in 188.326 seconds
FAIL! -- 4 Passed | 1 Failed | 2 Pending | 67 Skipped 
Ginkgo ran 1 suite in 3m8.672947611s
Test Suite Failed

Am I making progress?

gmarek commented 8 years ago

Yes - you're really close. If you're passing any flags (--delete-namespace=false) to the run-test command it overrides all the default flags (I should finally fix this). In that case you should also pass the --ginkgo.focus="starting 30 pods" flag to run only the Density test (or --ginkgo.focus="\[Feature:Performance\]" for both Density and Load tests)
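Concretely, that would be something like (a sketch; the focus strings are the ones above):

# Density only
test/kubemark/run-e2e-tests.sh --delete-namespace=false --ginkgo.focus="starting 30 pods"
# or both Density and Load
test/kubemark/run-e2e-tests.sh --delete-namespace=false --ginkgo.focus="\[Feature:Performance\]"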

rsc commented 8 years ago

OK, without the --delete-namespace=false, I get some latency numbers before it goes into the apparently infinite "Unable to retrieve kubelet pods ..." loop.

It looks like this is the one that was reported as high latency in the second comment on kubernetes/kubernetes#20400:

Mar  8 15:23:18.442: INFO: API calls latencies: {
  "apicalls": [
    {
      "resource": "nodes",
      "verb": "LIST",
      "latency": {
        "Perc50": 12624000,
        "Perc90": 18181000,
        "Perc99": 82431000
      }
    },

and then after more latencies I get the usual painful pod loop, taking 30 seconds per node:

[AfterEach] Density [Skipped]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
STEP: Collecting events from namespace "e2e-tests-density-7oooq".
Mar  8 15:23:19.034: INFO: POD                 NODE         PHASE    GRACE  CONDITIONS
Mar  8 15:23:19.034: INFO: redis-master-ugeq9  10.245.2.28  Running         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:58:58 -0500 EST  }]
Mar  8 15:23:19.034: INFO: redis-master-ik3ua               Pending         []
Mar  8 15:23:19.034: INFO: redis-master-p2zsc  10.245.0.29  Running  30s    [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:59:03 -0500 EST  }]
Mar  8 15:23:19.034: INFO: pfpod               10.245.2.10  Running         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:59:04 -0500 EST  }]
Mar  8 15:23:19.034: INFO: 
Mar  8 15:23:19.136: INFO: 
Logging node info for node 10.245.0.10
Mar  8 15:23:19.174: INFO: Node Info: &{{ } {10.245.0.10   /api/v1/nodes/10.245.0.10 4417323c-e4a7-11e5-84b2-42010a800002 838208 0 2016-03-07 15:57:55 -0500 EST <nil> <nil> map[kubernetes.io/hostname:10.245.0.10] map[]} { 10.245.0.10  false} {map[memory:{0.000 DecimalSI} pods:{40.000 DecimalSI} cpu:{0.000 DecimalSI}] map[pods:{40.000 DecimalSI} cpu:{0.000 DecimalSI} memory:{0.000 DecimalSI}]  [{OutOfDisk Unknown 2016-03-08 15:15:38 -0500 EST 2016-03-08 15:16:32 -0500 EST NodeStatusUnknown Kubelet stopped posting node status.} {Ready Unknown 2016-03-08 15:15:38 -0500 EST 2016-03-08 15:16:32 -0500 EST NodeStatusUnknown Kubelet stopped posting node status.}] [{LegacyHostIP 10.245.0.10} {InternalIP 10.245.0.10}] {{10250}} {     docker://1.1.3 v1.2.0-alpha.6.909+dff7490c57f309 v1.2.0-alpha.6.909+dff7490c57f309} []}}
Mar  8 15:23:19.175: INFO: 
Logging kubelet events for node 10.245.0.10
Mar  8 15:23:19.212: INFO: 
Logging pods the kubelet thinks is on node 10.245.0.10
Mar  8 15:23:49.251: INFO: Unable to retrieve kubelet pods for node 10.245.0.10
Mar  8 15:23:49.251: INFO: 
Logging node info for node 10.245.0.11
Mar  8 15:23:49.290: INFO: Node Info: &{{ } {10.245.0.11   /api/v1/nodes/10.245.0.11 40fa9f2f-e4a7-11e5-84b2-42010a800002 838132 0 2016-03-07 15:57:50 -0500 EST <nil> <nil> map[kubernetes.io/hostname:10.245.0.11] map[]} { 10.245.0.11  false} {map[pods:{40.000 DecimalSI} cpu:{0.000 DecimalSI} memory:{0.000 DecimalSI}] map[memory:{0.000 DecimalSI} pods:{40.000 DecimalSI} cpu:{0.000 DecimalSI}]  [{OutOfDisk Unknown 2016-03-08 15:15:34 -0500 EST 2016-03-08 15:16:18 -0500 EST NodeStatusUnknown Kubelet stopped posting node status.} {Ready Unknown 2016-03-08 15:15:34 -0500 EST 2016-03-08 15:16:18 -0500 EST NodeStatusUnknown Kubelet stopped posting node status.}] [{LegacyHostIP 10.245.0.11} {InternalIP 10.245.0.11}] {{10250}} {     docker://1.1.3 v1.2.0-alpha.6.909+dff7490c57f309 v1.2.0-alpha.6.909+dff7490c57f309} []}}
Mar  8 15:23:49.291: INFO: 
Logging kubelet events for node 10.245.0.11
Mar  8 15:23:49.328: INFO: 
Logging pods the kubelet thinks is on node 10.245.0.11
Mar  8 15:24:19.368: INFO: Unable to retrieve kubelet pods for node 10.245.0.11
Mar  8 15:24:19.368: INFO: 
Logging node info for node 10.245.0.12
Mar  8 15:24:19.406: INFO: Node Info: &{{ } {10.245.0.12   /api/v1/nodes/10.245.0.12 42996541-e4a7-11e5-84b2-42010a800002 838209 0 2016-03-07 15:57:53 -0500 EST <nil> <nil> map[kubernetes.io/hostname:10.245.0.12] map[]} { 10.245.0.12  false} {map[pods:{40.000 DecimalSI} cpu:{0.000 DecimalSI} memory:{0.000 DecimalSI}] map[cpu:{0.000 DecimalSI} memory:{0.000 DecimalSI} pods:{40.000 DecimalSI}]  [{OutOfDisk Unknown 2016-03-08 15:15:38 -0500 EST 2016-03-08 15:16:32 -0500 EST NodeStatusUnknown Kubelet stopped posting node status.} {Ready Unknown 2016-03-08 15:15:38 -0500 EST 2016-03-08 15:16:32 -0500 EST NodeStatusUnknown Kubelet stopped posting node status.}] [{LegacyHostIP 10.245.0.12} {InternalIP 10.245.0.12}] {{10250}} {     docker://1.1.3 v1.2.0-alpha.6.909+dff7490c57f309 v1.2.0-alpha.6.909+dff7490c57f309} []}}

Is that part important? Can I just ^C it? Is there something useful it's doing? Is there a flag to turn it off?

wojtek-t commented 8 years ago

@rsc - thanks a lot for looking into it!

Regarding what to look at - if you are running the Density test (and judging from the logs above, you are), there are lines like these in the test output:

Feb  2 14:31:39.621: INFO: Top latency metric: {Resource:nodes Verb:LIST Latency:{Perc50:121.186ms Perc90:378.483ms Perc99:652.822ms}}
Feb  2 14:31:39.621: INFO: Top latency metric: {Resource:pods Verb:LIST Latency:{Perc50:80.953ms Perc90:155.79ms Perc99:548.367ms}}
Feb  2 14:31:39.621: INFO: Top latency metric: {Resource:componentstatuses Verb:LIST Latency:{Perc50:260.097ms Perc90:260.097ms Perc99:260.097ms}}
Feb  2 14:31:39.621: INFO: Top latency metric: {Resource:nodes Verb:PUT Latency:{Perc50:14.102ms Perc90:20.246ms Perc99:250.619ms}}
Feb  2 14:31:39.621: INFO: Top latency metric: {Resource:nodes Verb:GET Latency:{Perc50:793µs Perc90:3.118ms Perc99:173.743ms}}

[You can grep the logs for "Top latency metric".]

And one line like this:

Mar  8 22:25:20.509: INFO: perc50: 1.48257294s, perc90: 2.211355607s, perc99: 2.614765988s

Those are basically the lines that we mostly focus on.
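
For example, if you keep the run output in a file (e.g. the trylog.$REV file produced by the script later in this thread - the exact name depends on the revision argument), you can pull these lines out with something like:

grep "Top latency metric" trylog.dff7490c57f309
grep "perc50:" trylog.dff7490c57f309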

Regarding your second question - this painful loop should appear only if the test actually fails (i.e. the metrics I mentioned above exceeded some threshold we set). In that case we try to gather more logs for future debugging. From your perspective you can ^C it, but if you do, you need to clean up the cluster afterwards so that you don't have leftover rubbish during the next run. The easiest way to do that is simply: ./stop-kubemark.sh ./start-kubemark.sh [You don't need to touch the external cluster at all]. This is probably also the fastest solution (if you work at head, there is a better way to do it - let me know if I should describe it).
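
From the repository root (using the same paths as the script rsc posts below), that cleanup amounts to:

./test/kubemark/stop-kubemark.sh
./test/kubemark/start-kubemark.sh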

BTW, looking into your logs I see:

STEP: Collecting events from namespace "e2e-tests-density-7oooq".
Mar  8 15:23:19.034: INFO: POD                 NODE         PHASE    GRACE  CONDITIONS
Mar  8 15:23:19.034: INFO: redis-master-ugeq9  10.245.2.28  Running         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:58:58 -0500 EST  }]
Mar  8 15:23:19.034: INFO: redis-master-ik3ua               Pending         []
Mar  8 15:23:19.034: INFO: redis-master-p2zsc  10.245.0.29  Running  30s    [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:59:03 -0500 EST  }]
Mar  8 15:23:19.034: INFO: pfpod               10.245.2.10  Running         [{Ready False 0001-01-01 00:00:00 +0000 UTC 2016-03-07 15:59:04 -0500 EST  }]

This is really strange - because we don't start any redis containers in these tests. Did you start them manually?

rsc commented 8 years ago

@wojtek-t I've never started a redis before. No idea where those came from.

wojtek-t commented 8 years ago

@rsc that's strange. Can you please do: ./stop-kubemark.sh, ./start-kubemark.sh, and then run the tests again (you can leave the external cluster without touching it)

and check if that is repeatable? To be honest, we are running kubemark continuously (we have already had 5000+ runs) and I've never seen such a thing...

rsc commented 8 years ago

OK, I added stop-kubemark.sh to my overall running script and I will see if it continues to show redis. My script to try a particular revision is now:

#!/bin/bash
# Build kubernetes at the requested revision, recreate the clusters, and run the
# kubemark load test, appending all output to trylog.$REV.
REV=${1:-dff7490c57f309}
(
date
git checkout $REV
make clean
make quick-release
# tear down and recreate the external cluster that hosts kubemark
go run hack/e2e.go -v -down
NUM_NODES=3 NODE_SIZE=n1-standard-4 go run hack/e2e.go -v -up
# restart kubemark itself, then run the load test
NUM_NODES=100 MASTER_SIZE=n1-standard-4 ./test/kubemark/stop-kubemark.sh
NUM_NODES=100 MASTER_SIZE=n1-standard-4 ./test/kubemark/start-kubemark.sh
test/kubemark/run-e2e-tests.sh
) 2>&1 | tee -a trylog.$REV

wojtek-t commented 8 years ago

@rsc - I think that just to save your time, you can create the external cluster just once and then remove lines:

go run hack/e2e.go -v -down
NUM_NODES=3 NODE_SIZE=n1-standard-4 go run hack/e2e.go -v -up

from your script. This should save you some time and will not affect tests, because the external cluster is just an environment for the test (and its performance doesn't matter for the performance of the cluster you are testing).

But what you are doing is also fine BTW.
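
For reference, rsc's script above with those two lines dropped would reduce to:

#!/bin/bash
# same as the script above, minus the external-cluster teardown/bring-up
REV=${1:-dff7490c57f309}
(
date
git checkout $REV
make clean
make quick-release
NUM_NODES=100 MASTER_SIZE=n1-standard-4 ./test/kubemark/stop-kubemark.sh
NUM_NODES=100 MASTER_SIZE=n1-standard-4 ./test/kubemark/start-kubemark.sh
test/kubemark/run-e2e-tests.sh
) 2>&1 | tee -a trylog.$REV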

wojtek-t commented 8 years ago

@rsc - BTW, can you please ensure that you don't have any remaining kubemark-related machines in your project? Thanks!

rsc commented 8 years ago

@wojtek-t pretend I know nothing about gcloud or kubernetes. What command would you like me to run? :-)

wojtek-t commented 8 years ago

@rsc - I would suggest using pantheon: Compute Engine > VM Instances

rsc commented 8 years ago

@wojtek-t, I'm running a kubemark test right now, but this is what I see in the cloud console. It seems OK to me. I'll report back whether the eventual output mentions redis. Note that I am using kubernetes/kubernetes@dff7490, which was a while ago. Maybe there was a usage of redis at that point?

[screenshot of the cloud console VM instances list, 2016-03-09 10:56 AM]

wojtek-t commented 8 years ago

Yes - the screenshot you pasted looks as expected.

Regarding redis - I don't think we have ever used redis in any default setup. If this happens again to you, I will try to debug it tomorrow. Thanks a lot for looking into it!

rsc commented 8 years ago

@wojtek-t This looks relevant: redis being started for the guestbook example app during the e2e tests.

Mar  7 15:58:56.531: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc create -f /usr/local/google/home/rsc/kubernetes/examples/guestbook-go/redis-master-controller.json --namespace=e2e-tests-kubectl-jw96o'
Mar  7 15:58:56.885: INFO: stdout: "replicationcontroller \"redis-master\" created\n"
Mar  7 15:58:56.885: INFO: stderr: ""
Mar  7 15:58:56.885: INFO: Running '/usr/local/google/home/rsc/kubernetes/_output/dockerized/bin/linux/amd64/kubectl --server=https://104.154.20.128 --kubeconfig=/usr/local/google/home/rsc/kubernetes/test/kubemark/kubeconfig.loc create -f /usr/local/google/home/rsc/kubernetes/examples/guestbook-go/redis-master-service.json --namespace=e2e-tests-kubectl-jw96o'
Mar  7 15:58:57.243: INFO: stdout: "service \"redis-master\" created\n"
Mar  7 15:58:57.243: INFO: stderr: ""
Mar  7 15:58:57.287: INFO: Waiting up to 5m0s for pod redis-master-ugeq9 status to be running
Mar  7 15:58:57.327: INFO: Waiting for pod redis-master-ugeq9 in namespace 'e2e-tests-kubectl-jw96o' status to be 'running'(found phase: "Pending", readiness: false) (39.138172ms elapsed)
Mar  7 15:58:59.366: INFO: Found pod 'redis-master-ugeq9' on node '10.245.2.28'

wojtek-t commented 8 years ago

@rsc - I see, but those tests shouldn't even start in this case. I'm not sure which branch you are running the tests at, but the only test that you should be running is: "should allow starting 30 pods per node" https://github.com/kubernetes/kubernetes/blob/master/test/kubemark/run-e2e-tests.sh#L37

I don't know why other tests are also being started - it seems you are not passing any arguments to test/kubemark/run-e2e-tests.sh

rsc commented 8 years ago

@wojtek-t I see. That redis is from when I was passing --delete-namespace=false, which is suggested in kubemark-guide.md but wipes out the default test restriction you pointed at. I no longer use that flag - I just run it with no arguments - and I can now confirm that I haven't seen redis again. That mystery is solved at least.
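
In invocation terms, the difference was just:

# before: any explicit argument replaces the default "30 pods per node" focus
test/kubemark/run-e2e-tests.sh --delete-namespace=false
# now: no arguments, so only the default density test runs
test/kubemark/run-e2e-tests.sh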

wojtek-t commented 8 years ago

@rsc - yeah, that makes perfect sense. Thanks!

wojtek-t commented 8 years ago

@rsc - one more thing to note here - we are not using Go compiler from your workstation. We do all the builds in Docker, which means that if you want to compile kubernetes with a different Go version, you need to modify this: https://github.com/kubernetes/kubernetes/blob/master/build/build-image/cross/Dockerfile#L18
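
The exact contents of that line depend on the kubernetes revision, so the following is only a hypothetical sketch of the kind of edit meant here - the image tag and version numbers are illustrative, not taken from this thread:

# hypothetical: switch the cross-build image's Go toolchain by editing
# the golang base image tag referenced on that Dockerfile line
sed -i 's/golang:1\.4\.2/golang:1.6.0/' build/build-image/cross/Dockerfile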

rsc commented 8 years ago

Thanks, yes, I found that and have been modifying it to flip between 1.4.2, 1.4.3, and 1.5.3. So far I seem to be able to reproduce the minor differences in 100-node tests and I found the kube-apiserver.log that has the raw data in it. This is all good. I am waiting on a quota bump so I can run 1000-node tests.

wojtek-t commented 8 years ago

@rsc - I've just sent you an invite to my project where I have enough quota to run 1000-node kubemarks. Feel free to use it.

wojtek-t commented 8 years ago

@rsc - were you able to reproduce the issue? Anything I can help with?

rsc commented 8 years ago

Hi @wojtek-t. Yes, I've been able to reproduce the observed kubemark latency increases on 1000-node systems. I'm still using your GCE project. Still gathering data.

Tentatively, it looks like the difference in average latency from Go 1.4.2 to Go 1.5.3 was mainly due to Go 1.5.3 doing a better job of keeping heap size at 2 x live data size, but that meant more frequent garbage collection and therefore increased overall garbage collection overhead. Running Go 1.5.3 with GOGC=300 (target 4x instead of 2x) brings things back to roughly Go 1.4 levels. On Kubemark that makes plenty of sense, because the benchmark is using about 2 GB of memory on an otherwise mostly idle 120 GB (n1-standard-32) machine; letting it use 4 GB instead is clearly reasonable for improved latency. It's not clear to me whether kube-apiserver should do that in general, since I don't know whether it usually runs on such idle machines.
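
As a side note, GOGC is just an environment variable read by the Go runtime at process startup, so this kind of experiment needs no code change; a minimal sketch, with the binary path as a placeholder rather than anything from the thread:

# GOGC=100 (the default) lets the heap grow to ~2x live data; GOGC=300 targets ~4x
GOGC=300 ./kube-apiserver   # placeholder invocation; the usual flags are omitted here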

I also tried with Go 1.6.0, and that looks like it is doing a better job at latency even without any GOGC tuning, so my suggestion would be to jump from Go 1.4.2 to Go 1.6.0 and not bother trying to game the GOGC setting.

This is what I've seen so far:

Version     GOGC       P50   / P90   / P99

Go 1.4.2    GOGC=100   132ms / 248ms / 468ms
Go 1.4.2    GOGC=100   129ms / 261ms / 473ms

Go 1.4.3    GOGC=100   148ms / 311ms / 468ms

Go 1.5.3    GOGC=100   123ms / 410ms / 713ms
Go 1.5.3    GOGC=300   141ms / 177ms / 547ms 
Go 1.5.3    GOGC=900   112ms / 136ms / 412ms

Go 1.6.0    GOGC=100   117ms / 252ms / 449ms

So Go 1.6.0 is by default about as good as Go 1.4.2 for the 'LIST nodes' metric.

Unfortunately, the test using Go 1.6.0 is now failing, complaining about pod startup latency. Normally you see measured latencies of 1-2 seconds, and the limit in the test is 5 seconds, but the claim in the failure is that some pod startup latencies are measured in minutes. That would obviously be bad, but I can't tell whether it's true. For the 1000-node test starting 30000 pods, both the passing and the failing tests take about 35 minutes to start all the pods. That total time doesn't change even though the reported startup latency changes dramatically. I haven't worked out how the time is measured or what might influence it. If you have any ideas about that, I'd love to hear them. Right now I think that if we can figure out the pod startup latency issue then the API call latency problems are basically fixed by Go 1.6.

Separately, from looking at CPU profiles to try to understand the Go 1.5.3 delta I've identified a number of simple optimizations in the kubernetes code that would likely improve API call latency across the board. I will measure and send those once all the Go-specific stuff is done.

wojtek-t commented 8 years ago

@rsc - awesome debugging thanks!

Regarding Go 1.6.0 - are those failures repeatable all the time? Because we recently (~2-3 weeks ago) fixed a bug that caused the pod startup test to fail with 1-2 minute latencies from time to time. IIUC, you were using some old branch, so that fix may not be there. So if it is sometimes failing, sometimes passing, this is probably our bug (already fixed at head).

> Separately, from looking at CPU profiles to try to understand the Go 1.5.3 delta I've identified a number of simple optimizations in the kubernetes code that would likely improve API call latency across the board. I will measure and send those once all the Go-specific stuff is done.

Great - looking forward to them and happy to review them.

I have one more question. From what you wrote above, it seems you don't observe any difference when changing to Go 1.4.3, right? From what I know, we were observing some differences. But I was on paternity leave for the whole of January, when this was observed, so this is only what I've heard. If that's not true in your experiments, I will try to verify it again tomorrow or early next week.

rsc commented 8 years ago

@wojtek-t, I have two 1.4.2 runs and one 1.4.3 run in the numbers above. There does appear to be some difference in that one run at the 50th and 90th percentile, though not 99th. I don't have enough data to know if that's a real measurement or just noise.

gmarek commented 8 years ago

We'll try to recreate the problem on Monday.

rsc commented 8 years ago

I repeated the kubemark test with Go 1.6.0 on top of current master and it passes. Both pod latency and API call latency are considered OK by the test. The API latency is higher than what I was seeing before, but the test also does a different set of operations than it used to, which could easily affect the overall distribution in either direction. I am running 1.6.0 and 1.4.2 a bunch of times to gather actual statistics. But at least if kubemark is the bar then it looks like Go 1.6.0 is acceptable. I'll have more about 1.4.2 vs 1.6.0 on the current kubemark master after the weekend. Still using your GCE project.

FWIW, I noticed that recent changes to the build process mean that 'make quick-release' assumes it has write access to gcr.io/google_containers. I patched around it (rsc/kubernetes@f6dc8f6) but it seems like a potential mistake waiting to happen, especially since there's nothing forcing build/build-image/cross/Dockerfile and build/build-image/cross/VERSION to stay in sync. It's easy to imagine someone updating Dockerfile and forgetting to update VERSION and thereby accidentally replacing, say, kube-cross:v1.4.2-1 with an image using Go 1.6. Or maybe I misunderstand and they are immutable? Regardless, it also makes it harder for non-k8s engineers like me to run tests.

lavalamp commented 8 years ago

I agree that's not good (even though we have to sign into a separate account to push). Filed https://github.com/kubernetes/kubernetes/issues/22868, thanks.

(Note that some containers are merely tagged as being gcr.io/google_containers and then shipped off to the nodes manually (scp) - they should not actually be pushed to our registry.)

luxas commented 8 years ago

/cc @luxas

Cross-ref to the go1.6 PR: kubernetes/kubernetes#22149