oomichi / try-kubernetes


2 test failures of [sig-instrumentation] MetricsGrabber #45

Closed oomichi closed 5 years ago

oomichi commented 6 years ago

Summary

oomichi commented 6 years ago
~ Failure [6.451 seconds]
[sig-instrumentation] MetricsGrabber
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/instrumentation/common/framework.go:23
  should grab all metrics from a Scheduler. [It]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/instrumentation/monitoring/metrics_grabber.go:61

  Expected error:
      <*errors.StatusError | 0xc421a2c3f0>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {SelfLink: "", ResourceVersion: "", Continue: ""},
              Status: "Failure",
              Message: "the server is currently unable to handle the request (get pods kube-scheduler-k8s-master:10251)",
              Reason: "ServiceUnavailable",
              Details: {
                  Name: "kube-scheduler-k8s-master:10251",
                  Group: "",
                  Kind: "pods",
                  UID: "",
                  Causes: [
                      {
                          Type: "UnexpectedServerResponse",
                          Message: "Error: 'dial tcp 192.168.1.108:10251: connect: connection refused'\nTrying to reach: 'http://192.168.1.108:10251/metrics'",
                          Field: "",
                      },
                  ],
                  RetryAfterSeconds: 0,
              },
              Code: 503,
          },
      }
      the server is currently unable to handle the request (get pods kube-scheduler-k8s-master:10251)
  not to have occurred

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/instrumentation/monitoring/metrics_grabber.go:78
oomichi commented 6 years ago

This shows where kube-scheduler is listening. → It LISTENs on localhost, so the e2e test running on another host can't connect to it...

$ sudo netstat -anp | grep 10251
tcp        0      0 127.0.0.1:10251         0.0.0.0:*               LISTEN      2491/kube-scheduler
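
The netstat line above can be read mechanically: a socket bound to a loopback address only accepts connections addressed to loopback. A minimal sketch of that check follows; `reachableFromOtherHosts` is a hypothetical helper written for this note, not part of Kubernetes.

```go
package main

import (
	"fmt"
	"net"
)

// reachableFromOtherHosts reports whether a listen address such as
// "127.0.0.1:10251" can accept connections arriving on a non-loopback
// interface. Hypothetical helper for illustration only.
func reachableFromOtherHosts(listenAddr string) (bool, error) {
	host, _, err := net.SplitHostPort(listenAddr)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false, fmt.Errorf("not an IP address: %q", host)
	}
	// A loopback bind only serves connections addressed to 127.0.0.0/8 or ::1.
	return !ip.IsLoopback(), nil
}

func main() {
	for _, addr := range []string{"127.0.0.1:10251", "192.168.1.108:10251", "0.0.0.0:10251"} {
		ok, err := reachableFromOtherHosts(addr)
		fmt.Println(addr, ok, err)
	}
}
```

Only the first address, the one kube-scheduler is using here, is reported as unreachable from outside the host.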
oomichi commented 6 years ago

Here is the test code. The comment in it says it proxies "to Pod (kube-scheduler in this case?) through the API server".

 61         gin.It("should grab all metrics from a Scheduler.", func() {
 62                 gin.By("Proxying to Pod through the API server")
 63                 // Check if master Node is registered
 64                 nodes, err := c.CoreV1().Nodes().List(metav1.ListOptions{})
 65                 framework.ExpectNoError(err)
 66
 67                 var masterRegistered = false
 68                 for _, node := range nodes.Items {
 69                         if strings.HasSuffix(node.Name, "master") {
 70                                 masterRegistered = true
 71                         }
 72                 }
 73                 if !masterRegistered {
 74                         framework.Logf("Master is node api.Registry. Skipping testing Scheduler metrics.")
 75                         return
 76                 }
 77                 response, err := grabber.GrabFromScheduler()
 78                 framework.ExpectNoError(err)
 79                 gom.Expect(response).NotTo(gom.BeEmpty())
 80         })
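
For reference, the master check at lines 67-72 is only a suffix match on the node name, so this cluster does not take the skip path. A standalone sketch of that check follows; `masterRegistered` as a function and the node name `k8s-node01` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// masterRegistered mirrors the loop in the test: a node counts as the
// master when its name ends with "master".
func masterRegistered(nodeNames []string) bool {
	for _, name := range nodeNames {
		if strings.HasSuffix(name, "master") {
			return true
		}
	}
	return false
}

func main() {
	// "k8s-master" is taken from the pod name in the failure output;
	// "k8s-node01" is a hypothetical worker name.
	fmt.Println(masterRegistered([]string{"k8s-master", "k8s-node01"}))
	fmt.Println(masterRegistered([]string{"k8s-node01"}))
}
```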

The function in question:

func (g *MetricsGrabber) GrabFromScheduler() (SchedulerMetrics, error) {
        if !g.registeredMaster {
                return SchedulerMetrics{}, fmt.Errorf("Master's Kubelet is not registered. Skipping Scheduler's metrics gathering.")
        }
        output, err := g.getMetricsFromPod(g.client, fmt.Sprintf("%v-%v", "kube-scheduler", g.masterName), metav1.NamespaceSystem, ports.SchedulerPort)
        if err != nil {
                return SchedulerMetrics{}, err
        }
        return parseSchedulerMetrics(output)
}

And further down:

func (g *MetricsGrabber) getMetricsFromPod(client clientset.Interface, podName string, namespace string, port int) (string, error) {
        rawOutput, err := client.CoreV1().RESTClient().Get().
                Namespace(namespace).
                Resource("pods").
                SubResource("proxy").
                Name(fmt.Sprintf("%v:%v", podName, port)).
                Suffix("metrics").
                Do().Raw()
        if err != nil {
                return "", err
        }
        return string(rawOutput), nil
}
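
To make the request concrete, here is a rough sketch of the path that builder chain should produce for the scheduler. `podProxyPath` is a hypothetical helper that only does the string formatting, not the client-go implementation; the values `kube-system` and `10251` are what `metav1.NamespaceSystem` and `ports.SchedulerPort` resolve to in this run.

```go
package main

import "fmt"

// podProxyPath rebuilds the request path for the chain
// Namespace(ns).Resource("pods").SubResource("proxy").Name("<pod>:<port>").Suffix(suffix).
// Hypothetical helper; client-go assembles the real URL itself.
func podProxyPath(namespace, podName string, port int, suffix string) string {
	return fmt.Sprintf("/api/v1/namespaces/%s/pods/%s:%d/proxy/%s", namespace, podName, port, suffix)
}

func main() {
	masterName := "k8s-master"
	// Same pod-name construction as GrabFromScheduler.
	podName := fmt.Sprintf("%v-%v", "kube-scheduler", masterName)
	fmt.Println(podProxyPath("kube-system", podName, 10251, "metrics"))
}
```

The pod name and port end up in a single path segment (`kube-scheduler-k8s-master:10251`), which is why the error message reports that string as the resource name.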
oomichi commented 6 years ago
$ kubectl describe pod kube-scheduler-k8s-master -n=kube-system
Name:               kube-scheduler-k8s-master
Namespace:          kube-system
Priority:           2000000000
...
    Liveness:     http-get http://127.0.0.1:10251/healthz delay=15s timeout=15s period=10s #success=1 #failure=8
...

curl against 127.0.0.1 succeeds, but curl against 192.168.1.108 fails. As shown above, this is because kube-scheduler LISTENs on 127.0.0.1.

$ curl http://127.0.0.1:10251/healthz
ok
$
$ curl http://192.168.1.108:10251/healthz
curl: (7) Failed to connect to 192.168.1.108 port 10251: Connection refused
$
oomichi commented 6 years ago
                        Message: "Error: 'dial tcp 192.168.1.108:10251: connect: connection refused'\nTrying to reach: 'http://192.168.1.108:10251/metrics'",

Where does the 192.168.1.108 in the message above come from?

oomichi commented 6 years ago

The API call in question:

I0822 02:20:16.133328   20404 round_trippers.go:383] GET https://192.168.1.108:6443/api/v1/namespaces/kube-system/pods/kube-scheduler-k8s-master:10251/proxy/metrics
I0822 02:20:16.133341   20404 round_trippers.go:390] Request Headers:
I0822 02:20:16.133345   20404 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf, */*
I0822 02:20:16.133349   20404 round_trippers.go:393]     User-Agent: e2e.test/v1.11.1 (linux/amd64) kubernetes/d0b061a
I0822 02:20:16.136376   20404 round_trippers.go:408] Response Status: 503 Service Unavailable in 3 milliseconds
I0822 02:20:16.136387   20404 round_trippers.go:411] Response Headers:
I0822 02:20:16.136391   20404 round_trippers.go:414]     Content-Type: text/plain; charset=utf-8
I0822 02:20:16.136394   20404 round_trippers.go:414]     Content-Length: 120
I0822 02:20:16.136398   20404 round_trippers.go:414]     Date: Wed, 22 Aug 2018 02:20:16 GMT
I0822 02:20:16.136412   20404 request.go:897] Response Body: Error: 'dial tcp 192.168.1.108:10251: connect: connection refused'
Trying to reach: 'http://192.168.1.108:10251/metrics'
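
A likely answer to the 192.168.1.108 question, stated as an assumption rather than something confirmed in this thread: the API server proxies to the pod's IP, and a static pod that runs with `hostNetwork: true` reports the node's own address as its pod IP. Under that assumption, the upstream URL in the response body can be sketched like this; `proxyTarget` is a hypothetical helper, not the API server's code.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// proxyTarget sketches how the upstream URL could be derived from the
// pod's IP plus the port and path taken from the proxy request.
// Hypothetical helper for illustration only.
func proxyTarget(podIP string, port int, path string) string {
	u := url.URL{
		Scheme: "http",
		Host:   net.JoinHostPort(podIP, strconv.Itoa(port)),
		Path:   path,
	}
	return u.String()
}

func main() {
	// Assumption: kube-scheduler runs with hostNetwork, so its pod IP
	// equals the master node's address, 192.168.1.108.
	fmt.Println(proxyTarget("192.168.1.108", 10251, "/metrics"))
}
```

If that holds, the proxy dials the node address while the scheduler only listens on 127.0.0.1, which would match the "connection refused" in the log.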
oomichi commented 6 years ago

For now, modify the manifest:

$ sudo diff -u etc/kube-scheduler.yaml.orig /etc/kubernetes/manifests/kube-scheduler.yaml
--- etc/kube-scheduler.yaml.orig        2018-08-22 02:30:16.060204589 +0000
+++ /etc/kubernetes/manifests/kube-scheduler.yaml       2018-08-22 02:30:38.160555932 +0000
@@ -13,7 +13,7 @@
   containers:
   - command:
     - kube-scheduler
-    - --address=127.0.0.1
+    - --address=192.168.1.108
     - --kubeconfig=/etc/kubernetes/scheduler.conf
     - --leader-elect=true
     image: k8s.gcr.io/kube-scheduler-amd64:v1.11.1

Confirm that the scheduler is LISTENing on the specified address:

$ sudo netstat -anp | grep 10251
tcp        0      0 192.168.1.108:10251     0.0.0.0:*               LISTEN      6468/kube-scheduler

Run the test → confirmed that it now passes:

$ go run hack/e2e.go -- --provider=skeleton --test --test_args="--ginkgo.focus=should\sgrab\sall\smetrics\sfrom\sa\sScheduler" --check-version-skew=false
...
~ [SLOW TEST:6.274 seconds]
[sig-instrumentation] MetricsGrabber
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/instrumentation/common/framework.go:23
  should grab all metrics from a Scheduler.
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/instrumentation/monitoring/metrics_grabber.go:61
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSAug 22 02:33:23.242: INFO: Running AfterSuite actions on all node
Aug 22 02:33:23.242: INFO: Running AfterSuite actions on node 1

Ran 1 of 999 Specs in 6.386 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 998 Skipped PASS

Ginkgo ran 1 suite in 6.678185057s
Test Suite Passed
2018/08/22 02:33:23 process.go:155: Step './hack/ginkgo-e2e.sh --ginkgo.focus=should\sgrab\sall\smetrics\sfrom\sa\sScheduler' finished in 6.716759429s
2018/08/22 02:33:23 e2e.go:83: Done
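
A note on the `--ginkgo.focus` value used in the run above: it is a regular expression matched against the spec description, with `\s` standing in for spaces so the value survives shell quoting. A small sketch of how that pattern selects only the Scheduler spec follows; `focusMatches` is a hypothetical helper, and Ginkgo's own matching may differ in detail.

```go
package main

import (
	"fmt"
	"regexp"
)

// focusMatches reports whether a spec description matches a focus pattern.
// Hypothetical helper; it only demonstrates the regular expression itself.
func focusMatches(pattern, spec string) bool {
	return regexp.MustCompile(pattern).MatchString(spec)
}

func main() {
	focus := `should\sgrab\sall\smetrics\sfrom\sa\sScheduler`
	specs := []string{
		"[sig-instrumentation] MetricsGrabber should grab all metrics from a Scheduler.",
		"[sig-instrumentation] MetricsGrabber should grab all metrics from a ControllerManager.",
	}
	for _, s := range specs {
		fmt.Println(focusMatches(focus, s), s)
	}
}
```

The ControllerManager spec does not match this pattern, so a passing run with this focus says nothing about that test.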
oomichi commented 6 years ago

Asking in k/kubernetes/issues/67685 whether the change above is the right approach.

oomichi commented 5 years ago

Because the following test is still failing:

[Fail] [sig-instrumentation] MetricsGrabber [It] should grab all metrics from a ControllerManager.