oomichi / try-kubernetes


Test Failure: [HPA] Horizontal pod autoscaling (scale resource: CPU) #18

Closed: oomichi closed this issue 6 years ago

oomichi commented 6 years ago

Conclusion

This is one of the issues tracked in https://github.com/oomichi/try-kubernetes/issues/17.

The test fails because of:

I0404 22:01:33.211083   14594 runners.go:175] Pod rs-cz9bx      k8s-v110-flannel-worker Running <nil>
I0404 22:01:33.211152   14594 runners.go:175] Pod rs-fm7c8      k8s-v110-flannel-worker Running <nil>
I0404 22:01:33.211163   14594 runners.go:175] Pod rs-g96bg      k8s-v110-flannel-worker Running <nil>
I0404 22:01:33.211180   14594 runners.go:175] Pod rs-j2dq9              Pending <nil>
I0404 22:01:33.211189   14594 runners.go:175] Pod rs-wcpmh      k8s-v110-flannel-worker Running <nil>
Apr  4 22:01:33.211: INFO: Unexpected error occurred: Only 4 pods started out of 5

and that happens because there are not enough schedulable compute nodes, as shown by:

Apr  4 22:01:33.225: INFO: At 2018-04-04 21:59:23 +0000 UTC - event for rs-j2dq9: {default-scheduler } FailedScheduling: 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taints that the pod didn't tolerate.

This issue still occurs with Kubernetes v1.11.
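
To confirm the scheduling bottleneck, the nodes' taints and allocatable CPU can be checked directly; a sketch with standard kubectl queries:

$ kubectl describe nodes | grep -E '^(Name|Taints):'
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\n"}{end}'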

oomichi commented 6 years ago

All 8 HPA tests failed:

$ go run hack/e2e.go -- --provider=skeleton --test --test_args="--ginkgo.focus=\[HPA\]"
[..]
Summarizing 8 Failures:

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController [It] Should scale from 5 pods to 3 pods and from 3 to 1 and verify decision stability
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:448

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicaSet [It] Should scale from 1 pod to 3 pods and from 3 to 5
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] Deployment [It] Should scale from 5 pods to 3 pods and from 3 to 1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:454

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] ReplicationController light [It] Should scale from 2 pods to 1 pod
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] ReplicationController light [It] Should scale from 1 pod to 2 pods
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] Deployment [It] Should scale from 1 pod to 3 pods and from 3 to 5
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicaSet [It] Should scale from 5 pods to 3 pods and from 3 to 1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:461

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController [It] Should scale from 1 pod to 3 pods and from 3 to 5 and verify decision stability
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

Ran 8 of 877 Specs in 5200.752 seconds
FAIL! -- 0 Passed | 8 Failed | 0 Pending | 869 Skipped --- FAIL: TestE2E (5200.77s)
FAIL

Ginkgo ran 1 suite in 1h26m40.988437641s
Test Suite Failed 
oomichi commented 6 years ago

After increasing the target cluster to 2 compute nodes (previously 1), the following insufficient-CPU error no longer occurs. However, a different error still happens and the HPA tests still do not pass:

Unschedulable 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taints that the pod didn't tolerate.
oomichi commented 6 years ago
Apr 24 23:11:25.268: INFO: waiting for 2 replicas (current: 1)
Apr 24 23:11:45.267: INFO: waiting for 2 replicas (current: 1)
Apr 24 23:11:45.270: INFO: waiting for 2 replicas (current: 1)
Apr 24 23:11:45.270: INFO: Unexpected error occurred: timed out waiting for the condition
[..]
Logging pods the kubelet thinks is on node k8s-v110-flannel-worker02
Apr 24 23:12:01.679: INFO: kube-flannel-ds-kqtts started at 2018-04-24 20:59:44 +0000 UTC (1+1 container statuses recorded)
Apr 24 23:12:01.679: INFO:      Init container install-cni ready: true, restart count 0
Apr 24 23:12:01.679: INFO:      Container kube-flannel ready: true, restart count 1
Apr 24 23:12:01.679: INFO: kube-proxy-gsqtb started at 2018-04-24 20:59:44 +0000 UTC (0+1 container statuses recorded)
Apr 24 23:12:01.679: INFO:      Container kube-proxy ready: true, restart count 0
Apr 24 23:12:01.746: INFO:
Latency metrics for node k8s-v110-flannel-worker02
STEP: Dumping a list of prepulled images on each node...
Apr 24 23:12:01.750: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
STEP: Destroying namespace "e2e-tests-horizontal-pod-autoscaling-hxfzj" for this suite.
Apr 24 23:12:07.781: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Apr 24 23:12:07.957: INFO: namespace: e2e-tests-horizontal-pod-autoscaling-hxfzj, resource: bindings, ignored listing per whitelist
Apr 24 23:12:07.991: INFO: namespace e2e-tests-horizontal-pod-autoscaling-hxfzj deletion completed in 6.238444611s

• Failure [947.898 seconds]
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
  [sig-autoscaling] ReplicationController light
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
    Should scale from 1 pod to 2 pods [It]
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:70

    timeout waiting 15m0s for 2 replicas
    Expected error:
        <*errors.errorString | 0xc4200c35a0>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    not to have occurred

    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123
------------------------------
SSSSSSSSSSSSSSApr 24 23:12:07.991: INFO: Running AfterSuite actions on all node
Apr 24 23:12:07.992: INFO: Running AfterSuite actions on node 1

The corresponding test code is:

 70                 It("Should scale from 1 pod to 2 pods", func() {
 71                         scaleTest := &HPAScaleTest{
 72                                 initPods:                    1,
 73                                 totalInitialCPUUsage:        150,
 74                                 perPodCPURequest:            200,
 75                                 targetCPUUtilizationPercent: 50,
 76                                 minPods:                     1,
 77                                 maxPods:                     2,
 78                                 firstScale:                  2,
 79                         }
 80                         scaleTest.run("rc-light", common.KindRC, rc, f)
 81                 })
oomichi commented 6 years ago

The common part of the HPA tests:

112 // run is a method which runs an HPA lifecycle, from a starting state, to an expected
113 // The initial state is defined by the initPods parameter.
114 // The first state change is due to the CPU being consumed initially, which HPA responds to by changing pod counts.
115 // The second state change (optional) is due to the CPU burst parameter, which HPA again responds to.
116 // TODO The use of 3 states is arbitrary, we could eventually make this test handle "n" states once this test stabilizes.
117 func (scaleTest *HPAScaleTest) run(name string, kind schema.GroupVersionKind, rc *common.ResourceConsumer, f *framework.Framework) {
118         const timeToWait = 15 * time.Minute
119         rc = common.NewDynamicResourceConsumer(name, f.Namespace.Name, kind, int(scaleTest.initPods), int(scaleTest.totalInitialCPUUsage), 0, 0, scaleTest.perPodCPURequest, 200, f.ClientSet, f.InternalClientset, f.ScalesGetter)
120         defer rc.CleanUp()
121         hpa := common.CreateCPUHorizontalPodAutoscaler(rc, scaleTest.targetCPUUtilizationPercent, scaleTest.minPods, scaleTest.maxPods)
122         defer common.DeleteHorizontalPodAutoscaler(rc, hpa.Name)
123         rc.WaitForReplicas(int(scaleTest.firstScale), timeToWait)

This waits for the initial condition (firstScale) to be reached before autoscaling kicks in. The timeout failure happens here because the observed replica count never matches the expected one (a rough shell analogue of this wait is sketched after the listing).

124         if scaleTest.firstScaleStasis > 0 {
125                 rc.EnsureDesiredReplicas(int(scaleTest.firstScale), scaleTest.firstScaleStasis)
126         }
127         if scaleTest.cpuBurst > 0 && scaleTest.secondScale > 0 {
128                 rc.ConsumeCPU(scaleTest.cpuBurst)

Generate the CPU peak (cpuBurst).

129                 rc.WaitForReplicas(int(scaleTest.secondScale), timeToWait)

Wait for the autoscaling to happen.

130         }
131 }
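
rc.WaitForReplicas essentially polls the target's replica count until it reaches the expected value or the 15-minute timeout expires (the "waiting for 2 replicas" log lines earlier in this issue show a 20-second polling interval). A rough shell analogue for the "1 pod to 2 pods" case, assuming the rc-light ReplicationController as the target and an illustrative jsonpath query:

# Poll rc-light until it reports 2 ready replicas or 15 minutes pass.
end=$((SECONDS + 900))
while [ "$SECONDS" -lt "$end" ]; do
  current=$(kubectl get rc rc-light -o jsonpath='{.status.readyReplicas}')
  echo "waiting for 2 replicas (current: ${current:-0})"
  [ "${current:-0}" -ge 2 ] && break
  sleep 20
done
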
oomichi commented 6 years ago

Must read: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

$ kubectl autoscale rc foo --min=2 --max=5 --cpu-percent=80

This seems to match the following test code:

121         hpa := common.CreateCPUHorizontalPodAutoscaler(rc, scaleTest.targetCPUUtilizationPercent, scaleTest.minPods, scaleTest.maxPods)
oomichi commented 6 years ago

We can specify a single test with a regex like:

$ go run hack/e2e.go --provider=skeleton --test --test_args="--ginkgo.focus=1\spod\sto\s2\spods --ginkgo.dryRun=true"
oomichi commented 6 years ago

Result on k8s v1.11.1: fails with a timeout

$ go run hack/e2e.go -- --provider=skeleton --test --test_args="--ginkgo.focus=1\spod\sto\s2\spods"
2018/08/01 18:00:15 e2e.go:79: Calling kubetest --verbose-commands=true --provider=skeleton --test --test_args=--ginkgo.focus=1\spod\sto\s2\spods...
2018/08/01 18:00:15 process.go:153: Running: ./hack/e2e-internal/e2e-status.sh
Skeleton Provider: prepare-e2e not implemented
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-08-01T02:07:33Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
2018/08/01 18:00:15 process.go:155: Step './hack/e2e-internal/e2e-status.sh' finished in 230.118832ms
2018/08/01 18:00:15 process.go:153: Running: ./cluster/kubectl.sh --match-server-version=false version
2018/08/01 18:00:15 process.go:155: Step './cluster/kubectl.sh --match-server-version=false version' finished in 188.470631ms
2018/08/01 18:00:15 process.go:153: Running: ./hack/ginkgo-e2e.sh --ginkgo.focus=1\spod\sto\s2\spods
Conformance test: not doing test setup.
Aug  1 18:00:16.247: INFO: Overriding default scale value of zero to 1
Aug  1 18:00:16.247: INFO: Overriding default milliseconds value of zero to 5000
I0801 18:00:16.332682   31092 e2e.go:333] Starting e2e run "bec5c789-95b4-11e8-b4cc-fa163e738a69" on Ginkgo node 1
Running Suite: Kubernetes e2e suite
===================================
Random Seed: 1533146416 - Will randomize all specs
Will run 1 of 999 specs

Aug  1 18:00:16.356: INFO: >>> kubeConfig: /home/ubuntu/admin.conf
Aug  1 18:00:16.358: INFO: Waiting up to 30m0s for all (but 0) nodes to be schedulable
Aug  1 18:00:16.397: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug  1 18:00:16.451: INFO: 10 / 10 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug  1 18:00:16.451: INFO: expected 2 pod replicas in namespace 'kube-system', 2 are Running and Ready.
Aug  1 18:00:16.466: INFO: Waiting for pods to enter Success, but no pods in "kube-system" match label map[name:e2e-image-puller]
Aug  1 18:00:16.466: INFO: Dumping network health container logs from all nodes...
Aug  1 18:00:16.479: INFO: e2e test version: v1.11.1
Aug  1 18:00:16.484: INFO: kube-apiserver version: v1.11.1
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] ReplicationController light
  Should scale from 1 pod to 2 pods
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:70
[BeforeEach] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU)
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:141
STEP: Creating a kubernetes client
Aug  1 18:00:16.485: INFO: >>> kubeConfig: /home/ubuntu/admin.conf
STEP: Building a namespace api object
Aug  1 18:00:16.622: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[It] Should scale from 1 pod to 2 pods
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:70
STEP: Running consuming RC rc-light via /v1, Kind=ReplicationController with 1 replicas
STEP: creating replication controller rc-light in namespace e2e-tests-horizontal-pod-autoscaling-dklx7
I0801 18:00:16.645232   31092 runners.go:177] Created replication controller with name: rc-light, namespace: e2e-tests-horizontal-pod-autoscaling-dklx7, replica count: 1
I0801 18:00:26.695869   31092 runners.go:177] rc-light Pods: 1 out of 1 created, 1 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
STEP: Running controller
STEP: creating replication controller rc-light-ctrl in namespace e2e-tests-horizontal-pod-autoscaling-dklx7
I0801 18:00:26.767988   31092 runners.go:177] Created replication controller with name: rc-light-ctrl, namespace: e2e-tests-horizontal-pod-autoscaling-dklx7, replica count: 1
I0801 18:00:36.819542   31092 runners.go:177] rc-light-ctrl Pods: 1 out of 1 created, 1 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
Aug  1 18:00:41.820: INFO: Waiting for amount of service:rc-light-ctrl endpoints to be 1
Aug  1 18:00:41.833: INFO: RC rc-light: consume 150 millicores in total
Aug  1 18:00:41.833: INFO: RC rc-light: setting consumption to 150 millicores in total
Aug  1 18:00:41.833: INFO: RC rc-light: consume 0 MB in total
Aug  1 18:00:41.833: INFO: RC rc-light: consume custom metric 0 in total
Aug  1 18:00:41.833: INFO: RC rc-light: setting bump of metric QPS to 0 in total
Aug  1 18:00:41.833: INFO: RC rc-light: sending request to consume 150 millicores
Aug  1 18:00:41.834: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=150&requestSizeMillicores=20 }
Aug  1 18:00:41.834: INFO: RC rc-light: sending request to consume 0 of custom metric QPS
Aug  1 18:00:41.834: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
Aug  1 18:00:41.834: INFO: RC rc-light: setting consumption to 0 MB in total
Aug  1 18:00:41.835: INFO: RC rc-light: sending request to consume 0 MB
Aug  1 18:00:41.835: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  1 18:00:41.865: INFO: waiting for 2 replicas (current: 1)
Aug  1 18:01:01.875: INFO: waiting for 2 replicas (current: 1)
Aug  1 18:01:11.871: INFO: RC rc-light: sending request to consume 0 MB
Aug  1 18:01:11.871: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  1 18:01:11.871: INFO: RC rc-light: sending request to consume 0 of custom metric QPS
Aug  1 18:01:11.871: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
Aug  1 18:01:12.042: INFO: RC rc-light: sending request to consume 150 millicores
Aug  1 18:01:12.042: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=150&requestSizeMillicores=20 }
Aug  1 18:01:21.870: INFO: waiting for 2 replicas (current: 1)
Aug  1 18:01:41.876: INFO: waiting for 2 replicas (current: 1)
Aug  1 18:01:41.894: INFO: RC rc-light: sending request to consume 0 of custom metric QPS
Aug  1 18:01:41.894: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
Aug  1 18:01:41.895: INFO: RC rc-light: sending request to consume 0 MB
Aug  1 18:01:41.895: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  1 18:01:42.241: INFO: RC rc-light: sending request to consume 150 millicores
Aug  1 18:01:42.241: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=150&requestSizeMillicores=20 }
Aug  1 18:02:01.871: INFO: waiting for 2 replicas (current: 1)
Aug  1 18:02:11.923: INFO: RC rc-light: sending request to consume 0 of custom metric QPS
Aug  1 18:02:11.923: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
Aug  1 18:02:11.923: INFO: RC rc-light: sending request to consume 0 MB
Aug  1 18:02:11.923: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  1 18:02:12.442: INFO: RC rc-light: sending request to consume 150 millicores
Aug  1 18:02:12.442: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-dklx7/services/rc-light-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=150&requestSizeMillicores=20 }
Aug  1 18:02:21.878: INFO: waiting for 2 replicas (current: 1)
...
oomichi commented 6 years ago

First, investigate how to enable HPA. → This can be checked with kubectl get hpa, and the test environment appears to have it enabled.

Official docs: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ — "HPA is implemented as a control loop, and its sync interval is controlled by the controller manager's --horizontal-pod-autoscaler-sync-period flag." The controller manager here is kube-controller-manager. Heapster also checks metrics periodically, but that is a separate mechanism.

"HPA is an API resource, defined under the autoscaling/v1 API. It can be operated entirely with kubectl: create an HPA with kubectl create, list HPAs with kubectl get hpa, inspect details with kubectl describe hpa, and delete one with kubectl delete hpa."
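
As a sketch, the declarative equivalent of the kubectl autoscale rc foo example quoted earlier is an autoscaling/v1 manifest roughly like the following (the name foo is taken from the docs example):

$ kubectl apply -f - <<EOF
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: foo
spec:
  scaleTargetRef:
    apiVersion: v1
    kind: ReplicationController
    name: foo
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
EOF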

Trying this on the environment where the problem occurs returns "No resources found.", but the API call itself succeeds with a 200, so the HPA API itself seems to exist:

$ kubectl get hpa -v=8
I0801 20:35:26.875828   28459 loader.go:359] Config loaded from file /home/ubuntu/admin.conf
I0801 20:35:26.877550   28459 loader.go:359] Config loaded from file /home/ubuntu/admin.conf
I0801 20:35:26.879167   28459 loader.go:359] Config loaded from file /home/ubuntu/admin.conf
I0801 20:35:26.885239   28459 loader.go:359] Config loaded from file /home/ubuntu/admin.conf
I0801 20:35:26.886342   28459 round_trippers.go:383] GET https://192.168.1.108:6443/apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers?limit=500
I0801 20:35:26.886372   28459 round_trippers.go:390] Request Headers:
I0801 20:35:26.886379   28459 round_trippers.go:393]     Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json
I0801 20:35:26.886385   28459 round_trippers.go:393]     User-Agent: kubectl/v1.11.1 (linux/amd64) kubernetes/b1b2997
I0801 20:35:26.899223   28459 round_trippers.go:408] Response Status: 200 OK in 12 milliseconds
I0801 20:35:26.899245   28459 round_trippers.go:411] Response Headers:
I0801 20:35:26.899256   28459 round_trippers.go:414]     Content-Type: application/json
I0801 20:35:26.899261   28459 round_trippers.go:414]     Content-Length: 2497
I0801 20:35:26.899265   28459 round_trippers.go:414]     Date: Wed, 01 Aug 2018 20:35:26 GMT
I0801 20:35:26.899464   28459 request.go:897] Response Body: {"kind":"Table","apiVersion":"meta.k8s.io/v1beta1","metadata":{"selfLink":"/apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers","resourceVersion":"1533485"},"columnDefinitions":[{"name":"Name","type":"string","format":"name","description":"Name must be unique within a namespace. Is required when creating resources, although some resources may allow a client to request the generation of an appropriate name automatically. Name is primarily intended for creation idempotence and configuration definition. Cannot be updated. More info: http://kubernetes.io/docs/user-guide/identifiers#names","priority":0},{"name":"Reference","type":"string","format":"","description":"scaleTargetRef points to the target resource to scale, and is used to the pods for which metrics should be collected, as well as to actually change the replica count.","priority":0},{"name":"Targets","type":"string","format":"","description":"metrics contains the specifications for which to use to calculate the desired replica count (the ma [truncated 1473 chars]
No resources found.

Fixed a documentation gap found in the course of this work: https://github.com/kubernetes/website/pull/9711

oomichi commented 6 years ago

Let's try HPA by hand.

  1. Create an nginx deployment with a single Pod
    $ kubectl run nginx --image=nginx
    $ kubectl get pods
    NAME                     READY     STATUS    RESTARTS   AGE
    nginx-64f497f8fd-l2wnb   1/1       Running   0          21h
    $
  2. Identify the ReplicaSet of the created Deployment → kubectl autoscale can also target the Deployment directly, so this step may be unnecessary
    $ kubectl get rs
    NAME               DESIRED   CURRENT   READY     AGE
    nginx-64f497f8fd   1         1         1         21h
  3. Configure autoscaling with a minimum of 1 Pod and a maximum of 3, scaling when CPU utilization exceeds 80%
    $ kubectl autoscale rs nginx-64f497f8fd --min=1 --max=3 --cpu-percent=80
    horizontalpodautoscaler.autoscaling/nginx-64f497f8fd autoscaled
  4. Check the HPA settings → the <unknown> target is concerning
    $ kubectl get hpa
    NAME               REFERENCE                     TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
    nginx-64f497f8fd   ReplicaSet/nginx-64f497f8fd   <unknown>/80%   1         3         1          4m
  5. Create a service for nginx so there is something to put load on
    $ kubectl expose deployment nginx --port 80
    service/nginx exposed
    $ kubectl get services
    NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
    nginx        ClusterIP   10.101.55.26   <none>        80/TCP    9s
  6. Confirm the service is reachable → endpoints are the URLs of individual pods, so the load target here is the service, which acts as a load balancer
    $ curl http://10.101.55.26:80
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    ...
  7. Apply load with httperf
    $ sudo apt-get install httperf
    $ httperf --hog --server 10.101.55.26 --port 80 --rate 300 --num-conn 30000 --num-call 5 --timeout 5

    In another console → the nginx load barely increases...

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    15607 ubuntu    20   0   16980   3040   2552 R  98.0  0.1   0:18.15 httperf
    10781 systemd+  20   0   33076   2968   1712 S  16.0  0.1   0:02.96 nginx
  8. Change the HPA CPU target to 50% and increase the httperf rate tenfold
    $ kubectl delete hpa nginx-64f497f8fd
    $ kubectl autoscale deployment nginx --min=1 --max=3 --cpu-percent=50
    $ httperf --hog --server 10.101.55.26 --port 80 --rate 3000 --num-conn 120000 --num-call 5 --timeout 5

    In another console

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    18466 ubuntu    20   0   18432   4324   2524 R  76.1  0.1   0:32.64 httperf
    10781 systemd+  20   0   33076   3292   1904 R  63.1  0.1   1:12.29 nginx

    CPU utilization stayed above 60% for more than 120 seconds, but no scale-out happened.
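
Since the TARGETS column stays <unknown> throughout, it is probably more useful to ask the HPA object itself why the metric cannot be read than to keep increasing the load; the conditions and events recorded on it usually give the reason. A sketch:

$ kubectl describe hpa nginx
$ kubectl get hpa nginx -o yaml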

oomichi commented 6 years ago

Is the Pod CPU utilization not being collected?

The kubectl help describes a related flag:

      --requests='': The resource requirement requests for this container.  For example, 'cpu=100m,memory=256Mi'.  Note that server side components may assign requests depending on the server configuration, such as limit ranges.
oomichi commented 6 years ago

Try again with the --requests flag.

First, clean up:

$ kubectl delete hpa nginx
$ kubectl delete service nginx
$ kubectl delete deployment nginx

Apply each setting again:

$ kubectl run nginx --image=nginx --requests='cpu=100m,memory=256Mi'
$ kubectl expose deployment nginx --port 80
$ kubectl autoscale deployment nginx --min=1 --max=3 --cpu-percent=50

Still <unknown>, as before:

$ kubectl get hpa
NAME      REFERENCE          TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
nginx     Deployment/nginx   <unknown>/50%   1         3         0          7s

On the Deployment side, the resource requests are set correctly:

$ kubectl describe deployment nginx
...
Pod Template:
  Labels:  run=nginx
  Containers:
   nginx:
    Image:      nginx
    Port:       <none>
    Host Port:  <none>
    Requests:
      cpu:        100m
      memory:     256Mi
oomichi commented 6 years ago

Related log entries appeared in /var/log/containers/kube-controller-manager-k8s-master_kube-system_kube-controller-manager-ecad4d8d03053b6ce0626f9ca60fec563f8b11a34e2b7446c5b40343fedf6a70.log:

{"log":"I0801 23:49:34.858775       1 event.go:221] Event(v1.ObjectReference{
     Kind:\"HorizontalPodAutoscaler\",
     Namespace:\"default\",
     Name:\"nginx\",
     UID:\"cba19f8b-95e3-11e8-ae67-fa163e420595\",
     APIVersion:\"autoscaling/v2beta1\",
     ResourceVersion:\"1547695\", FieldPath:\"\"}):
  type: 'Warning'
  reason: 'FailedComputeMetricsReplicas' failed to get cpu utilization: unable to get metrics for
   resource cpu: unable to fetch metrics from resource metrics API: the server could not find
   the requested resource (get pods.metrics.k8s.io)\n",
   "stream":"stderr",
   "time":"2018-08-01T23:52:35.04907021Z"}

There were reports that a system reboot fixes this, so I rebooted. → No good; the log above is still emitted and <unknown> is still displayed.

oomichi commented 6 years ago

Investigate how Pod CPU utilization metrics are collected.

There is kubectl top, but it fails, presumably because heapster is not deployed:

$ kubectl top pod -v=8
..
I0802 01:02:00.135619   23579 round_trippers.go:383] GET https://192.168.1.108:6443/api/v1/namespaces/kube-system/services/http:heapster:/proxy/apis/metrics/v1alpha1/nodes?labelSelector=
..
Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)

The above API call returns 404. From the error message in the log, the error originates in controller/podautoscaler/replica_calculator.go:

 54 // GetResourceReplicas calculates the desired replica count based on a target resource utilization percentage
 55 // of the given resource for pods matching the given selector in the given namespace, and the current replica count
 56 func (c *ReplicaCalculator) GetResourceReplicas(currentReplicas int32, targetUtilization int32, resource v1.ResourceName, namespace string, selector labels.Selector) (replicaCount int32, utilization int32, rawUtilization int64, timestamp time.Time, err error) {
 57         metrics, timestamp, err := c.metricsClient.GetResourceMetric(resource, namespace, selector)
 58         if err != nil {
 59                 return 0, 0, 0, time.Time{}, fmt.Errorf("unable to get metrics for resource %s: %v", resource, err)
 60         }
oomichi commented 6 years ago

In k/k issue #57996 there is the following comment:

The error (get pods.metrics.k8s.io) is due to metric server, as you are not running metric server. In Kubernetes 1.9+, HPA needs metric server to fetch the CPU usage.

So judging from https://kubernetes.io/docs/tasks/debug-application-cluster/core-metrics-pipeline/#metrics-server, a separate Metrics Server apparently needs to be running... If so, isn't it odd that metrics-server lives in kubernetes-incubator? HPA is a stable API, yet one of its building blocks is still in the incubator...
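
One way to check whether the resource metrics API that HPA relies on is registered at all is to query the aggregated API directly; a sketch (metrics-server serves the metrics.k8s.io group):

$ kubectl get apiservices v1beta1.metrics.k8s.io
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods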

For now, deploy metrics-server as instructed:

$ git clone https://github.com/kubernetes-incubator/metrics-server
$ cd metrics-server/
$ kubectl create -f deploy/1.8+/
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
serviceaccount/metrics-server created
deployment.extensions/metrics-server created
service/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created

No change. Tried a reboot. → Didn't help. Tried again from the nginx deployment onward. → Same situation.

oomichi commented 6 years ago

It turns out metrics-server is emitting errors; investigate the cause.

$ kubectl log metrics-server-86bd9d7667-ltlpz -n kube-system
...
I0802 01:38:30.801336       1 reststorage.go:140] No metrics for container nginx in pod default/nginx-5b669f8784-5lqpt
I0802 01:38:30.801429       1 reststorage.go:93] No metrics for pod default/nginx-5b669f8784-5lqpt
E0802 01:39:05.015074       1 summary.go:97] error while getting metrics summary from Kubelet k8s-master(192.168.1.108:10255): Get http://192.168.1.108:10255/stats/summary/: dial tcp 192.168.1.108:10255: getsockopt: connection refused
E0802 01:39:05.015334       1 summary.go:97] error while getting metrics summary from Kubelet k8s-node01(192.168.1.109:10255): Get http://192.168.1.109:10255/stats/summary/: dial tcp 192.168.1.109:10255: getsockopt: connection refused
E0802 01:40:05.009938       1 summary.go:97] error while getting metrics summary from Kubelet k8s-node01(192.168.1.109:10255): Get http://192.168.1.109:10255/stats/summary/: dial tcp 192.168.1.109:10255: getsockopt: connection refused
E0802 01:40:05.017074       1 summary.go:97] error while getting metrics summary from Kubelet k8s-master(192.168.1.108:10255): Get http://192.168.1.108:10255/stats/summary/: dial tcp 192.168.1.108:10255: getsockopt: connection refused
I0802 01:40:17.567445       1 reststorage.go:140] No metrics for container nginx in pod default/nginx-5b669f8784-9m2tr
I0802 01:40:17.567539       1 reststorage.go:93] No metrics for pod default/nginx-5b669f8784-9m2tr
I0802 01:40:47.621436       1 reststorage.go:140] No metrics for container nginx in pod default/nginx-5b669f8784-9m2tr
I0802 01:40:47.621522       1 reststorage.go:93] No metrics for pod default/nginx-5b669f8784-9m2tr
E0802 01:41:05.004503       1 summary.go:97] error while getting metrics summary from Kubelet k8s-node01(192.168.1.109:10255): Get http://192.168.1.109:10255/stats/summary/: dial tcp 192.168.1.109:10255: getsockopt: connection refused
E0802 01:41:05.021225       1 summary.go:97] error while getting metrics summary from Kubelet k8s-master(192.168.1.108:10255): Get http://192.168.1.108:10255/stats/summary/: dial tcp 192.168.1.108:10255: getsockopt: connection refused
I0802 01:41:17.655429       1 reststorage.go:140] No metrics for container nginx in pod default/nginx-5b669f8784-9m2tr
I0802 01:41:17.655552       1 reststorage.go:93] No metrics for pod default/nginx-5b669f8784-9m2tr

It looks like the kubelet configuration needs to be changed (kubespray/issues/2550):

This is due to the fact that the kublet read only port was disabled by default.
You can enable the read-only port in all.yml with: kube_read_only_port: 10255

The setting corresponds to this flag in cmd/kubelet/app/options/options.go:

452         fs.Int32Var(&c.ReadOnlyPort, "read-only-port", c.ReadOnlyPort, "The read-only port for the Kubelet to serve on with no authentication/authorization (set to 0 to disable)")
oomichi commented 6 years ago

Investigate how to set the read-only-port option on the kubelet.

Startup options of the existing kubelet process:

root      1002     1  8 01:34 ?        01:19:57 /usr/bin/kubelet
 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
 --kubeconfig=/etc/kubernetes/kubelet.conf
 --config=/var/lib/kubelet/config.yaml
 --cgroup-driver=cgroupfs
 --cni-bin-dir=/opt/cni/bin
 --cni-conf-dir=/etc/cni/net.d
 --network-plugin=cni

Modify config.yaml as follows and reboot the system:

--- /var/lib/kubelet/config.yaml.orig    2018-08-02 16:57:23.865340698 +0000
+++ /var/lib/kubelet/config.yaml 2018-08-02 17:14:29.502421376 +0000
@@ -57,6 +57,7 @@
 oomScoreAdj: -999
 podPidsLimit: -1
 port: 10250
+readOnlyPort: 10255
 registryBurst: 10
 registryPullQPS: 5
 resolvConf: /etc/resolv.conf

After the reboot, check the listening ports → confirmed that 10255 is now listening:

$ sudo netstat -anp | grep 102 | grep kubelet
tcp        0      0 127.0.0.1:35881         0.0.0.0:*               LISTEN      1020/kubelet
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      1020/kubelet
tcp6       0      0 :::10250                :::*                    LISTEN      1020/kubelet
tcp6       0      0 :::10255                :::*                    LISTEN      1020/kubelet

The metrics-server log no longer shows connection failures, but metrics are still reported as missing:

I0802 17:19:17.085093       1 reststorage.go:140] No metrics for container nginx in pod default/nginx-5b669f8784-9m2tr
I0802 17:19:17.085209       1 reststorage.go:93] No metrics for pod default/nginx-5b669f8784-9m2tr
I0802 17:19:47.202876       1 reststorage.go:140] No metrics for container nginx in pod default/nginx-5b669f8784-9m2tr
I0802 17:19:47.202939       1 reststorage.go:93] No metrics for pod default/nginx-5b669f8784-9m2tr

Is this because node01 has not been configured yet?

$ kubectl get pod -o=wide
NAME                     READY     STATUS    RESTARTS   AGE       IP           NODE
nginx-5b669f8784-9m2tr   1/1       Running   0          15h       10.244.1.6   k8s-node01

After applying the same setting on node01 as well, metrics-server stopped emitting errors and the <unknown> target was resolved:

$ kubectl get hpa
NAME      REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
nginx     Deployment/nginx   0%/50%    1         3         1          15h

Re-run the e2e test that was failing. → It passes now:

$ go run hack/e2e.go -- --provider=skeleton --test --test_args="--ginkgo.focus=1\spod\sto\s2\spods"
..
• [SLOW TEST:133.923 seconds]
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
  [sig-autoscaling] ReplicationController light
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
    Should scale from 1 pod to 2 pods
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:70
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSAug  2 17:33:13.282: INFO: Running AfterSuite actions on all node
Aug  2 17:33:13.282: INFO: Running AfterSuite actions on node 1

Ran 1 of 999 Specs in 134.000 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 998 Skipped PASS

Ginkgo ran 1 suite in 2m14.312258357s
Test Suite Passed
2018/08/02 17:33:13 process.go:155: Step './hack/ginkgo-e2e.sh --ginkgo.focus=1\spod\sto\s2\spods' finished in 2m14.533305342s
2018/08/02 17:33:13 e2e.go:83: Done

kubectl top works now as well:

$ kubectl top pod
NAME                     CPU(cores)   MEMORY(bytes)
nginx-5b669f8784-9m2tr   0m           3Mi
oomichi commented 6 years ago

Improvement points in Kubernetes itself related to this issue.

oomichi commented 6 years ago

No good: still only 1 of the 8 tests passes.

• Failure [960.002 seconds]
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
  [sig-autoscaling] ReplicationController light
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
    Should scale from 2 pods to 1 pod [It]
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:82

    timeout waiting 15m0s for 1 replicas
    Expected error:
        <*errors.errorString | 0xc420085540>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    not to have occurred

    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSAug  2 20:34:37.432: INFO: Running AfterSuite actions on all node
Aug  2 20:34:37.432: INFO: Running AfterSuite actions on node 1

Summarizing 7 Failures:

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] Deployment [It] Should scale from 5 pods to 3 pods and from 3 to 1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicaSet [It] Should scale from 5 pods to 3 pods and from 3 to 1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController [It] Should scale from 1 pod to 3 pods and from 3 to 5 and verify decision stability
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController [It] Should scale from 5 pods to 3 pods and from 3 to 1 and verify decision stability
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] Deployment [It] Should scale from 1 pod to 3 pods and from 3 to 5
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicaSet [It] Should scale from 1 pod to 3 pods and from 3 to 5
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] ReplicationController light [It] Should scale from 2 pods to 1 pod
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

Ran 8 of 999 Specs in 6821.750 seconds
FAIL! -- 1 Passed | 7 Failed | 0 Pending | 991 Skipped --- FAIL: TestE2E (6821.77s)
FAIL

Ginkgo ran 1 suite in 1h53m42.043036632s
Test Suite Failed
!!! Error in ./hack/ginkgo-e2e.sh:143
  Error in ./hack/ginkgo-e2e.sh:143. '"${ginkgo}" "${ginkgo_args[@]:+${ginkgo_args[@]}}" "${e2e_test}" -- "${auth_config[@]:+${auth_config[@]}}" --ginkgo.flakeAttempts="${FLAKE_ATTEMPTS}" --host="${KUBE_MASTER_URL}" --provider="${KUBERNETES_PROVIDER}" --gce-project="${PROJECT:-}" --gce-zone="${ZONE:-}" --gce-region="${REGION:-}" --gce-multizone="${MULTIZONE:-false}" --gke-cluster="${CLUSTER_NAME:-}" --kube-master="${KUBE_MASTER:-}" --cluster-tag="${CLUSTER_ID:-}" --cloud-config-file="${CLOUD_CONFIG:-}" --repo-root="${KUBE_ROOT}" --node-instance-group="${NODE_INSTANCE_GROUP:-}" --prefix="${KUBE_GCE_INSTANCE_PREFIX:-e2e}" --network="${KUBE_GCE_NETWORK:-${KUBE_GKE_NETWORK:-e2e}}" --node-tag="${NODE_TAG:-}" --master-tag="${MASTER_TAG:-}" --cluster-monitoring-mode="${KUBE_ENABLE_CLUSTER_MONITORING:-standalone}" --prometheus-monitoring="${KUBE_ENABLE_PROMETHEUS_MONITORING:-false}" ${KUBE_CONTAINER_RUNTIME:+"--container-runtime=${KUBE_CONTAINER_RUNTIME}"} ${MASTER_OS_DISTRIBUTION:+"--master-os-distro=${MASTER_OS_DISTRIBUTION}"} ${NODE_OS_DISTRIBUTION:+"--node-os-distro=${NODE_OS_DISTRIBUTION}"} ${NUM_NODES:+"--num-nodes=${NUM_NODES}"} ${E2E_REPORT_DIR:+"--report-dir=${E2E_REPORT_DIR}"} ${E2E_REPORT_PREFIX:+"--report-prefix=${E2E_REPORT_PREFIX}"} "${@:-}"' exited with status 1
Call stack:
  1: ./hack/ginkgo-e2e.sh:143 main(...)
Exiting with status 1
2018/08/02 20:34:37 process.go:155: Step './hack/ginkgo-e2e.sh --ginkgo.focus=\[HPA\]' finished in 1h53m42.253120653s
2018/08/02 20:34:37 main.go:309: Something went wrong: encountered 1 errors: [error during ./hack/ginkgo-e2e.sh --ginkgo.focus=\[HPA\]: exit status 1]
2018/08/02 20:34:37 e2e.go:81: err: exit status 1
exit status 1
oomichi commented 6 years ago

Looking at the failed tests, are they the ones that involve scale-down? Run the simplest one, "Should scale from 2 pods to 1 pod", on its own:

$ go run hack/e2e.go -- --provider=skeleton --test --test_args="--ginkgo.focus=2\spods\sto\s1\spod"
...
Aug  2 20:54:43.663: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 ht-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMet
Aug  2 20:54:43.665: INFO: RC rc-light: sending request to consume 0 MB
Aug  2 20:54:43.665: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/nroxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  2 20:54:50.470: INFO: RC rc-light: sending request to consume 50 millicores
Aug  2 20:54:50.470: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/nroxy/ConsumeCPU  false durationSec=30&millicores=50&requestSizeMillicores=20 }
Aug  2 20:54:53.038: INFO: waiting for 1 replicas (current: 2)
Aug  2 20:55:13.041: INFO: waiting for 1 replicas (current: 2)
Aug  2 20:55:13.692: INFO: RC rc-light: sending request to consume 0 of custom m
Aug  2 20:55:13.692: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 ht-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMet
Aug  2 20:55:13.696: INFO: RC rc-light: sending request to consume 0 MB
Aug  2 20:55:13.696: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/nroxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  2 20:55:20.546: INFO: RC rc-light: sending request to consume 50 millicores
Aug  2 20:55:20.546: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/nroxy/ConsumeCPU  false durationSec=30&millicores=50&requestSizeMillicores=20 }
Aug  2 20:55:33.043: INFO: waiting for 1 replicas (current: 2)
Aug  2 20:55:43.700: INFO: RC rc-light: sending request to consume 0 of custom m
Aug  2 20:55:43.700: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 ht-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMet
Aug  2 20:55:43.703: INFO: RC rc-light: sending request to consume 0 MB
Aug  2 20:55:43.703: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/nroxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  2 20:55:50.630: INFO: RC rc-light: sending request to consume 50 millic
Aug  2 20:55:50.630: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/nroxy/ConsumeCPU  false durationSec=30&millicores=50&requestSizeMillicores=20 }
Aug  2 20:55:53.044: INFO: waiting for 1 replicas (current: 2)

Aug  2 20:56:13.042: INFO: waiting for 1 replicas (current: 2)
Aug  2 20:56:13.053: INFO: waiting for 1 replicas (current: 2)
Aug  2 20:56:13.053: INFO: Unexpected error occurred: timed out waiting for the condition
STEP: Removing consuming RC rc-light
Aug  2 20:56:13.076: INFO: RC rc-light: stopping CPU consumer
Aug  2 20:56:13.076: INFO: RC rc-light: stopping metric consumer
Aug  2 20:56:13.076: INFO: RC rc-light: stopping mem consumer
STEP: deleting { ReplicationController} rc-light in namespace e2e-tests-horizontal-pod-autoscaling-gk25p, will wait for the garbage collector to delete the pods
Aug  2 20:56:23.165: INFO: Deleting { ReplicationController} rc-light took: 27.919684ms
Aug  2 20:56:23.266: INFO: Terminating { ReplicationController} rc-light pods took: 100.554288ms
STEP: deleting { ReplicationController} rc-light-ctrl in namespace e2e-tests-horizontal-pod-autoscaling-gk25p, will wait for the garbage collector to delete the pods
Aug  2 20:56:31.089: INFO: Deleting { ReplicationController} rc-light-ctrl took: 7.49459ms
Aug  2 20:56:31.190: INFO: Terminating { ReplicationController} rc-light-ctrl pods took: 100.487127ms
[AfterEach] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU)
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:142
STEP: Collecting events from namespace "e2e-tests-horizontal-pod-autoscaling-gk25p".
STEP: Found 20 events.
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:47 +0000 UTC - event for rc-light: {replication-controller } SuccessfulCreate: Created pod: rc-light-s8q7k
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:47 +0000 UTC - event for rc-light: {replication-controller } SuccessfulCreate: Created pod: rc-light-8r7zs
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:47 +0000 UTC - event for rc-light-8r7zs: {default-scheduler } Scheduled: Successfully assigned e2e-tests-horizontal-pod-autoscaling-gk25p/rc-light-8r7zs to k8s-node01
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:47 +0000 UTC - event for rc-light-s8q7k: {default-scheduler } Scheduled: Successfully assigned e2e-tests-horizontal-pod-autoscaling-gk25p/rc-light-s8q7k to k8s-master
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:49 +0000 UTC - event for rc-light-8r7zs: {kubelet k8s-node01} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/resource-consumer-amd64:1.3" already present on machine
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:49 +0000 UTC - event for rc-light-8r7zs: {kubelet k8s-node01} Created: Created container
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:50 +0000 UTC - event for rc-light-8r7zs: {kubelet k8s-node01} Started: Started container
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:50 +0000 UTC - event for rc-light-s8q7k: {kubelet k8s-master} Started: Started container
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:50 +0000 UTC - event for rc-light-s8q7k: {kubelet k8s-master} Created: Created container
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:50 +0000 UTC - event for rc-light-s8q7k: {kubelet k8s-master} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/resource-consumer-amd64:1.3" already present on machine
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:57 +0000 UTC - event for rc-light-ctrl: {replication-controller } SuccessfulCreate: Created pod: rc-light-ctrl-d76tm
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:40:57 +0000 UTC - event for rc-light-ctrl-d76tm: {default-scheduler } Scheduled: Successfully assigned e2e-tests-horizontal-pod-autoscaling-gk25p/rc-light-ctrl-d76tm to k8s-node01
Aug  2 20:56:33.178: INFO: At 2018-08-02 20:41:00 +0000 UTC - event for rc-light-ctrl-d76tm: {kubelet k8s-node01} Started: Started container
Aug  2 20:56:33.179: INFO: At 2018-08-02 20:41:00 +0000 UTC - event for rc-light-ctrl-d76tm: {kubelet k8s-node01} Created: Created container
Aug  2 20:56:33.179: INFO: At 2018-08-02 20:41:00 +0000 UTC - event for rc-light-ctrl-d76tm: {kubelet k8s-node01} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/resource-consumer/controller-amd64:1.0" already present on machine
Aug  2 20:56:33.179: INFO: At 2018-08-02 20:41:43 +0000 UTC - event for rc-light: {horizontal-pod-autoscaler } FailedGetResourceMetric: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Aug  2 20:56:33.179: INFO: At 2018-08-02 20:41:43 +0000 UTC - event for rc-light: {horizontal-pod-autoscaler } FailedComputeMetricsReplicas: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Aug  2 20:56:33.179: INFO: At 2018-08-02 20:56:23 +0000 UTC - event for rc-light-8r7zs: {kubelet k8s-node01} Killing: Killing container with id docker://rc-light:Need to kill Pod
Aug  2 20:56:33.179: INFO: At 2018-08-02 20:56:23 +0000 UTC - event for rc-light-s8q7k: {kubelet k8s-master} Killing: Killing container with id docker://rc-light:Need to kill Pod
Aug  2 20:56:33.179: INFO: At 2018-08-02 20:56:31 +0000 UTC - event for rc-light-ctrl-d76tm: {kubelet k8s-node01} Killing: Killing container with id docker://rc-light-ctrl:Need to kill Pod
Aug  2 20:56:33.191: INFO: POD                                 NODE        PHASE    GRACE  CONDITIONS
Aug  2 20:56:33.191: INFO: nginx-5b669f8784-9m2tr              k8s-node01  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 01:39:40 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:23:48 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 01:39:40 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: coredns-78fcdf6894-gt7zj            k8s-master  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-07-17 17:12:52 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:22 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-07-17 17:12:52 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: coredns-78fcdf6894-tk6ch            k8s-master  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-07-17 17:12:52 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:22 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-07-17 17:12:52 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: etcd-k8s-master                     k8s-master  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:04 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:05 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:04 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: kube-apiserver-k8s-master           k8s-master  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:04 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:05 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:04 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: kube-controller-manager-k8s-master  k8s-master  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:04 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:05 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:04 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: kube-flannel-ds-7df6r               k8s-master  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:12:09 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:18 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-07-17 17:12:31 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: kube-flannel-ds-kl4q4               k8s-node01  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-07-31 23:01:09 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:23:43 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-07-31 23:01:01 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: kube-proxy-hxp7z                    k8s-node01  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-07-31 23:08:51 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:23:42 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-07-31 23:08:51 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: kube-proxy-zwrl4                    k8s-master  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-07-31 23:08:37 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:17 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-07-31 23:08:37 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: kube-scheduler-k8s-master           k8s-master  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:04 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:05 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:17:04 +0000 UTC  }]
Aug  2 20:56:33.191: INFO: metrics-server-86bd9d7667-ltlpz     k8s-node01  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 01:28:14 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 17:23:49 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2018-08-02 01:28:14 +0000 UTC  }]
Aug  2 20:56:33.191: INFO:
Aug  2 20:56:33.195: INFO:
Logging node info for node k8s-master
Aug  2 20:56:33.199: INFO: Node Info: &Node{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:k8s-master,GenerateName:,Namespace:,SelfLink:/api/v1/nodes/k8s-master,UID:94f19db7-89e3-11e8-b234-fa163e420595,ResourceVersion:1648390,Generation:0,CreationTimestamp:2018-07-17 17:05:18 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{beta.kubernetes.io/arch: amd64,beta.kubernetes.io/os: linux,kubernetes.io/hostname: k8s-master,node-role.kubernetes.io/master: ,},Annotations:map[string]string{flannel.alpha.coreos.com/backend-data: {"VtepMAC":"1a:9d:81:1e:9d:0f"},flannel.alpha.coreos.com/backend-type: vxlan,flannel.alpha.coreos.com/kube-subnet-manager: true,flannel.alpha.coreos.com/public-ip: 192.168.1.108,kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock,node.alpha.kubernetes.io/ttl: 0,volumes.kubernetes.io/controller-managed-attach-detach: true,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Spec:NodeSpec{PodCIDR:10.244.0.0/24,DoNotUse_ExternalID:,ProviderID:,Unschedulable:false,Taints:[],ConfigSource:nil,},Status:NodeStatus{Capacity:ResourceList{cpu: {{2 0} {<nil>} 2 DecimalSI},ephemeral-storage: {{41567956992 0} {<nil>} 40593708Ki BinarySI},hugepages-1Gi: {{0 0} {<nil>} 0 DecimalSI},hugepages-2Mi: {{0 0} {<nil>} 0 DecimalSI},memory: {{4143394816 0} {<nil>} 4046284Ki BinarySI},pods: {{110 0} {<nil>} 110 DecimalSI},},Allocatable:ResourceList{cpu: {{2 0} {<nil>} 2 DecimalSI},ephemeral-storage: {{37411161231 0} {<nil>} 37411161231 DecimalSI},hugepages-1Gi: {{0 0} {<nil>} 0 DecimalSI},hugepages-2Mi: {{0 0} {<nil>} 0 DecimalSI},memory: {{4038537216 0} {<nil>} 3943884Ki BinarySI},pods: {{110 0} {<nil>} 110 DecimalSI},},Phase:,Conditions:[{OutOfDisk False 2018-08-02 20:56:31 +0000 UTC 2018-07-17 17:05:14 +0000 UTC KubeletHasSufficientDisk kubelet has sufficient disk space available} {MemoryPressure False 2018-08-02 20:56:31 +0000 UTC 2018-07-17 17:05:14 +0000 UTC KubeletHasSufficientMemory kubelet has sufficient memory available} {DiskPressure False 2018-08-02 20:56:31 +0000 UTC 2018-07-17 17:05:14 +0000 UTC KubeletHasNoDiskPressure kubelet has no disk pressure} {PIDPressure False 2018-08-02 20:56:31 +0000 UTC 2018-07-17 17:05:14 +0000 UTC KubeletHasSufficientPID kubelet has sufficient PID available} {Ready True 2018-08-02 20:56:31 +0000 UTC 2018-07-31 23:04:27 +0000 UTC KubeletReady kubelet is posting ready status. 
AppArmor enabled}],Addresses:[{InternalIP 192.168.1.108} {Hostname k8s-master}],DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:10250,},},NodeInfo:NodeSystemInfo{MachineID:1db2c06c39a54cd3a93a4e0a44823fd6,SystemUUID:1DB2C06C-39A5-4CD3-A93A-4E0A44823FD6,BootID:b14bfb61-a0cc-45f4-8b29-42ee08c00ac6,KernelVersion:4.4.0-130-generic,OSImage:Ubuntu 16.04.5 LTS,ContainerRuntimeVersion:docker://1.11.2,KubeletVersion:v1.11.1,KubeProxyVersion:v1.11.1,OperatingSystem:linux,Architecture:amd64,},Images:[{[golang:1.10] 793901893} {[k8s.gcr.io/etcd-amd64:3.2.18] 218904307} {[k8s.gcr.io/kube-apiserver-amd64:v1.11.1] 186675825} {[k8s.gcr.io/kube-apiserver-amd64:v1.11.0] 186617744} {[k8s.gcr.io/kube-controller-manager-amd64:v1.11.1] 155252555} {[k8s.gcr.io/kube-controller-manager-amd64:v1.11.0] 155203118} {[nginx:latest] 108975101} {[k8s.gcr.io/kube-proxy-amd64:v1.11.1] 97776424} {[k8s.gcr.io/kube-proxy-amd64:v1.11.0] 97772373} {[k8s.gcr.io/kube-scheduler-amd64:v1.11.1] 56781436} {[k8s.gcr.io/kube-scheduler-amd64:v1.11.0] 56757023} {[quay.io/coreos/flannel:v0.9.1-amd64] 51338831} {[gcr.io/kubernetes-e2e-test-images/resource-consumer-amd64:1.3] 49707607} {[k8s.gcr.io/coredns:1.1.3] 45587362} {[gcr.io/kubernetes-e2e-test-images/resource-consumer/controller-amd64:1.0] 5902947} {[k8s.gcr.io/pause-amd64:3.1 k8s.gcr.io/pause:3.1] 742472}],VolumesInUse:[],VolumesAttached:[],Config:nil,},}
Aug  2 20:56:33.199: INFO:
Logging kubelet events for node k8s-master
Aug  2 20:56:33.202: INFO:
Logging pods the kubelet thinks is on node k8s-master
Aug  2 20:56:33.211: INFO: kube-apiserver-k8s-master started at <nil> (0+0 container statuses recorded)
Aug  2 20:56:33.211: INFO: coredns-78fcdf6894-gt7zj started at 2018-07-17 17:12:52 +0000 UTC (0+1 container statuses recorded)
Aug  2 20:56:33.212: INFO:      Container coredns ready: true, restart count 5
Aug  2 20:56:33.212: INFO: coredns-78fcdf6894-tk6ch started at 2018-07-17 17:12:52 +0000 UTC (0+1 container statuses recorded)
Aug  2 20:56:33.212: INFO:      Container coredns ready: true, restart count 5
Aug  2 20:56:33.213: INFO: kube-flannel-ds-7df6r started at 2018-07-17 17:12:31 +0000 UTC (1+1 container statuses recorded)
Aug  2 20:56:33.213: INFO:      Init container install-cni ready: true, restart count 5
Aug  2 20:56:33.213: INFO:      Container kube-flannel ready: true, restart count 5
Aug  2 20:56:33.213: INFO: kube-scheduler-k8s-master started at <nil> (0+0 container statuses recorded)
Aug  2 20:56:33.214: INFO: etcd-k8s-master started at <nil> (0+0 container statuses recorded)
Aug  2 20:56:33.214: INFO: kube-proxy-zwrl4 started at 2018-07-31 23:08:37 +0000 UTC (0+1 container statuses recorded)
Aug  2 20:56:33.214: INFO:      Container kube-proxy ready: true, restart count 5
Aug  2 20:56:33.214: INFO: kube-controller-manager-k8s-master started at <nil> (0+0 container statuses recorded)
Aug  2 20:56:33.243: INFO:
Latency metrics for node k8s-master
Aug  2 20:56:33.244: INFO:
Logging node info for node k8s-node01
Aug  2 20:56:33.247: INFO: Node Info: &Node{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:k8s-node01,GenerateName:,Namespace:,SelfLink:/api/v1/nodes/k8s-node01,UID:980d8d67-9515-11e8-a804-fa163e420595,ResourceVersion:1648376,Generation:0,CreationTimestamp:2018-07-31 23:01:01 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{beta.kubernetes.io/arch: amd64,beta.kubernetes.io/os: linux,kubernetes.io/hostname: k8s-node01,},Annotations:map[string]string{flannel.alpha.coreos.com/backend-data: {"VtepMAC":"a6:73:46:9f:a3:63"},flannel.alpha.coreos.com/backend-type: vxlan,flannel.alpha.coreos.com/kube-subnet-manager: true,flannel.alpha.coreos.com/public-ip: 192.168.1.109,kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock,node.alpha.kubernetes.io/ttl: 0,volumes.kubernetes.io/controller-managed-attach-detach: true,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Spec:NodeSpec{PodCIDR:10.244.1.0/24,DoNotUse_ExternalID:,ProviderID:,Unschedulable:false,Taints:[],ConfigSource:nil,},Status:NodeStatus{Capacity:ResourceList{cpu: {{2 0} {<nil>} 2 DecimalSI},ephemeral-storage: {{41567956992 0} {<nil>} 40593708Ki BinarySI},hugepages-1Gi: {{0 0} {<nil>} 0 DecimalSI},hugepages-2Mi: {{0 0} {<nil>} 0 DecimalSI},memory: {{4143394816 0} {<nil>} 4046284Ki BinarySI},pods: {{110 0} {<nil>} 110 DecimalSI},},Allocatable:ResourceList{cpu: {{2 0} {<nil>} 2 DecimalSI},ephemeral-storage: {{37411161231 0} {<nil>} 37411161231 DecimalSI},hugepages-1Gi: {{0 0} {<nil>} 0 DecimalSI},hugepages-2Mi: {{0 0} {<nil>} 0 DecimalSI},memory: {{4038537216 0} {<nil>} 3943884Ki BinarySI},pods: {{110 0} {<nil>} 110 DecimalSI},},Phase:,Conditions:[{OutOfDisk False 2018-08-02 20:56:28 +0000 UTC 2018-07-31 23:01:01 +0000 UTC KubeletHasSufficientDisk kubelet has sufficient disk space available} {MemoryPressure False 2018-08-02 20:56:28 +0000 UTC 2018-07-31 23:01:01 +0000 UTC KubeletHasSufficientMemory kubelet has sufficient memory available} {DiskPressure False 2018-08-02 20:56:28 +0000 UTC 2018-07-31 23:01:01 +0000 UTC KubeletHasNoDiskPressure kubelet has no disk pressure} {PIDPressure False 2018-08-02 20:56:28 +0000 UTC 2018-07-31 23:01:01 +0000 UTC KubeletHasSufficientPID kubelet has sufficient PID available} {Ready True 2018-08-02 20:56:28 +0000 UTC 2018-07-31 23:01:11 +0000 UTC KubeletReady kubelet is posting ready status. AppArmor enabled}],Addresses:[{InternalIP 192.168.1.109} {Hostname k8s-node01}],DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:10250,},},NodeInfo:NodeSystemInfo{MachineID:817a385b9de241668e47cd87cda24f47,SystemUUID:817A385B-9DE2-4166-8E47-CD87CDA24F47,BootID:f47e3b62-dd98-42c4-91e2-9b657f85dfd3,KernelVersion:4.4.0-130-generic,OSImage:Ubuntu 16.04.4 LTS,ContainerRuntimeVersion:docker://1.11.2,KubeletVersion:v1.11.1,KubeProxyVersion:v1.11.1,OperatingSystem:linux,Architecture:amd64,},Images:[{[nginx:latest] 108975101} {[k8s.gcr.io/kube-proxy-amd64:v1.11.1] 97776424} {[k8s.gcr.io/kube-proxy-amd64:v1.11.0] 97772373} {[quay.io/coreos/flannel:v0.9.1-amd64] 51338831} {[gcr.io/kubernetes-e2e-test-images/resource-consumer-amd64:1.3] 49707607} {[gcr.io/google_containers/metrics-server-amd64:v0.2.1] 42541759} {[gcr.io/kubernetes-e2e-test-images/resource-consumer/controller-amd64:1.0] 5902947} {[k8s.gcr.io/pause:3.1] 742472}],VolumesInUse:[],VolumesAttached:[],Config:nil,},}
Aug  2 20:56:33.247: INFO:
Logging kubelet events for node k8s-node01
Aug  2 20:56:33.250: INFO:
Logging pods the kubelet thinks is on node k8s-node01
Aug  2 20:56:33.266: INFO: nginx-5b669f8784-9m2tr started at 2018-08-02 01:39:40 +0000 UTC (0+1 container statuses recorded)
Aug  2 20:56:33.266: INFO:      Container nginx ready: true, restart count 1
Aug  2 20:56:33.267: INFO: kube-flannel-ds-kl4q4 started at 2018-07-31 23:01:01 +0000 UTC (1+1 container statuses recorded)
Aug  2 20:56:33.267: INFO:      Init container install-cni ready: true, restart count 1
Aug  2 20:56:33.267: INFO:      Container kube-flannel ready: true, restart count 2
Aug  2 20:56:33.267: INFO: kube-proxy-hxp7z started at 2018-07-31 23:08:51 +0000 UTC (0+1 container statuses recorded)
Aug  2 20:56:33.268: INFO:      Container kube-proxy ready: true, restart count 1
Aug  2 20:56:33.268: INFO: metrics-server-86bd9d7667-ltlpz started at 2018-08-02 01:28:14 +0000 UTC (0+1 container statuses recorded)
Aug  2 20:56:33.268: INFO:      Container metrics-server ready: true, restart count 1
Aug  2 20:56:33.345: INFO:
Latency metrics for node k8s-node01
STEP: Dumping a list of prepulled images on each node...
Aug  2 20:56:33.363: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
STEP: Destroying namespace "e2e-tests-horizontal-pod-autoscaling-gk25p" for this suite.

Aug  2 20:56:39.438: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Aug  2 20:56:39.533: INFO: namespace: e2e-tests-horizontal-pod-autoscaling-gk25p, resource: bindings, ignored listing per whitelist
Aug  2 20:56:39.577: INFO: namespace e2e-tests-horizontal-pod-autoscaling-gk25p deletion completed in 6.201374321s

• Failure [951.844 seconds]
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
  [sig-autoscaling] ReplicationController light
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
    Should scale from 2 pods to 1 pod [It]
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:82

    timeout waiting 15m0s for 1 replicas
    Expected error:
        <*errors.errorString | 0xc4201050d0>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    not to have occurred

    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSAug  2 20:56:39.588: INFO: Running AfterSuite actions on all node
Aug  2 20:56:39.588: INFO: Running AfterSuite actions on node 1

Summarizing 1 Failure:

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] ReplicationController light [It] Should scale from 2 pods to 1 pod
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

Ran 1 of 999 Specs in 951.939 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 998 Skipped --- FAIL: TestE2E (951.96s)
FAIL

Ginkgo ran 1 suite in 15m52.197860762s
Test Suite Failed
!!! Error in ./hack/ginkgo-e2e.sh:143
  Error in ./hack/ginkgo-e2e.sh:143. '"${ginkgo}" "${ginkgo_args[@]:+${ginkgo_args[@]}}" "${e2e_test}" -- "${auth_config[@]:+${auth_config[@]}}" --ginkgo.flakeAttempts="${FLAKE_ATTEMPTS}" --host="${KUBE_MASTER_URL}" --provider="${KUBERNETES_PROVIDER}" --gce-project="${PROJECT:-}" --gce-zone="${ZONE:-}" --gce-region="${REGION:-}" --gce-multizone="${MULTIZONE:-false}" --gke-cluster="${CLUSTER_NAME:-}" --kube-master="${KUBE_MASTER:-}" --cluster-tag="${CLUSTER_ID:-}" --cloud-config-file="${CLOUD_CONFIG:-}" --repo-root="${KUBE_ROOT}" --node-instance-group="${NODE_INSTANCE_GROUP:-}" --prefix="${KUBE_GCE_INSTANCE_PREFIX:-e2e}" --network="${KUBE_GCE_NETWORK:-${KUBE_GKE_NETWORK:-e2e}}" --node-tag="${NODE_TAG:-}" --master-tag="${MASTER_TAG:-}" --cluster-monitoring-mode="${KUBE_ENABLE_CLUSTER_MONITORING:-standalone}" --prometheus-monitoring="${KUBE_ENABLE_PROMETHEUS_MONITORING:-false}" ${KUBE_CONTAINER_RUNTIME:+"--container-runtime=${KUBE_CONTAINER_RUNTIME}"} ${MASTER_OS_DISTRIBUTION:+"--master-os-distro=${MASTER_OS_DISTRIBUTION}"} ${NODE_OS_DISTRIBUTION:+"--node-os-distro=${NODE_OS_DISTRIBUTION}"} ${NUM_NODES:+"--num-nodes=${NUM_NODES}"} ${E2E_REPORT_DIR:+"--report-dir=${E2E_REPORT_DIR}"} ${E2E_REPORT_PREFIX:+"--report-prefix=${E2E_REPORT_PREFIX}"} "${@:-}"' exited with status 1
Call stack:
  1: ./hack/ginkgo-e2e.sh:143 main(...)
Exiting with status 1
2018/08/02 20:56:39 process.go:155: Step './hack/ginkgo-e2e.sh --ginkgo.focus=2\spods\sto\s1\spod' finished in 15m52.41101064s
2018/08/02 20:56:39 main.go:309: Something went wrong: encountered 1 errors: [error during ./hack/ginkgo-e2e.sh --ginkgo.focus=2\spods\sto\s1\spod: exit status 1]
2018/08/02 20:56:39 e2e.go:81: err: exit status 1
exit status 1

Looking at the HPA while the test is running, its status is unknown:

$ kubectl get hpa --all-namespaces
NAMESPACE                                    NAME       REFERENCE                        TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-6j9j8   rc-light   ReplicationController/rc-light   <unknown>/50%   1         2         0          18s

Partway through the test, the metric became available:

$ kubectl get hpa --all-namespaces
NAMESPACE                                    NAME       REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-6j9j8   rc-light   ReplicationController/rc-light   52%/50%   1         2         2          2m

The CPU utilization sits right at the point where it is unclear whether the HPA will scale down:

$ kubectl get hpa --all-namespaces
NAMESPACE                                    NAME       REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-6j9j8   rc-light   ReplicationController/rc-light   49%/50%   1         2         2          12m

HPA details:

$ kubectl describe hpa rc-light -n e2e-tests-horizontal-pod-autoscaling-6j9j8
Name:                                                  rc-light
Namespace:                                             e2e-tests-horizontal-pod-autoscaling-6j9j8
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Thu, 02 Aug 2018 20:59:22 +0000
Reference:                                             ReplicationController/rc-light
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  52% (105m) / 50%
Min replicas:                                          1
Max replicas:                                          2
ReplicationController pods:                            2 current / 2 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type     Reason                        Age   From                       Message
  ----     ------                        ----  ----                       -------
  Warning  FailedGetResourceMetric       3m    horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  3m    horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
oomichi commented 6 years ago

Investigate how the e2e tests apply CPU load (summary).

oomichi commented 6 years ago

Confirm whether the HPA is supposed to scale down when the CPU load equals the CPU utilization threshold.

The HPA algorithm: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md#autoscaling-algorithm

The target number of pods is calculated from the following formula:

  TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)

With the formula above, when the CPU load sits right at the utilization threshold and the load tool's output fluctuates (some fluctuation is normal for a load tool), the HPA never scales down. Submitted as k/k pull/66950.

In the full run at https://github.com/oomichi/try-kubernetes/issues/36, this scale-down test passed; since the outcome depends on the load tool's marginal behavior, it can succeed some of the time.
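
To make the scale-down concern concrete, here is a minimal Go sketch of the formula above, plugged with the numbers from this run (a 50% target, two pods hovering between 49% and 52%). The helper and the sample utilizations are illustrative only, not taken from the autoscaler code.

package main

import (
	"fmt"
	"math"
)

// targetNumOfPods implements TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target).
func targetNumOfPods(podUtilization []float64, targetPercent float64) int {
	sum := 0.0
	for _, u := range podUtilization {
		sum += u
	}
	return int(math.Ceil(sum / targetPercent))
}

func main() {
	target := 50.0 // the light test's target CPU utilization (%)

	// Slightly above the threshold, as in the "52%/50%" reading: ceil(104/50) = 3,
	// which the test's maxPods of 2 clamps back down to 2.
	fmt.Println(targetNumOfPods([]float64{52, 52}, target))

	// Slightly below the threshold, as in the "49%/50%" reading: ceil(98/50) = 2,
	// so the HPA never reaches the expected 1 replica while the load hovers here.
	fmt.Println(targetNumOfPods([]float64{49, 49}, target))

	// Only when per-pod utilization falls well under half the target does it give 1.
	fmt.Println(targetNumOfPods([]float64{24, 24}, target))
}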

oomichi commented 6 years ago

The currently failing HPA test cases are the following:

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] Deployment [It] Should scale from 1 pod to 3 pods and from 3 to 5
[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] Deployment [It] Should scale from 5 pods to 3 pods and from 3 to 1
[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicaSet [It] Should scale from 5 pods to 3 pods and from 3 to 1
[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController [It] Should scale from 1 pod to 3 pods and from 3 to 5 and verify decision stability
[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController [It] Should scale from 5 pods to 3 pods and from 3 to 1 and verify decision stability
oomichi commented 6 years ago

Try the simplest two-stage scale-out test:

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] Deployment [It] Should scale from 1 pod to 3 pods and from 3 to 5
Aug  3 21:30:58.853: INFO: waiting for 3 replicas (current: 1)
Aug  3 21:31:18.859: INFO: waiting for 3 replicas (current: 1)
Aug  3 21:31:18.876: INFO: RC test-deployment: sending request to consume 0 MB
Aug  3 21:31:18.876: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-t5rkx/services/test-deployment-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  3 21:31:19.095: INFO: RC test-deployment: sending request to consume 250 millicores
Aug  3 21:31:19.095: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-t5rkx/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  3 21:31:19.095: INFO: RC test-deployment: sending request to consume 0 of custom metric QPS
Aug  3 21:31:19.096: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-t5rkx/services/test-deployment-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
Aug  3 21:31:38.855: INFO: waiting for 3 replicas (current: 4)
...
Aug  3 21:35:49.043: INFO: RC test-deployment: sending request to consume 0 MB
Aug  3 21:35:49.043: INFO: ConsumeMem URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-t5rkx/services/test-deployment-ctrl/proxy/ConsumeMem  false durationSec=30&megabytes=0&requestSizeMegabytes=100 }
Aug  3 21:35:49.380: INFO: RC test-deployment: sending request to consume 0 of custom metric QPS
Aug  3 21:35:49.380: INFO: ConsumeCustomMetric URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-t5rkx/services/test-deployment-ctrl/proxy/BumpMetric  false delta=0&durationSec=30&metric=QPS&requestSizeMetrics=10 }
Aug  3 21:35:50.577: INFO: RC test-deployment: sending request to consume 250 millicores
Aug  3 21:35:50.577: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-t5rkx/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  3 21:35:58.861: INFO: waiting for 3 replicas (current: 5)
Aug  3 21:36:18.848: INFO: waiting for 3 replicas (current: 5)

In another console:

$ kubectl get hpa --all-namespaces
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-t5rkx   test-deployment   Deployment/test-deployment   95%/20%   1         5         4          2m

If utilization is 95% against a 20% target, it is normal for the pods not to settle at 3. Check the test code that applies the CPU load.

oomichi commented 6 years ago

e2e/autoscaling/horizontal_pod_autoscaling.go

 35         titleUp := "Should scale from 1 pod to 3 pods and from 3 to 5"
 36         titleDown := "Should scale from 5 pods to 3 pods and from 3 to 1"
 37
 38         SIGDescribe("[Serial] [Slow] Deployment", func() {
 39                 // CPU tests via deployments
 40                 It(titleUp, func() {
 41                         scaleUp("test-deployment", common.KindDeployment, false, rc, f)
 42                 })
...
133 func scaleUp(name string, kind schema.GroupVersionKind, checkStability bool, rc *common.ResourceConsumer, f *framework.Framework) {
134         stasis := 0 * time.Minute
135         if checkStability {
136                 stasis = 10 * time.Minute
137         }
138         scaleTest := &HPAScaleTest{
139                 initPods:                    1,
140                 totalInitialCPUUsage:        250,
141                 perPodCPURequest:            500,
142                 targetCPUUtilizationPercent: 20,
143                 minPods:                     1,
144                 maxPods:                     5,
145                 firstScale:                  3,
146                 firstScaleStasis:            stasis,
147                 cpuBurst:                    700,
148                 secondScale:                 5,
149         }
150         scaleTest.run(name, kind, rc, f)
151 }

Check how the CPU utilization appears:

ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:48:54 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   <unknown>/20%   1         5         1          1m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:49:20 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   <unknown>/20%   1         5         1          1m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:49:29 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   98%/20%   1         5         1          1m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:49:34 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   98%/20%   1         5         1          1m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:49:48 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   98%/20%   1         5         1          1m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:49:57 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   98%/20%   1         5         4          2m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:50:13 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   98%/20%   1         5         4          2m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:50:32 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   94%/20%   1         5         4          2m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:50:45 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   94%/20%   1         5         4          2m
ubuntu@k8s-master:~$ date; kubectl get hpa --all-namespaces
Fri Aug  3 22:50:55 UTC 2018
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-qbghv   test-deployment   Deployment/test-deployment   94%/20%   1         5         4          3m

The test sets totalInitialCPUUsage: 250, so investigate how that relates to kubectl showing 98% (while there is one pod).

oomichi commented 6 years ago

totalInitialCPUUsage: 250 is passed to ConsumeCPU() as millicores. As the name suggests, 250 millicores is 0.25 cores (25%). Since the target is 20%, wouldn't that be 2 pods? But the test expects 3. Confusing... Look inside the load container. → The container image in question is resource-consumer; see test/utils/image/manifest.go

 25         e2eRegistry     = "gcr.io/kubernetes-e2e-test-images"
 ..
 91         ResourceConsumer    = ImageConfig{e2eRegistry, "resource-consumer", "1.3", true}

The source is at https://github.com/kubernetes/kubernetes/tree/master/test/images/resource-consumer. It seems the only option is to run this container image and observe how the CPU load comes out... The target Docker image can be run directly with the following command:

$ sudo docker run --name resource-c gcr.io/kubernetes-e2e-test-images/resource-consumer:1.3
Unable to find image 'gcr.io/kubernetes-e2e-test-images/resource-consumer:1.3' locally
1.3: Pulling from kubernetes-e2e-test-images/resource-consumer
d2519f41f710: Pull complete
7bb286ada61f: Pull complete
0878ec3e77c3: Pull complete
137c312171bf: Pull complete
Digest: sha256:38f1007b38bd5ceab098ac24278066068e1cdd9e3eb2a540233b7d1053be2696
Status: Downloaded newer image for gcr.io/kubernetes-e2e-test-images/resource-consumer:1.3

https://github.com/kubernetes/kubernetes/tree/master/test/images/resource-consumer also describes how to run it with kubectl. The instructions are a bit dated, but run it in line with how the e2e tests do. ★ If the --requests parameter is not specified, the usage shows up as unknown and the HPA does not work.

$ kubectl run resource-consumer --image=gcr.io/kubernetes-e2e-test-images/resource-consumer:1.3 --port 8080 --requests='cpu=500m,memory=256Mi'
$ kubectl get pods -o wide
NAME                                 READY     STATUS    RESTARTS   AGE       IP             NODE
resource-consumer-767f57866f-vh4v7   1/1       Running   0          10s       10.244.1.192   k8s-node01
$ kubectl autoscale deployment resource-consumer --min=1 --max=5 --cpu-percent=20
$ kubectl expose deployment resource-consumer
$ kubectl get service
NAME                TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
resource-consumer   ClusterIP   10.109.92.74   <none>        8080/TCP   38s
$
$ curl --data "millicores=250&durationSec=600" http://10.109.92.74:8080/ConsumeCPU
ConsumeCPU
250 millicores
600 durationSec

On iaas-cpu01, consume-cpu is running at around 25%, as specified:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29852 root      20   0    9368   6356   1344 S  25.9  0.2   0:20.92 consume-cpu

So why does kubectl show it at around 50%?

$ kubectl get hpa
NAME                REFERENCE                      TARGETS   MINPODS   MAXPODS     REPLICAS   AGE
resource-consumer   Deployment/resource-consumer   49%/20%   1         5           1          4m

It scaled out to 3 pods and the usage dropped to one third → in other words, the usage shown here is the average usage per pod (a small calculation follows after the output below).

$ kubectl get hpa
NAME                REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
resource-consumer   Deployment/resource-consumer   16%/20%   1         5         3          11m
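
A tiny arithmetic sketch of that observation, using the numbers from the output above; the helper name is mine, not part of Kubernetes.

package main

import "fmt"

// avgUtilizationPercent mirrors what `kubectl get hpa` reports above: the total
// consumed CPU, averaged over the pods, as a percentage of each pod's CPU request.
func avgUtilizationPercent(totalMillicores, numPods, requestMillicores int) float64 {
	return float64(totalMillicores) / float64(numPods) / float64(requestMillicores) * 100
}

func main() {
	// 250m of load, 1 pod, 500m request: ~50%, matching the "49%/20%" reading above.
	fmt.Printf("%.0f%%\n", avgUtilizationPercent(250, 1, 500))
	// The same 250m spread over 3 pods: ~17%, matching the "16%/20%" reading above.
	fmt.Printf("%.0f%%\n", avgUtilizationPercent(250, 3, 500))
}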
oomichi commented 6 years ago

Cleanup

$ kubectl delete hpa resource-consumer; kubectl delete service resource-consumer; kubectl delete deployment resource-consumer

Run workload

$ kubectl run resource-consumer --image=gcr.io/kubernetes-e2e-test-images/resource-consumer:1.3 --port 8080 --requests='cpu=500m,memory=256Mi'
$ kubectl autoscale deployment resource-consumer --min=1 --max=5 --cpu-percent=20
$ kubectl expose deployment resource-consumer
$ ADDRESS=`kubectl get service | grep resource-consumer | awk '{print $3}'`
$ curl --data "millicores=250&durationSec=600" http://${ADDRESS}:8080/ConsumeCPU

Continue checking hpa condition

$ kubectl get hpa
NAME                REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
resource-consumer   Deployment/resource-consumer   49%/20%   1         5         2          4m
$
$ kubectl get pods -o wide
NAME                                 READY     STATUS    RESTARTS   AGE       IP             NODE
resource-consumer-767f57866f-8mb28   1/1       Running   0          1m        10.244.0.183   k8s-master
resource-consumer-767f57866f-w6drh   1/1       Running   0          5m        10.244.1.197   k8s-node01
$ top
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5594 root      20   0    9368   6560   1280 R  25.2  0.2   0:47.85 consume-cpu
oomichi commented 6 years ago

The actual process is at 25%, so why does kubectl get hpa show around 50%? 25% looks correct, since 250 millicores (0.25 cores) was specified. → It shows as 50% for a 250-millicore load because the Deployment was created with --requests='cpu=500m'.

When 500 millicores are specified, the process runs at 50%, so the load tool itself appears to be working correctly:

$ curl --data "millicores=500&durationSec=600" http://${ADDRESS}:8080/ConsumeCPU
...
$ top
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8085 root      20   0    9368   6612   1344 S  50.3  0.2   0:10.11 consume-cpu

In that case the HPA sees it as double, i.e. 100% (a short sketch of the request-denominator relationship follows after the output below):

$ kubectl get hpa
NAME                REFERENCE                      TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
resource-consumer   Deployment/resource-consumer   100%/20%   1         5         1          2m
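
A minimal sketch of that relationship with the numbers from these manual runs; nothing here is taken from the HPA implementation.

package main

import "fmt"

// percentOfRequest is how the HPA reports a pod's CPU usage: consumed millicores
// divided by the pod's CPU request, not by the node's total capacity.
func percentOfRequest(consumedMillicores, requestMillicores int) int {
	return consumedMillicores * 100 / requestMillicores
}

func main() {
	fmt.Println(percentOfRequest(250, 500))  // 50  -> the ~50% readings above
	fmt.Println(percentOfRequest(500, 500))  // 100 -> the "100%/20%" reading above
	fmt.Println(percentOfRequest(250, 1000)) // 25  -> the 1000m-request experiment further below
}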
oomichi commented 6 years ago

Check the logic behind the CPU utilization that the HPA displays.

It is the third element of the rows.cells array returned by the /apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers API:

$ kubectl get hpa -v=16
...
I0801 22:25:54.290584     379 request.go:897] Response Body: {"kind":"Table","apiVersion":"meta.k8s.io/v1beta1",
 "metadata":{"selfLink":"/apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers","resourceVersion":"1542106"},
 "columnDefinitions":[
 ...
 "rows":[
      {"cells":[
           "nginx",
           "Deployment/nginx",
           "\u003cunknown\u003e/50%",
           "1",
           3,
           1,
           "46m"],

pkg/registry/autoscaling/rest/storage_autoscaling.go

 28         horizontalpodautoscalerstore "k8s.io/kubernetes/pkg/registry/autoscaling/horizontalpodautoscaler/storage"
...
 33 func (p RESTStorageProvider) NewRESTStorage(apiResourceConfigSource serverstorage.APIResourceConfigSource, restOptionsGetter generic.RESTOptionsGetter) (    genericapiserver.APIGroupInfo, bool) {
 34         apiGroupInfo := genericapiserver.NewDefaultAPIGroupInfo(autoscaling.GroupName, legacyscheme.Scheme, legacyscheme.ParameterCodec, legacyscheme.Cod    ecs)
 35         // If you add a version here, be sure to add an entry in `k8s.io/kubernetes/cmd/kube-apiserver/app/aggregator.go with specific priorities.
 36         // TODO refactor the plumbing to provide the information in the APIGroupInfo
 37
 38         if apiResourceConfigSource.VersionEnabled(autoscalingapiv2beta1.SchemeGroupVersion) {
 39                 apiGroupInfo.VersionedResourcesStorageMap[autoscalingapiv2beta1.SchemeGroupVersion.Version] = p.v2beta1Storage(apiResourceConfigSource, r    estOptionsGetter)
 40         }
 41         if apiResourceConfigSource.VersionEnabled(autoscalingapiv1.SchemeGroupVersion) {
 42                 apiGroupInfo.VersionedResourcesStorageMap[autoscalingapiv1.SchemeGroupVersion.Version] = p.v1Storage(apiResourceConfigSource, restOptions    Getter)
 43         }
 44
 45         return apiGroupInfo, true
 46 }
 47
 48 func (p RESTStorageProvider) v1Storage(apiResourceConfigSource serverstorage.APIResourceConfigSource, restOptionsGetter generic.RESTOptionsGetter) map[st    ring]rest.Storage {
 49         storage := map[string]rest.Storage{}
 50         // horizontalpodautoscalers
 51         hpaStorage, hpaStatusStorage := horizontalpodautoscalerstore.NewREST(restOptionsGetter)
 52         storage["horizontalpodautoscalers"] = hpaStorage
 53         storage["horizontalpodautoscalers/status"] = hpaStatusStorage
 54
 55         return storage
 56 }

NOTE: the REST API implementation of each resource lives in a rest directory under pkg/registry

~/go/src/k8s.io/kubernetes/pkg/registry$ find . -name rest
./apps/rest
./settings/rest
./scheduling/rest
./rbac/rest
./batch/rest
./core/rest
./core/pod/rest
./core/node/rest
./extensions/rest
./policy/rest
./authentication/rest
./certificates/rest
./admissionregistration/rest
./networking/rest
./authorization/rest
./events/rest
./coordination/rest
./storage/rest
./autoscaling/rest

Tracing from there, for the HPA the detailed implementation is in pkg/registry/autoscaling/horizontalpodautoscaler/storage/storage.go

 38 // NewREST returns a RESTStorage object that will work against horizontal pod autoscalers.
 39 func NewREST(optsGetter generic.RESTOptionsGetter) (*REST, *StatusREST) {
 40         store := &genericregistry.Store{
 41                 NewFunc:                  func() runtime.Object { return &autoscaling.HorizontalPodAutoscaler{} },
 42                 NewListFunc:              func() runtime.Object { return &autoscaling.HorizontalPodAutoscalerList{} },
 43                 DefaultQualifiedResource: autoscaling.Resource("horizontalpodautoscalers"),
 44
 45                 CreateStrategy: horizontalpodautoscaler.Strategy,
 46                 UpdateStrategy: horizontalpodautoscaler.Strategy,
 47                 DeleteStrategy: horizontalpodautoscaler.Strategy,
 48
 49                 TableConvertor: printerstorage.TableConvertor{TablePrinter: printers.NewTablePrinter().With(printersinternal.AddHandlers)},
 50         }
 51         options := &generic.StoreOptions{RESTOptions: optsGetter}
 52         if err := store.CompleteWithOptions(options); err != nil {
 53                 panic(err) // TODO: Propagate error up
 54         }
 55
 56         statusStore := *store
 57         statusStore.UpdateStrategy = horizontalpodautoscaler.StatusStrategy
 58         return &REST{store}, &StatusREST{store: &statusStore}
 59 }

What we want this time is the GET list of HPAs, so it is probably autoscaling.HorizontalPodAutoscalerList in k8s.io/kubernetes/pkg/apis/autoscaling

pkg/apis/autoscaling/types.go

413 // HorizontalPodAutoscalerList is a list of horizontal pod autoscaler objects.
414 type HorizontalPodAutoscalerList struct {
415         metav1.TypeMeta
416         // Metadata is the standard list metadata.
417         // +optional
418         metav1.ListMeta
419
420         // Items is the list of horizontal pod autoscaler objects.
421         Items []HorizontalPodAutoscaler
422 }
...
394 type HorizontalPodAutoscaler struct {
395         metav1.TypeMeta
396         // Metadata is the standard object metadata.
397         // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
398         // +optional
399         metav1.ObjectMeta
400
401         // Spec is the specification for the behaviour of the autoscaler.
402         // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status.
403         // +optional
404         Spec HorizontalPodAutoscalerSpec
405
406         // Status is the current information about the autoscaler.
407         // +optional
408         Status HorizontalPodAutoscalerStatus    <- this looks like it holds the CPU utilization
409 }
...
221 type HorizontalPodAutoscalerStatus struct {
222         // ObservedGeneration is the most recent generation observed by this autoscaler.
223         // +optional
224         ObservedGeneration *int64
225
226         // LastScaleTime is the last time the HorizontalPodAutoscaler scaled the number of pods,
227         // used by the autoscaler to control how often the number of pods is changed.
228         // +optional
229         LastScaleTime *metav1.Time
230
231         // CurrentReplicas is current number of replicas of pods managed by this autoscaler,
232         // as last seen by the autoscaler.
233         CurrentReplicas int32
234
235         // DesiredReplicas is the desired number of replicas of pods managed by this autoscaler,
236         // as last calculated by the autoscaler.
237         DesiredReplicas int32
238
239         // CurrentMetrics is the last read state of the metrics used by this autoscaler.
240         CurrentMetrics []MetricStatus  <- this looks like it
241
242         // Conditions is the set of conditions required for this autoscaler to scale its target,
243         // and indicates whether or not those conditions are met.
244         Conditions []HorizontalPodAutoscalerCondition
245 }
...
296 // MetricStatus describes the last-read state of a single metric.
297 type MetricStatus struct {
298         // Type is the type of metric source.  It will be one of "Object",
299         // "Pods" or "Resource", each corresponds to a matching field in the object.
300         Type MetricSourceType
301
302         // Object refers to a metric describing a single kubernetes object
303         // (for example, hits-per-second on an Ingress object).
304         // +optional
305         Object *ObjectMetricStatus
306         // Pods refers to a metric describing each pod in the current scale target
307         // (for example, transactions-processed-per-second).  The values will be
308         // averaged together before being compared to the target value.
309         // +optional
310         Pods *PodsMetricStatus
311         // Resource refers to a resource metric (such as those specified in
312         // requests and limits) known to Kubernetes describing each pod in the
313         // current scale target (e.g. CPU or memory). Such metrics are built in to
314         // Kubernetes, and have special scaling options on top of those available
315         // to normal per-pod metrics using the "pods" source.
316         // +optional
317         Resource *ResourceMetricStatus   <- this looks like it
318         // External refers to a global metric that is not associated
319         // with any Kubernetes object. It allows autoscaling based on information
320         // coming from components running outside of cluster
321         // (for example length of queue in cloud messaging service, or
322         // QPS from loadbalancer running outside of cluster).
323         // +optional
324         External *ExternalMetricStatus
325 }

This looks like the answer:

349 // ResourceMetricStatus indicates the current value of a resource metric known to
350 // Kubernetes, as specified in requests and limits, describing each pod in the
351 // current scale target (e.g. CPU or memory).  Such metrics are built in to
352 // Kubernetes, and have special scaling options on top of those available to
353 // normal per-pod metrics using the "pods" source.
354 type ResourceMetricStatus struct {
355         // Name is the name of the resource in question.
356         Name api.ResourceName
357         // CurrentAverageUtilization is the current value of the average of the
358         // resource metric across all relevant pods, represented as a percentage of
359         // the requested value of the resource for the pods.  It will only be
360         // present if `targetAverageValue` was set in the corresponding metric
361         // specification.
362         // +optional
363         CurrentAverageUtilization *int32
364         // CurrentAverageValue is the current value of the average of the
365         // resource metric across all relevant pods, as a raw value (instead of as
366         // a percentage of the request), similar to the "pods" metric source type.
367         // It will always be set, regardless of the corresponding metric specification.
368         CurrentAverageValue resource.Quantity
369 }
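
For reference, the same figures can be read with client-go instead of kubectl. This is only a sketch under a few assumptions: a client-go release contemporary with v1.11 (pre-context method signatures), the external autoscaling/v1 API, where the value surfaces as Status.CurrentCPUUtilizationPercent rather than the internal CurrentMetrics field quoted above, and a placeholder kubeconfig path.

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from a kubeconfig file (the path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/home/ubuntu/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// List HPAs in the default namespace and print current/target utilization,
	// the same numbers `kubectl get hpa` shows in the TARGETS column.
	hpas, err := clientset.AutoscalingV1().HorizontalPodAutoscalers("default").List(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, hpa := range hpas.Items {
		if hpa.Status.CurrentCPUUtilizationPercent != nil && hpa.Spec.TargetCPUUtilizationPercent != nil {
			fmt.Printf("%s: %d%%/%d%%\n", hpa.Name, *hpa.Status.CurrentCPUUtilizationPercent, *hpa.Spec.TargetCPUUtilizationPercent)
		}
	}
}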
oomichi commented 6 years ago

Is it 50% for a 250-millicore load because the deployment was created with --requests='cpu=500m'? Check the HPA state when applying a 250-millicore load with --requests='cpu=1000m'. → The expectation is that the HPA shows 25% CPU utilization (with 500m it was 50%).

$ kubectl run resource-consumer --image=gcr.io/kubernetes-e2e-test-images/resource-consumer:1.3 --port 8080 --requests='cpu=1000m,memory=256Mi'
$ kubectl autoscale deployment resource-consumer --min=1 --max=5 --cpu-percent=20
$ kubectl expose deployment resource-consumer
$ ADDRESS=`kubectl get service | grep resource-consumer | awk '{print $3}'`
$ curl --data "millicores=250&durationSec=600" http://${ADDRESS}:8080/ConsumeCPU

Check the HPA state → as expected. It should further scale to 2 pods and drop to 12%.

$ kubectl get hpa
NAME                REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
resource-consumer   Deployment/resource-consumer   24%/20%   1         5         1          2m

Check the HPA state again after a while → as expected:

$ kubectl get hpa
NAME                REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
resource-consumer   Deployment/resource-consumer   12%/20%   1         5         2          4m

As expected, k/kubernetes/pull/66982 really is necessary!

oomichi commented 6 years ago

With the knowledge above, re-examine the test conditions of the failing e2e test → the scaleUp parameters look reasonable (a small calculation follows after the excerpt below). test/e2e/autoscaling/horizontal_pod_autoscaling.go

133 func scaleUp(name string, kind schema.GroupVersionKind, checkStability bool, rc *common.ResourceConsumer, f *framework.Framework) {
134         stasis := 0 * time.Minute
135         if checkStability {
136                 stasis = 10 * time.Minute
137         }
138         scaleTest := &HPAScaleTest{
139                 initPods:                    1,    // initial number of pods
140                 totalInitialCPUUsage:        250,  // CPU load (millicores) the consumer generates for the first scale-out (25% of one core)
141                 perPodCPURequest:            500,  // CPU request per consumer pod (500m); the HPA therefore sees the load as 50% (25% / 50%)
142                 targetCPUUtilizationPercent: 20,  // target per-pod CPU utilization (20%); pods are added or removed so the average converges on this
143                 minPods:                     1,
144                 maxPods:                     5,
145                 firstScale:                  3,  // expected pod count after the first scale-out: against a 20% target the HPA sees 50%, so 3 pods bring the average down to about 17%. Reasonable.
146                 firstScaleStasis:            stasis,
147                 cpuBurst:                    700,  // CPU load for the second scale-out; with perPodCPURequest of 500, the HPA sees it as 140%
148                 secondScale:                 5,  // expected pod count after the second scale-out: a 20% target would call for 7 pods (=140/20), but maxPods caps it at 5
149         }
150         scaleTest.run(name, kind, rc, f)
151 }
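
A small sketch of the arithmetic implied by the annotations above; the helper is mine, not part of the e2e framework.

package main

import (
	"fmt"
	"math"
)

// expectedReplicas estimates the replica count the HPA should converge on for the
// parameters annotated above: total load and per-pod request in millicores, the
// target utilization in percent, and the min/max bounds.
func expectedReplicas(totalMillicores, perPodRequest, targetPercent, minPods, maxPods int) int {
	utilization := float64(totalMillicores) / float64(perPodRequest) * 100
	n := int(math.Ceil(utilization / float64(targetPercent)))
	if n < minPods {
		return minPods
	}
	if n > maxPods {
		return maxPods
	}
	return n
}

func main() {
	// First scale-out: 250m load / 500m request = 50%, ceil(50/20) = 3 -> firstScale.
	fmt.Println(expectedReplicas(250, 500, 20, 1, 5))
	// Second scale-out: 700m burst = 140%, ceil(140/20) = 7, capped at maxPods -> secondScale = 5.
	fmt.Println(expectedReplicas(700, 500, 20, 1, 5))
}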

Check the conditions during the problematic test run. Try the simplest two-stage scale-out test:

[Fail] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] Deployment [It] Should scale from 1 pod to 3 pods and from 3 to 5

The problem is that there are 4 pods where 3 were expected:

$ kubectl get hpa --all-namespaces
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-t5rkx   test-deployment   Deployment/test-deployment   95%/20%   1         5         4          2m

The consumer pod requests 50% of a core (perPodCPURequest: 500), so the HPA should see the load as 50% (25% / 50%), but it shows 95% instead. That is the problem.

Check at the process level whether the container load is actually 50% (it might be running twice? also check whether the tests are run serially or in parallel):

$ kubectl get hpa --all-namespaces
NAMESPACE                                    NAME              REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
e2e-tests-horizontal-pod-autoscaling-fhrzw   test-deployment   Deployment/test-deployment   99%/20%   1         5         1          1m

There are lots of load processes:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1065 root      20   0  948596 139176  54948 S  14.2  3.4 676:29.88 kubelet
25666 root      20   0    9368   6044   1280 S   4.0  0.1   0:00.91 consume-cpu
25679 root      20   0    9368   5788   1284 S   4.0  0.1   0:00.89 consume-cpu
25775 root      20   0    9368   6248   1220 R   4.0  0.2   0:00.93 consume-cpu
25665 root      20   0    9432   6296   1280 S   3.6  0.2   0:00.91 consume-cpu
25667 root      20   0    9368   6352   1344 S   3.6  0.2   0:00.89 consume-cpu
25685 root      20   0    9368   5780   1280 R   3.6  0.1   0:00.90 consume-cpu
25694 root      20   0    9368   6048   1284 S   3.6  0.1   0:00.90 consume-cpu
25698 root      20   0    9368   6048   1280 S   3.6  0.1   0:00.89 consume-cpu
25702 root      20   0    9368   6044   1268 S   3.6  0.1   0:00.90 consume-cpu
25703 root      20   0    9368   6348   1344 S   3.6  0.2   0:00.89 consume-cpu
25741 root      20   0    9368   6092   1344 S   3.6  0.2   0:00.90 consume-cpu
25763 root      20   0    9368   6036   1280 S   3.6  0.1   0:00.91 consume-cpu
25675 root      20   0    9368   6368   1344 S   3.3  0.2   0:00.87 consume-cpu

There are 13 load-tool processes at around 3.6% each, for roughly 50% of CPU load in total (25% was intended). This differs from the manual run, where the load tool was a single process and the total load was different as well.

Extract the CPU-load-related entries from the test log:

Aug  6 23:34:50.709: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-fhrzw/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  6 23:35:20.901: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-fhrzw/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  6 23:35:50.987: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-fhrzw/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  6 23:36:21.086: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-fhrzw/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  6 23:36:51.189: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-fhrzw/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  6 23:37:21.367: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-fhrzw/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  6 23:37:51.484: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-fhrzw/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
Aug  6 23:38:21.617: INFO: ConsumeCPU URL: {https   192.168.1.108:6443 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-fhrzw/services/test-deployment-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=20 }
oomichi commented 6 years ago

Try the same load as in the e2e test log above:

$ kubectl run resource-consumer --image=gcr.io/kubernetes-e2e-test-images/resource-consumer:1.3 --port 8080 --requests='cpu=500m,memory=256Mi'
$ kubectl autoscale deployment resource-consumer --min=1 --max=5 --cpu-percent=20
$ kubectl expose deployment resource-consumer
$ ADDRESS=`kubectl get service | grep resource-consumer | awk '{print $3}'`
$ curl --data "durationSec=30&millicores=250&requestSizeMillicores=20" http://${ADDRESS}:8080/ConsumeCPU

It does not become multiple processes:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29246 root      20   0    9304   5212   1220 S  24.8  0.1   0:01.90 consume-cpu

It scales to 3, as the test expects:

$ kubectl get hpa
NAME                REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
resource-consumer   Deployment/resource-consumer   16%/20%   1         5         3          9m

Looking at the load tool's code, it is odd that it does not become multiple processes. In test/images/resource-consumer/controller/controller.go, 250 (millicores) / 20 (requestSizeMillicores) = 12 with integer division, plus one request for the remaining 10 millicores, i.e. 13 requests in total (a small calculation follows after the excerpt below):

 99         count := millicores / requestSizeInMillicores
100         rest := millicores - count*requestSizeInMillicores
101         fmt.Fprintf(w, "RC manager: sending %v requests to consume %v millicores each and 1 request to consume %v millicores\n", count, requestSizeInMill    icores, rest)
102         if count > 0 {
103                 handler.waitGroup.Add(count)
104                 handler.sendConsumeCPURequests(w, count, requestSizeInMillicores, durationSec)
105         }
106         if rest > 0 {
107                 handler.waitGroup.Add(1)
108                 go handler.sendOneConsumeCPURequest(w, rest, durationSec)
109         }
110         handler.waitGroup.Wait()
111 }
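
A standalone re-creation of that split so it can be run directly; the arithmetic is the same as in the excerpt, the helper name is mine.

package main

import "fmt"

// splitCPURequests mirrors the arithmetic in the controller.go excerpt above: the total
// load is split into `count` sub-requests of requestSizeInMillicores each plus one
// request for the remainder, and every sub-request ends up as its own consume-cpu process.
func splitCPURequests(millicores, requestSizeInMillicores int) (count, rest int) {
	count = millicores / requestSizeInMillicores
	rest = millicores - count*requestSizeInMillicores
	return count, rest
}

func main() {
	count, rest := splitCPURequests(250, 20)
	processes := count
	if rest > 0 {
		processes++
	}
	// 250/20 -> 12 requests of 20 millicores plus 1 request of 10 millicores,
	// i.e. 13 processes, matching the 13 consume-cpu entries in the `top` output above.
	fmt.Printf("count=%d rest=%d processes=%d\n", count, rest, processes)
}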
oomichi commented 6 years ago

Is the image not gcr.io/kubernetes-e2e-test-images/resource-consumer:1.3? With the latest code, omitting requestSizeMillicores should cause an error, yet the manual run does not error out.

The e2e test log does not show which image is being used.

$ kubectl describe deployment <deployment name>

Check with the command above. For the rs test it is:

    Image:      gcr.io/kubernetes-e2e-test-images/resource-consumer/controller-amd64:1.0

and for the deployment test it is:

   Image:      gcr.io/kubernetes-e2e-test-images/resource-consumer-amd64:1.3

Details:

$ kubectl describe deployments test-deployment -n=e2e-tests-horizontal-pod-autoscaling-cndn6
Name:                   test-deployment
Namespace:              e2e-tests-horizontal-pod-autoscaling-cndn6
CreationTimestamp:      Tue, 07 Aug 2018 00:35:11 +0000
Labels:                 name=test-deployment
Annotations:            deployment.kubernetes.io/revision=1
Selector:               name=test-deployment
Replicas:               4 desired | 4 updated | 4 total | 4 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
Pod Template:
  Labels:  name=test-deployment
  Containers:
   test-deployment:
    Image:      gcr.io/kubernetes-e2e-test-images/resource-consumer-amd64:1.3
    Port:       80/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     500m
      memory:  209715200
    Requests:
      cpu:        500m
      memory:     209715200
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   test-deployment-5fd8dc46f8 (4/4 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  2m    deployment-controller  Scaled up replica set test-deployment-5fd8dc46f8 to 1
  Normal  ScalingReplicaSet  34s   deployment-controller  Scaled up replica set test-deployment-5fd8dc46f8 to 4
$ kubectl describe hpa test-deployment -n=e2e-tests-horizontal-pod-autoscaling-cndn6
Name:                                                  test-deployment
Namespace:                                             e2e-tests-horizontal-pod-autoscaling-cndn6
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Tue, 07 Aug 2018 00:35:36 +0000
Reference:                                             Deployment/test-deployment
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  66% (334m) / 20%
Min replicas:                                          1
Max replicas:                                          5
Deployment pods:                                       4 current / 4 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     False   BackoffBoth         the time since the previous scale is still within both the downscale and upscale forbidden windows
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type     Reason                        Age              From                       Message
  ----     ------                        ----             ----                       -------
  Warning  FailedGetResourceMetric       2m (x2 over 2m)  horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  2m (x2 over 2m)  horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Normal   SuccessfulRescale             1m               horizontal-pod-autoscaler  New size: 4; reason: cpu resource utilization (percentage of request) above target
oomichi commented 6 years ago

Ah, it is starting to make sense. There are two places that handle the HTTP query for the resource-consumer's CPU consumption:

controller/controller.go:               handler.handleConsumeCPU(w, req.Form)
controller/controller.go:func (handler *Controller) handleConsumeCPU(w http.ResponseWriter, query url.Values) {
resource_consumer_handler.go:           handler.handleConsumeCPU(w, req.Form)
resource_consumer_handler.go:func (handler *ResourceConsumerHandler) handleConsumeCPU(w http.ResponseWriter, query url.Values) {

The e2e test presumably sends its requests through controller/controller.go, which supports RequestSizeInMillicoresQuery and in turn calls resource_consumer_handler.go multiple times. The manual test called resource_consumer_handler.go directly, which is probably why RequestSizeInMillicoresQuery was not supported there.

oomichi commented 6 years ago

Judging from the output below, each process should only be generating about 2% of load, yet it is generating 4%?

durationSec=30&millicores=250&requestSizeMillicores=20
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1065 root      20   0  948596 139176  54948 S  14.2  3.4 676:29.88 kubelet
25666 root      20   0    9368   6044   1280 S   4.0  0.1   0:00.91 consume-cpu
25679 root      20   0    9368   5788   1284 S   4.0  0.1   0:00.89 consume-cpu
25775 root      20   0    9368   6248   1220 R   4.0  0.2   0:00.93 consume-cpu
25665 root      20   0    9432   6296   1280 S   3.6  0.2   0:00.91 consume-cpu
25667 root      20   0    9368   6352   1344 S   3.6  0.2   0:00.89 consume-cpu
25685 root      20   0    9368   5780   1280 R   3.6  0.1   0:00.90 consume-cpu
25694 root      20   0    9368   6048   1284 S   3.6  0.1   0:00.90 consume-cpu
25698 root      20   0    9368   6048   1280 S   3.6  0.1   0:00.89 consume-cpu
25702 root      20   0    9368   6044   1268 S   3.6  0.1   0:00.90 consume-cpu
25703 root      20   0    9368   6348   1344 S   3.6  0.2   0:00.89 consume-cpu
25741 root      20   0    9368   6092   1344 S   3.6  0.2   0:00.90 consume-cpu
25763 root      20   0    9368   6036   1280 S   3.6  0.1   0:00.91 consume-cpu
25675 root      20   0    9368   6368   1344 S   3.3  0.2   0:00.87 consume-cpu

Looks like bingo. Applying three loads of 2% each, the processes each run at more than 6%:

$ curl --data "durationSec=300&millicores=20" http://${ADDRESS}:8080/ConsumeCPU
$ curl --data "durationSec=300&millicores=20" http://${ADDRESS}:8080/ConsumeCPU
$ curl --data "durationSec=300&millicores=20" http://${ADDRESS}:8080/ConsumeCPU
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15096 root      20   0    9368   6252   1220 S   6.9  0.2   0:06.50 consume-cpu
15156 root      20   0    9368   6252   1220 S   6.6  0.2   0:05.46 consume-cpu
15067 root      20   0    9368   6288   1280 S   6.3  0.2   0:07.69 consume-cpu

When the load is applied at 25%, it comes out as expected:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
16724 root      20   0    8180   4492   1288 R  25.0  0.1   0:01.35 consume-cpu

Find the threshold below which the load stops matching the request. 10% is as expected:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15846 root      20   0    9368   6036   1280 R  10.0  0.1   0:01.92 consume-cpu

5% is not:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
16200 root      20   0    9304   6368   1344 S   7.0  0.2   0:00.79 consume-cpu
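
A rough back-of-the-envelope estimate that ties these readings to the failing test. The ~2% per-process overhead is an assumption inferred from the top output above, not a measured constant.

package main

import "fmt"

func main() {
	// Assumed fixed overhead per consume-cpu process, inferred from the readings above
	// (a 20-millicore request burning roughly 3.6-4% of a core instead of 2%). This is
	// an illustration of the failure mode, not a measured constant.
	const overheadPercent = 2.0

	totalMillicores := 250      // the load the e2e test asks for
	requestSizeMillicores := 20 // dynamicRequestSizeInMillicores in the e2e utils

	processes := totalMillicores / requestSizeMillicores
	if totalMillicores%requestSizeMillicores != 0 {
		processes++ // one extra process for the remainder
	}

	intended := float64(totalMillicores) / 10.0 // 250 millicores = 25% of one core
	estimated := intended + float64(processes)*overheadPercent

	// 13 small processes add roughly 26% on top of the intended 25%: consistent with the
	// ~50% of a core seen in top, i.e. 95-99% of the pod's 500m request in the HPA view.
	fmt.Printf("processes=%d intended=%.0f%% estimated=%.0f%%\n", processes, intended, estimated)
}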
oomichi commented 6 years ago

The problematic code is in test/e2e/common/autoscaling_utils.go:

 46         dynamicRequestSizeInMillicores  = 20

Filed as k/kubernetes/issues/67051 and submitted a PR as k/kubernetes/pull/67053.

oomichi commented 6 years ago

Running with the --check-version-skew=false option, the autoscale from 3 to 5 pods failed. The cause was insufficient CPU resources → deployments, rcs, and rses left over from interrupted test runs were the culprit, so a cleanup was performed.

$ kubectl get pods -n=e2e-tests-horizontal-pod-autoscaling-fcq2t
NAME                               READY     STATUS    RESTARTS   AGE
test-deployment-5fd8dc46f8-9j59v   1/1       Running   0          9m
test-deployment-5fd8dc46f8-d5c42   0/1       Pending   0          5m
test-deployment-5fd8dc46f8-lxfbp   0/1       Pending   0          5m
test-deployment-5fd8dc46f8-nwltt   1/1       Running   0          10m
test-deployment-5fd8dc46f8-wz68c   1/1       Running   0          9m
$
$ kubectl describe pod test-deployment-5fd8dc46f8-d5c42 -n=e2e-tests-horizontal-pod-autoscaling-fcq2t
..
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  24s (x63 over 5m)  default-scheduler  0/2 nodes are available: 2 Insufficient cpu.

After applying k/kubernetes/pull/67053, the two HPA tests that had been failing now pass:

$ go run hack/e2e.go -- --provider=skeleton --test --test_args="--ginkgo.focus=from\s3\sto\s5$" --check-version-skew=false
..
Ran 2 of 999 Specs in 752.086 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 997 Skipped PASS

Ginkgo ran 1 suite in 12m32.321622799s
Test Suite Passed
2018/08/07 17:54:46 process.go:155: Step './hack/ginkgo-e2e.sh --ginkgo.focus=from\s3\sto\s5$' finished in 12m32.529859891s
2018/08/07 17:54:46 e2e.go:83: Done
oomichi commented 6 years ago

Run all the HPA tests → they pass now!

$ go run hack/e2e.go -- --provider=skeleton --test --test_args="--ginkgo.focus=\[HPA\]" --check-version-skew=false
...
S [SKIPPING] in Spec Setup (BeforeEach) [0.001 seconds]
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/framework.go:22
  should scale down with External Metric with target value from Stackdriver [Feature:CustomMetricsAutoscaling] [BeforeEach]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/custom_metrics_stackdriver_autoscaling.go:85

  Aug  7 19:12:17.376: Only supported for providers [gce gke] (not skeleton)

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/util.go:305
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSAug  7 19:12:17.378: INFO: Running AfterSuite actions on all node
Aug  7 19:12:17.378: INFO: Running AfterSuite actions on node 1

Ran 8 of 999 Specs in 4484.970 seconds
SUCCESS! -- 8 Passed | 0 Failed | 0 Pending | 991 Skipped PASS

Ginkgo ran 1 suite in 1h14m45.203539796s
Test Suite Passed
2018/08/07 19:12:17 process.go:155: Step './hack/ginkgo-e2e.sh --ginkgo.focus=\[HPA\]' finished in 1h14m45.413142012s
2018/08/07 19:12:17 e2e.go:83: Done
oomichi commented 6 years ago

Will close this once k/kubernetes/pull/67053 is merged.

oomichi commented 6 years ago

Close!!