Closed oomichi closed 5 years ago
$ go run hack/e2e.go --provider=skeleton --test --test_args="--ginkgo.focus=should\screate\sand\sstop\sa\sreplication\scontroller --master-os-distro=ubuntu --node-os-distro=ubuntu" --check-version-skew=false
<snip>
~ Failure [312.939 seconds]
[sig-cli] Kubectl client
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl/framework.go:22
[k8s.io] Update Demo
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:694
should create and stop a replication controller [Conformance] [It]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:699
Nov 13 23:22:20.185: Timed out after 300 seconds waiting for name=update-demo pods to reach valid state
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/rc_util.go:292
テストコード → これ自体はだいぶ前に Conformance になったみたい。よって、テストではなく環境が壊れている? test/e2e/kubectl/kubectl.go
303 framework.ConformanceIt("should create and stop a replication controller ", func() {
304 defer cleanupKubectlInputs(nautilus, ns, updateDemoSelector)
305
306 By("creating a replication controller")
307 framework.RunKubectlOrDieInput(nautilus, "create", "-f", "-", fmt.Sprintf("--namespace=%v", ns))
308 framework.ValidateController(c, nautilusImage, 2, "update-demo", updateDemoSelector, getUDData("nautilus.jpg", ns), ns)
309 })
最後のValidateController で失敗している。
231 // ValidateController is a generic mechanism for testing RC's that are running.
232 // It takes a container name, a test name, and a validator function which is plugged in by a specific test.
233 // "containername": this is grepped for.
234 // "containerImage" : this is the name of the image we expect to be launched. Not to confuse w/ images (kitten.jpg) which are validated.
235 // "testname": which gets bubbled up to the logging/failure messages if errors happen.
236 // "validator" function: This function is given a podID and a client, and it can do some specific validations that way.
237 func ValidateController(c clientset.Interface, containerImage string, replicas int, containername string, testname string, validator validatorFn, ns stri ng) {
238 containerImage = trimDockerRegistry(containerImage)
239 getPodsTemplate := "--template={{range.items}}{{.metadata.name}} {{end}}"
240 // NB: kubectl adds the "exists" function to the standard template functions.
241 // This lets us check to see if the "running" entry exists for each of the containers
242 // we care about. Exists will never return an error and it's safe to check a chain of
243 // things, any one of which may not exist. In the below template, all of info,
244 // containername, and running might be nil, so the normal index function isn't very
245 // helpful.
246 // This template is unit-tested in kubectl, so if you change it, update the unit test.
247 // You can read about the syntax here: http://golang.org/pkg/text/template/.
248 getContainerStateTemplate := fmt.Sprintf(`--template={{if (exists . "status" "containerStatuses")}}{{range .status.containerStatuses}}{{if (and ( eq .name "%s") (exists . "state" "running"))}}true{{end}}{{end}}{{end}}`, containername)
249
250 getImageTemplate := fmt.Sprintf(`--template={{if (exists . "status" "containerStatuses")}}{{range .status.containerStatuses}}{{if eq .name "%s"}} {{.image}}{{end}}{{end}}{{end}}`, containername)
251
252 By(fmt.Sprintf("waiting for all containers in %s pods to come up.", testname)) //testname should be selector
253 waitLoop:
254 for start := time.Now(); time.Since(start) < PodStartTimeout; time.Sleep(5 * time.Second) {
255 getPodsOutput := RunKubectlOrDie("get", "pods", "-o", "template", getPodsTemplate, "-l", testname, fmt.Sprintf("--namespace=%v", ns))
256 pods := strings.Fields(getPodsOutput)
257 if numPods := len(pods); numPods != replicas {
258 By(fmt.Sprintf("Replicas for %s: expected=%d actual=%d", testname, replicas, numPods))
259 continue
260 }
261 var runningPods []string
262 for _, podID := range pods {
263 running := RunKubectlOrDie("get", "pods", podID, "-o", "template", getContainerStateTemplate, fmt.Sprintf("--namespace=%v", ns))
264 if running != "true" {
265 Logf("%s is created but not running", podID)
266 continue waitLoop
267 }
268
269 currentImage := RunKubectlOrDie("get", "pods", podID, "-o", "template", getImageTemplate, fmt.Sprintf("--namespace=%v", ns))
270 currentImage = trimDockerRegistry(currentImage)
271 if currentImage != containerImage {
272 Logf("%s is created but running wrong image; expected: %s, actual: %s", podID, containerImage, currentImage)
273 continue waitLoop
274 }
275
276 // Call the generic validator function here.
277 // This might validate for example, that (1) getting a url works and (2) url is serving correct content.
278 if err := validator(c, podID); err != nil {
279 Logf("%s is running right image but validator function failed: %v", podID, err)
280 continue waitLoop
281 }
282
283 Logf("%s is verified up and running", podID)
284 runningPods = append(runningPods, podID)
285 }
286 // If we reach here, then all our checks passed.
287 if len(runningPods) == replicas {
288 return
289 }
290 }
291 // Reaching here means that one of more checks failed multiple times. Assuming its not a race condition, something is broken.
292 Failf("Timed out after %v seconds waiting for %s pods to reach valid state", PodStartTimeout.Seconds(), testname)
293 }
以下のようにimage 名が想定のものと異なるため、エラーとなった模様。 CPUアーキテクチャ名がイメージ名についている。 以前も似たようなことがあったなぁ。
Nov 14 02:13:01.926: INFO: update-demo-nautilus-rxzx5 is created but running wrong image;
expected: gcr.io/kubernetes-e2e-test-images/nautilus:1.0,
actual: gcr.io/kubernetes-e2e-test-images/nautilus-amd64:1.0
想定イメージ名 nautilusImage
92 nautilusImage = imageutils.GetE2EImage(imageutils.Nautilus)
test/utils/image/manifest.go
Nautilus = ImageConfig{e2eRegistry, "nautilus", "1.0"}
GCP上はアーキテクチャ名のあり・無し両方のイメージがある。
手元で deployment を作成した場合は Image 名は指定したものになっている。
$ kubectl run test01 --image=gcr.io/kubernetes-e2e-test-images/nautilus:1.0
deployment.apps/test01 created
$ kubectl describe pod test01-69fc696fbf-7ncmn
Name: test01-69fc696fbf-7ncmn
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: 113-node01/192.168.1.103
Start Time: Wed, 14 Nov 2018 02:41:40 +0000
Labels: pod-template-hash=69fc696fbf
run=test01
Annotations: <none>
Status: Running
IP: 10.244.1.108
Controlled By: ReplicaSet/test01-69fc696fbf
Containers:
test01:
Container ID: docker://ee5a65809f02b263e02d3dbdb84f3334470cbd85858c65795ca5c48409009b45
Image: gcr.io/kubernetes-e2e-test-images/nautilus:1.0
Image ID: docker://sha256:e0c9cf0dd72a7e2cafa14652c0152afcfaa773bb692e64c05f6ff5e34cecae33
Port: <none>
Host Port: <none>
State: Running
Started: Wed, 14 Nov 2018 02:41:42 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-wnjjs (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-wnjjs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-wnjjs
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 21s default-scheduler Successfully assigned default/test01-69fc696fbf-7ncmn to 113-node01
Normal Pulled 19s kubelet, 113-node01 Container image "gcr.io/kubernetes-e2e-test-images/nautilus:1.0" already present on machine
Normal Created 19s kubelet, 113-node01 Created container
Normal Started 19s kubelet, 113-node01 Started container
テスト上の pod 作成処理
nautilus = substituteImageName(string(generated.ReadOrDie(filepath.Join(updateDemoRoot, "nautilus-rc.yaml.in"))))
..
framework.RunKubectlOrDieInput(nautilus, "create", "-f", "-", fmt.Sprintf("--namespace=%v", ns))
test/e2e/framework/util.go
2292 // RunKubectlOrDieInput is a convenience wrapper over kubectlBuilder that takes input to stdin
2293 func RunKubectlOrDieInput(data string, args ...string) string {
2294 return NewKubectlCommand(args...).WithStdinData(data).ExecOrDie()
2295 }
test/fixtures/doc-yaml/user-guide/update-demo/nautilus-rc.yaml.in
1 apiVersion: v1
2 kind: ReplicationController
3 metadata:
4 name: update-demo-nautilus
5 spec:
6 replicas: 2
7 selector:
8 name: update-demo
9 version: nautilus
10 template:
11 metadata:
12 labels:
13 name: update-demo
14 version: nautilus
15 spec:
16 containers:
17 - image: {{.NautilusImage}}
18 name: update-demo
19 ports:
20 - containerPort: 80
21 protocol: TCP
上記のマニフェストでやってみる。 → 再現せず。rc上もpod上も image は amd64 無しで正しく表示されている。
$ cat hoge.yaml
apiVersion: v1
kind: ReplicationController
metadata:
name: update-demo-nautilus
spec:
replicas: 2
selector:
name: update-demo
version: nautilus
template:
metadata:
labels:
name: update-demo
version: nautilus
spec:
containers:
- image: gcr.io/kubernetes-e2e-test-images/nautilus:1.0
name: update-demo
ports:
- containerPort: 80
protocol: TCP
$ kubectl create -f hoge.yaml
replicationcontroller/update-demo-nautilus created
$ kubectl describe rc update-demo-nautilus
Name: update-demo-nautilus
Namespace: default
Selector: name=update-demo,version=nautilus
Labels: name=update-demo
version=nautilus
Annotations: <none>
Replicas: 2 current / 2 desired
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: name=update-demo
version=nautilus
Containers:
update-demo:
Image: gcr.io/kubernetes-e2e-test-images/nautilus:1.0
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 3m replication-controller Created pod: update-demo-nautilus-4bc4q
Normal SuccessfulCreate 3m replication-controller Created pod: update-demo-nautilus-7q62d
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-64f497f8fd-wtpzf 1/1 Running 0 7d18h
standalone-cinder-provisioner-7d6594d789-vtcbv 1/1 Running 0 33d
update-demo-nautilus-4bc4q 1/1 Running 0 22s
update-demo-nautilus-7q62d 1/1 Running 0 22s
$ kubectl describe pod update-demo-nautilus-4bc4q
Name: update-demo-nautilus-4bc4q
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: 113-node01/192.168.1.103
Start Time: Wed, 14 Nov 2018 03:07:26 +0000
Labels: name=update-demo
version=nautilus
Annotations: <none>
Status: Running
IP: 10.244.1.109
Controlled By: ReplicationController/update-demo-nautilus
Containers:
update-demo:
Container ID: docker://f805e2beff7bbe4b61b7c40ca22e387a0491efe3df59fb22a0fdef293e5f8a28
Image: gcr.io/kubernetes-e2e-test-images/nautilus:1.0
Image ID: docker://sha256:e0c9cf0dd72a7e2cafa14652c0152afcfaa773bb692e64c05f6ff5e34cecae33
Port: 80/TCP
Host Port: 0/TCP
State: Running
Started: Wed, 14 Nov 2018 03:07:28 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-wnjjs (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-wnjjs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-wnjjs
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 35s default-scheduler Successfully assigned default/update-demo-nautilus-4bc4q to 113-node01
Normal Pulled 34s kubelet, 113-node01 Container image "gcr.io/kubernetes-e2e-test-images/nautilus:1.0" already present on machine
Normal Created 34s kubelet, 113-node01 Created container
Normal Started 33s kubelet, 113-node01 Started container
$ kubectl describe pod update-demo-nautilus-7q62d
Name: update-demo-nautilus-7q62d
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: 113-node02/192.168.1.112
Start Time: Wed, 14 Nov 2018 03:07:26 +0000
Labels: name=update-demo
version=nautilus
Annotations: <none>
Status: Running
IP: 10.244.2.23
Controlled By: ReplicationController/update-demo-nautilus
Containers:
update-demo:
Container ID: docker://bf5e5a13e6f4fd3816a06b20c7f4503f0c1b99cf21e2476eaebea7c764304230
Image: gcr.io/kubernetes-e2e-test-images/nautilus:1.0
Image ID: docker://sha256:e0c9cf0dd72a7e2cafa14652c0152afcfaa773bb692e64c05f6ff5e34cecae33
Port: 80/TCP
Host Port: 0/TCP
State: Running
Started: Wed, 14 Nov 2018 03:07:28 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-wnjjs (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-wnjjs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-wnjjs
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned default/update-demo-nautilus-7q62d to 113-node02
Normal Pulled 1m kubelet, 113-node02 Container image "gcr.io/kubernetes-e2e-test-images/nautilus:1.0" already present on machine
Normal Created 1m kubelet, 113-node02 Created container
Normal Started 1m kubelet, 113-node02 Started container
テスト実施中の状態を取ってみた → Image名に amd64 は含まれないが、テストは何処を見ている!?
$ kubectl get rc -n e2e-tests-kubectl-zvthv
NAME DESIRED CURRENT READY AGE
update-demo-nautilus 2 2 2 16s
ubuntu@e2e:~$ kubectl -n e2e-tests-kubectl-zvthv describe rc update-demo-nautilus
Name: update-demo-nautilus
Namespace: e2e-tests-kubectl-zvthv
Selector: name=update-demo,version=nautilus
Labels: name=update-demo
version=nautilus
Annotations: <none>
Replicas: 2 current / 2 desired
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: name=update-demo
version=nautilus
Containers:
update-demo:
Image: gcr.io/kubernetes-e2e-test-images/nautilus:1.0
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 30s replication-controller Created pod: update-demo-nautilus-rhkp4
Normal SuccessfulCreate 29s replication-controller Created pod: update-demo-nautilus-nz7jm
$
$ kubectl -n e2e-tests-kubectl-zvthv get pods
NAME READY STATUS RESTARTS AGE
update-demo-nautilus-nz7jm 1/1 Running 0 36s
update-demo-nautilus-rhkp4 1/1 Running 0 36s
$
$ kubectl -n e2e-tests-kubectl-zvthv describe pod update-demo-nautilus-nz7jm
Name: update-demo-nautilus-nz7jm
Namespace: e2e-tests-kubectl-zvthv
Priority: 0
PriorityClassName: <none>
Node: 113-node02/192.168.1.112
Start Time: Wed, 14 Nov 2018 03:14:29 +0000
Labels: name=update-demo
version=nautilus
Annotations: <none>
Status: Running
IP: 10.244.2.24
Controlled By: ReplicationController/update-demo-nautilus
Containers:
update-demo:
Container ID: docker://a5ff2d030d91f967c16c9bd2165c402c0e18e267f9e4ab52e59529f9d16c143d
Image: gcr.io/kubernetes-e2e-test-images/nautilus:1.0
Image ID: docker://sha256:e0c9cf0dd72a7e2cafa14652c0152afcfaa773bb692e64c05f6ff5e34cecae33
Port: 80/TCP
Host Port: 0/TCP
State: Running
Started: Wed, 14 Nov 2018 03:14:31 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-btlp2 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-btlp2:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-btlp2
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 47s default-scheduler Successfully assigned e2e-tests-kubectl-zvthv/update-demo-nautilus-nz7jm to 113-node02
Normal Pulled 45s kubelet, 113-node02 Container image "gcr.io/kubernetes-e2e-test-images/nautilus:1.0" already present on machine
Normal Created 45s kubelet, 113-node02 Created container
Normal Started 45s kubelet, 113-node02 Started container
$
$ kubectl -n e2e-tests-kubectl-zvthv describe pod update-demo-nautilus-rhkp4
Name: update-demo-nautilus-rhkp4
Namespace: e2e-tests-kubectl-zvthv
Priority: 0
PriorityClassName: <none>
Node: 113-node01/192.168.1.103
Start Time: Wed, 14 Nov 2018 03:14:29 +0000
Labels: name=update-demo
version=nautilus
Annotations: <none>
Status: Running
IP: 10.244.1.110
Controlled By: ReplicationController/update-demo-nautilus
Containers:
update-demo:
Container ID: docker://6722659e58ccc522f8a01ce064628e15b05c751c61e2c42d74f7d3474f16645f
Image: gcr.io/kubernetes-e2e-test-images/nautilus:1.0
Image ID: docker://sha256:e0c9cf0dd72a7e2cafa14652c0152afcfaa773bb692e64c05f6ff5e34cecae33
Port: 80/TCP
Host Port: 0/TCP
State: Running
Started: Wed, 14 Nov 2018 03:14:31 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-btlp2 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-btlp2:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-btlp2
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 59s default-scheduler Successfully assigned e2e-tests-kubectl-zvthv/update-demo-nautilus-rhkp4 to 113-node01
Normal Pulled 57s kubelet, 113-node01 Container image "gcr.io/kubernetes-e2e-test-images/nautilus:1.0" already present on machine
Normal Created 57s kubelet, 113-node01 Created container
Normal Started 57s kubelet, 113-node01 Started container
Podのイメージ名取得処理が特殊? 以下のようなコマンドでとっている模様
kubectl get pods <pod-id> -o --template={{if (exists . "status" "containerStatuses")}}{{range .status.containerStatuses}}{{if eq .name "update-demo"}}{{.image}}{{end}}{{end}}{{end}} -n <namespace>
templateオプションはgo の標準的なものを使っているみたい
--template='': Template string or path to template file to use when -o=go-template, -o=go-template-file.
The template format is golang templates [http://golang.org/pkg/text/template/#pkg-overview].
新たに e2e テスト用の VM を作って実行してみたが、再現してしまった。 たぶん、k8s クラスタ側の問題だと思う。 根本原因を dig ったほうが良さそう。
本問題の予想再現シナリオ
v.1.12.x から v.13.1 でも再現するか試してみる。
# apt-get install -y kubelet=1.12.4-00 kubeadm=1.12.4-00 kubectl=1.12.4-00 kubernetes-cni
本問題の予想再現シナリオ
実行時間: 1時間46分 失敗テスト
新規 v1.13.1 環境だと成功している。 実行時間は2時間7分