assigning oncall, will be in meetings mostly until this afternoon. /assign @Q-Lee
evicted?!
It just happened again:
looks like upload artifacts somehow killed the pod
It's deleted by the garbage collector and recreated by plank, although I cannot find anything in the kubelet log.
I don't know much about how the garbage collector works - cc @dashpole if you know what could be happening here. Prow is at 1.8.7 now.
Further info: the node gke-prow-pool-n1-highmem-8-81ce4395-p87n was GC'd and rebooted; still looking into why.
A I0424 17:34:08.758639 21999 server.go:779] GET /healthz: (28.755µs) 200 [[Go-http-client/1.1] 127.0.0.1:48934]
A I0424 17:34:09.606962 21999 server.go:779] GET /healthz: (25.299µs) 200 [[curl/7.56.1] 127.0.0.1:48946]
A I0424 17:34:19.614586 21999 server.go:779] GET /healthz: (40.083µs) 200 [[curl/7.56.1] 127.0.0.1:49064]
A I0424 17:34:29.622445 21999 server.go:779] GET /healthz: (31.071µs) 200 [[curl/7.56.1] 127.0.0.1:49174]
A I0424 17:34:39.629337 21999 server.go:779] GET /healthz: (24.758µs) 200 [[curl/7.56.1] 127.0.0.1:49292]
A I0424 17:34:44.296633 21999 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
A I0424 17:34:49.636328 21999 server.go:779] GET /healthz: (30.911µs) 200 [[curl/7.56.1] 127.0.0.1:49426]
A I0424 17:34:59.643691 21999 server.go:779] GET /healthz: (31.338µs) 200 [[curl/7.56.1] 127.0.0.1:49534]
A E0424 17:35:05.235789 21999 remote_image.go:67] ListImages with filter nil from image service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
A E0424 17:35:05.236329 21999 kuberuntime_image.go:140] ListImages failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
A E0424 17:35:05.236641 21999 handler.go:276] HTTP InternalServerError serving /stats/summary: Internal Error: failed to get imageFs stats: failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded
A I0424 17:35:05.236862 21999 server.go:779] GET /stats/summary/: (2m0.002025429s) 500
A goroutine 2088134 [running]:
A k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc421529570, 0x1f4)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:207 +0xdd
A k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc421529570, 0x1f4)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:186 +0x35
A k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Response).WriteHeader(0xc422840120, 0x1f4)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/response.go:201 +0x41
A k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc422840120, 0x1f4, 0xc4216aecf0, 0x8b, 0x2, 0xc4216aecf0)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/response.go:181 +0x46
A k8s.io/kubernetes/pkg/kubelet/server/stats.handleError(0xc422840120, 0x37a96d7, 0xe, 0x536b920, 0xc422dc8ce0)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:277 +0x1da
A k8s.io/kubernetes/pkg/kubelet/server/stats.(*handler).handleSummary(0xc420972aa0, 0xc422b05d40, 0xc422840120)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:188 +0x79
A k8s.io/kubernetes/pkg/kubelet/server/stats.(*handler).(k8s.io/kubernetes/pkg/kubelet/server/stats.handleSummary)-fm(0xc422b05d40, 0xc422840120)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:100 +0x3e
A k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch(0xc4206be510, 0x5395de0, 0xc421529570, 0xc421df2600)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:277 +0xb8d
A k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).(k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.dispatch)-fm(0x5395de0, 0xc421529570, 0xc421df2600)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:120 +0x48
A net/http.HandlerFunc.ServeHTTP(0xc420465940, 0x5395de0, 0xc421529570, 0xc421df2600)
A /usr/local/go/src/net/http/server.go:1942 +0x44
A net/http.(*ServeMux).ServeHTTP(0xc4203e3020, 0x5395de0, 0xc421529570, 0xc421df2600)
A /usr/local/go/src/net/http/server.go:2238 +0x130
A k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).ServeHTTP(0xc4206be510, 0x5395de0, 0xc421529570, 0xc421df2600)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:292 +0x4d
A k8s.io/kubernetes/pkg/kubelet/server.(*Server).ServeHTTP(0xc4209faaf0, 0x5395de0, 0xc421529570, 0xc421df2600)
A /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:778 +0x110
A net/http.serverHandler.ServeHTTP(0xc420e764d0, 0x53966a0, 0xc4212efc00, 0xc421df2600)
A /usr/local/go/src/net/http/server.go:2568 +0x92
A net/http.(*conn).serve(0xc420b401e0, 0x5398c20, 0xc421e8fa80)
A /usr/local/go/src/net/http/server.go:1825 +0x612
A created by net/http.(*Server).Serve
A /usr/local/go/src/net/http/server.go:2668 +0x2ce
A logging error output: "Internal Error: failed to get imageFs stats: failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Also, before that we'd been seeing:
image_gc_manager.go:285] [imageGCManager]: Disk usage on image filesystem is at 97% which is over the high threshold (85%). Trying to free 33740498534 bytes"
and
kubelet.go:1230] Image garbage collection failed multiple times in a row: rpc error: code = Unknown desc = Error response from daemon: layer does not exist"
As a result, the node (and its kubelet) was deleted:
I Compute Engine delete us-central1-f:gke-prow-pool-n1-highmem-8-81ce4395-p87n {"@type":"type.googleapis.com/google.cloud.audit.AuditLog","authenticationInfo":{},"requestMetadata":{"callerSuppliedUserAgent":"Managed Infrastructure Mixer Client"},"serviceName":"compute.googleapis.com","methodName":"v1.compute.instances.delete","authorizationInfo":[{"permission":"compute.instanc… Compute Engine delete us-central1-f:gke-prow-pool-n1-highmem-8-81ce4395-p87n
I compute.instances.delete {"actor":{"user":""},"resource":{"name":"gke-prow-pool-n1-highmem-8-81ce4395-p87n","type":"instance","zone":"us-central1-f","id":"115186519679285199"},"trace_id":"operation-1524591325162-56a9b945d2a10-5b42f21a-593c00f0","event_type":"GCE_API_CALL","operation":{"zone":"us-central1-f","id":"5706342505… compute.instances.delete
A I0424 17:35:26.380661 21999 docker_server.go:73] Stop docker server
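For reference, the 85% figure in the image_gc_manager log above is the kubelet's image GC high threshold. A minimal sketch for checking what the node is actually running with (the flag names are standard kubelet flags; the values shown are the upstream defaults, and whether the prow nodes override them is an assumption):
```sh
# Run on the affected node: dump the image-GC / eviction flags the kubelet was started with.
ps -o args= -C kubelet | tr ' ' '\n' | grep -E 'image-gc|eviction' || true
# Upstream defaults consistent with the log above:
#   --image-gc-high-threshold=85   # start image GC once imagefs usage exceeds 85%
#   --image-gc-low-threshold=80    # delete images until usage is back under 80%
```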
Discussed a bit offline with @mtaufen @krzyzacy; tentatively this is related to nodefs pressure, and we should create a cronjob with the soft disk eviction hack :/
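(Not the actual hack - just a rough sketch of what soft disk eviction settings look like, using standard kubelet eviction flags; the thresholds and the KUBELET_EXTRA_ARGS variable are assumptions:)
```sh
# Hypothetical soft-eviction settings such a job might apply; values are illustrative only.
# Soft eviction lets the kubelet start evicting pods before nodefs pressure becomes hard pressure.
KUBELET_EXTRA_ARGS="--eviction-soft=nodefs.available<20%,imagefs.available<20% \
  --eviction-soft-grace-period=nodefs.available=2m,imagefs.available=2m \
  --eviction-minimum-reclaim=nodefs.available=5Gi,imagefs.available=5Gi"
```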
Looked into this offline with @dashpole; it seems the scalability pod was doing excessive disk I/O towards the end of the run, which triggered image GC and borked the kubelet.
@shyamjvs I attached a section of the log - can you let me know what was happening? I saw the job had --use-logexporter enabled, so it should not be pulling logs onto the node?
E I0424 17:05:58.134] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-tfq1
E W0424 17:05:58.297] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:05:58.298] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:05:58.300] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:05:58.306] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:05:58.626]
E W0424 17:05:58.627] Specify --start=48966 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:05:58.702]
E W0424 17:05:58.702] Specify --start=47766 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:05:58.741]
E W0424 17:05:58.742] Specify --start=63367 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:05:59.015]
E W0424 17:05:59.015] Specify --start=46552 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:05:59.130]
E W0424 17:05:59.130] Specify --start=49075 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:05:59.338]
E W0424 17:05:59.339] Specify --start=48267 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:06:01.266] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:01.267] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:01.268] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:01.278] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:01.380] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:01.381] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:01.381] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:01.385] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:01.456] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:01.457] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:01.457] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:01.462] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:01.776] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:01.777] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:01.777] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:01.781] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:02.025] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:02.033] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:02.034] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:02.034] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:02.100] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:02.100] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:02.101] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:02.114] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:02.312] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:02.314] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:02.315] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:02.318] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:02.423] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:02.424] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:02.425] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:02.432] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:02.564] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:02.565] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:02.565] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:02.568] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:02.571] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:02.571] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:02.572] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:02.578] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:02.860] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:02.861] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:02.861] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:02.867] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:02.946] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:02.946] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:02.946] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:02.952] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:03.103]
E W0424 17:06:03.103] Specify --start=48858 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:06:03.722] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:03.722] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:03.723] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:03.727] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:03.786] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:03.787] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:03.787] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:03.791] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:03.846] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:03.847] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:03.848] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:03.856] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:04.212] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:04.213] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:04.214] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:04.219] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:04.434] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:04.434] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:04.434] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:04.439] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:04.953] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:04.953] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:04.954] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:04.961] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:05.007] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:05.007] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:05.008] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:05.013] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:05.122] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:05.123] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:05.123] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:05.126] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:05.207] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:05.208] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:05.208] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:05.212] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:05.412] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:05.412] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:05.412] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:05.416] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:05.706] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:05.706] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:05.707] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:05.711] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E W0424 17:06:06.312] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:06:06.313] scp: /var/log/node-problem-detector.log*: No such file or directory
E W0424 17:06:06.313] scp: /var/log/startupscript.log*: No such file or directory
E W0424 17:06:06.316] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
E I0424 17:07:12.954] Changing logfiles to be world-readable for download
E I0424 17:07:13.281] Changing logfiles to be world-readable for download
E I0424 17:07:13.727] Changing logfiles to be world-readable for download
E I0424 17:07:13.987] Changing logfiles to be world-readable for download
E I0424 17:07:14.548] Changing logfiles to be world-readable for download
E I0424 17:07:14.620] Changing logfiles to be world-readable for download
E I0424 17:07:14.747] Changing logfiles to be world-readable for download
E I0424 17:07:15.150] Changing logfiles to be world-readable for download
E I0424 17:07:15.359] Changing logfiles to be world-readable for download
E I0424 17:07:15.542] Changing logfiles to be world-readable for download
E I0424 17:07:16.040] Changing logfiles to be world-readable for download
E I0424 17:07:16.054] Changing logfiles to be world-readable for download
E I0424 17:07:16.077] Changing logfiles to be world-readable for download
E I0424 17:07:16.235] Changing logfiles to be world-readable for download
E I0424 17:07:16.469] Changing logfiles to be world-readable for download
E I0424 17:07:17.212] Changing logfiles to be world-readable for download
E I0424 17:07:17.641] Changing logfiles to be world-readable for download
E I0424 17:07:18.412] Changing logfiles to be world-readable for download
E I0424 17:07:20.040] Changing logfiles to be world-readable for download
E I0424 17:07:21.807] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v53m
E I0424 17:07:22.633] Changing logfiles to be world-readable for download
E I0424 17:07:22.815] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v35m
E I0424 17:07:23.000] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v4gv
E I0424 17:07:23.273] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v15n
E I0424 17:07:23.482] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vrr8
E I0424 17:07:23.734] Changing logfiles to be world-readable for download
E I0424 17:07:24.421] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v62x
E I0424 17:07:24.755] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vqqw
E I0424 17:07:24.833] Changing logfiles to be world-readable for download
E I0424 17:07:24.882] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v7cs
E I0424 17:07:24.061] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v3gm
E I0424 17:07:24.225] Changing logfiles to be world-readable for download
E I0424 17:07:25.102] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-tz13
E I0424 17:07:25.179] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v4kx
E W0424 17:07:25.952]
E W0424 17:07:25.952] Specify --start=48994 in the next get-serial-port-output invocation to get only the new output starting from here.
E I0424 17:07:26.194] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vk2f
E I0424 17:07:26.265] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v2kc
E W0424 17:07:26.442]
E W0424 17:07:26.443] Specify --start=48753 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:26.525]
E W0424 17:07:26.525] Specify --start=47459 in the next get-serial-port-output invocation to get only the new output starting from here.
E I0424 17:07:26.626] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v421
E I0424 17:07:26.679] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v6v0
E W0424 17:07:26.844]
E W0424 17:07:26.845] Specify --start=61849 in the next get-serial-port-output invocation to get only the new output starting from here.
E I0424 17:07:26.946] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vb8s
E I0424 17:07:27.228] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vcbr
E W0424 17:07:27.329]
E W0424 17:07:27.330] Specify --start=49300 in the next get-serial-port-output invocation to get only the new output starting from here.
E I0424 17:07:27.648] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vngc
E W0424 17:07:27.779]
E W0424 17:07:27.779] Specify --start=47253 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:28.335]
E W0424 17:07:28.335] Specify --start=47267 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:28.613]
E W0424 17:07:28.614] Specify --start=43707 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:28.951]
E W0424 17:07:28.951] Specify --start=48558 in the next get-serial-port-output invocation to get only the new output starting from here.
E I0424 17:07:29.360] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vkqw
E W0424 17:07:29.214]
E W0424 17:07:29.214] Specify --start=48341 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:29.259]
E W0424 17:07:29.259] Specify --start=47588 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:30.039]
E W0424 17:07:30.040] Specify --start=46729 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:30.578]
E W0424 17:07:30.578] Specify --start=46668 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:30.993]
E W0424 17:07:30.993] Specify --start=65597 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:31.062]
E W0424 17:07:31.063] Specify --start=48455 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:31.078]
E W0424 17:07:31.079] Specify --start=47968 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:31.116]
E W0424 17:07:31.116] Specify --start=48249 in the next get-serial-port-output invocation to get only the new output starting from here.
E W0424 17:07:31.824]
E W0424 17:07:31.824] Specify --start=65007 in the next get-serial-port-output invocation to get only the new output starting from here.
E I0424 17:07:32.322] Changing logfiles to be world-readable for download
E W0424 17:07:32.423] scp: /var/log/fluentd.log*: No such file or directory
E W0424 17:07:32.424] scp: /var/log/node-problem-detector.log*: No such file or directory
@shyamjvs Dug further - from the succeeded run 139, I found:
I0414 00:17:00.069] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-kqbs. Queuing it for logdump through SSH.
I0414 00:17:00.096] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-kt1t. Queuing it for logdump through SSH.
I0414 00:17:00.117] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-kzrd. Queuing it for logdump through SSH.
I0414 00:17:00.151] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l2s7. Queuing it for logdump through SSH.
I0414 00:17:00.158] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l2w6. Queuing it for logdump through SSH.
I0414 00:17:00.173] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l4gh. Queuing it for logdump through SSH.
I0414 00:17:00.180] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l508. Queuing it for logdump through SSH.
I0414 00:17:00.187] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l7n7. Queuing it for logdump through SSH.
I0414 00:17:00.201] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l8qg. Queuing it for logdump through SSH.
I0414 00:17:00.215] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lc37. Queuing it for logdump through SSH.
I0414 00:17:00.263] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-ll0j. Queuing it for logdump through SSH.
I0414 00:17:00.284] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lq4f. Queuing it for logdump through SSH.
I0414 00:17:00.291] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lqwr. Queuing it for logdump through SSH.
I0414 00:17:00.312] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lwdq. Queuing it for logdump through SSH.
I0414 00:17:00.326] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lxs3. Queuing it for logdump through SSH.
I0414 00:17:00.361] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-m2cp. Queuing it for logdump through SSH.
I0414 00:17:00.368] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-m3g8. Queuing it for logdump through SSH.
I0414 00:17:00.389] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-m4d6. Queuing it for logdump through SSH.
I0414 00:17:00.397] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-m5g5. Queuing it for logdump through SSH.
I0414 00:17:00.439] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-mk54. Queuing it for logdump through SSH.
...
namespace "logexporter" deleted
I think it's because logexporter is not functional and everything falls back to the scp logdump, which fills up the local disk.
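A quick way to gauge the scale of the fallback and the transient space it consumes (a sketch; the artifact path is an assumption about this job's layout):
```sh
# Count how many nodes fell back to SSH logdump in this run, and check how much
# space the dumped logs take locally (the _artifacts path is assumed).
grep -c "Logexporter didn't succeed" build-log.txt
du -sh /workspace/_artifacts 2>/dev/null
```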
@shyamjvs any update on the logexporter? This kind of failure also interrupts other jobs.
I've since added a bit more disk to the nodes when normalizing the CI machines, which may help, but we should fix this.
Sorry I missed this thread earlier. I remember speaking about this with @krzyzacy offline.
There isn't really anything unusual with logexporter here, from my understanding. In general, in every run there are a few tens of nodes among the 5k that fail to run/finish logexporter on time. For such nodes, we fall back to our original way of copying over logs through SSH. That shouldn't be (and wasn't) a problem as long as not too many nodes fail (by too many I mean hundreds).
For example, look at this run, which was perfectly fine - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/139/build-log.txt It still has several tens of nodes where logexporter didn't succeed, but that didn't cause any pod eviction, etc.
In short, those logexporter logs are a red herring and happen quite commonly. They're most likely unrelated to this issue (unless we find evidence otherwise).
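For comparison, the fallback count in the healthy run 139 linked above can be checked straight from its build log (a sketch using the URL from the previous paragraph):
```sh
# Count logexporter fallbacks in run 139, which completed fine.
curl -s https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/139/build-log.txt \
  | grep -c "Logexporter didn't succeed"
```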
Are we deleting these after copying them to the CI machine for upload? Even transient disk space is one of the resources we can't allocate to jobs and need to keep under control; it has been problematic in the past.
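(For illustration only - a cleanup along these lines, where the paths and bucket are placeholders rather than what the job actually does:)
```sh
# Hypothetical: upload the dumped logs, then free the transient space on the CI pod.
gsutil -m cp -r /workspace/_artifacts "gs://<bucket>/<run-id>/artifacts/" \
  && rm -rf /workspace/_artifacts/*
```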
@shyamjvs xref https://github.com/kubernetes/kubernetes/issues/63030#issuecomment-394672476 here - the scheduling looks fine now per http://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-performance, but I think the issue is still there.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
This happened for our scalability job - https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-performance. See runs 143 (which is the right one to be created per the cron) and 144 (created at the wrong time, while 143 was still running). Maybe something bad happened with the earlier run's pod.
@krzyzacy @BenTheElder - Could you PTAL or triage?