kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

Prow created another run while the current one was in progress #7841

Closed shyamjvs closed 6 years ago

shyamjvs commented 6 years ago

This happened for our scalability job - https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-performance. See runs 143 (the one correctly created per the cron schedule) and 144 (created at the wrong time, while 143 was still in progress). Maybe something bad happened with the earlier run's pod.

@krzyzacy @BenTheElder - Could you PTAL or triage?

BenTheElder commented 6 years ago

assigning oncall, will be in meetings mostly until this afternoon. /assign @Q-Lee

krzyzacy commented 6 years ago

evicted?!

krzyzacy commented 6 years ago

It just happened again:

[screenshot]

Looks like the artifact upload somehow killed the pod.

krzyzacy commented 6 years ago

[screenshot]

It was deleted by the garbage collector and recreated by plank, although I can't find anything in the kubelet log.

I don't have much knowledge of how the garbage collector works; cc @dashpole if you know what could be happening here. Prow is on 1.8.7 now.
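
For triage, something along these lines can confirm whether the pod vanished because its node went away (a sketch only; the pod name placeholder and the `test-pods` namespace are assumptions on my part, not taken from this cluster):

```bash
# Hypothetical triage commands; the pod name and namespace are placeholders.
kubectl -n test-pods get pod <prowjob-pod> -o wide        # shows which node it was scheduled to
kubectl get node <node-name>                              # NotFound if the instance was deleted underneath it
kubectl -n test-pods get events --sort-by=.lastTimestamp | tail -n 50   # recent deletions/evictions
```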

krzyzacy commented 6 years ago

Further info: the node gke-prow-pool-n1-highmem-8-81ce4395-p87n was GC'd and rebooted; still looking into why.

A  I0424 17:34:08.758639   21999 server.go:779] GET /healthz: (28.755µs) 200 [[Go-http-client/1.1] 127.0.0.1:48934] 
A  I0424 17:34:09.606962   21999 server.go:779] GET /healthz: (25.299µs) 200 [[curl/7.56.1] 127.0.0.1:48946] 
A  I0424 17:34:19.614586   21999 server.go:779] GET /healthz: (40.083µs) 200 [[curl/7.56.1] 127.0.0.1:49064] 
A  I0424 17:34:29.622445   21999 server.go:779] GET /healthz: (31.071µs) 200 [[curl/7.56.1] 127.0.0.1:49174] 
A  I0424 17:34:39.629337   21999 server.go:779] GET /healthz: (24.758µs) 200 [[curl/7.56.1] 127.0.0.1:49292] 
A  I0424 17:34:44.296633   21999 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration 
A  I0424 17:34:49.636328   21999 server.go:779] GET /healthz: (30.911µs) 200 [[curl/7.56.1] 127.0.0.1:49426] 
A  I0424 17:34:59.643691   21999 server.go:779] GET /healthz: (31.338µs) 200 [[curl/7.56.1] 127.0.0.1:49534] 
A  E0424 17:35:05.235789   21999 remote_image.go:67] ListImages with filter nil from image service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
A  E0424 17:35:05.236329   21999 kuberuntime_image.go:140] ListImages failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
A  E0424 17:35:05.236641   21999 handler.go:276] HTTP InternalServerError serving /stats/summary: Internal Error: failed to get imageFs stats: failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
A  I0424 17:35:05.236862   21999 server.go:779] GET /stats/summary/: (2m0.002025429s) 500 
A  goroutine 2088134 [running]: 
A  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc421529570, 0x1f4) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:207 +0xdd 
A  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc421529570, 0x1f4) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:186 +0x35 
A  k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Response).WriteHeader(0xc422840120, 0x1f4) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/response.go:201 +0x41 
A  k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc422840120, 0x1f4, 0xc4216aecf0, 0x8b, 0x2, 0xc4216aecf0) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/response.go:181 +0x46 
A  k8s.io/kubernetes/pkg/kubelet/server/stats.handleError(0xc422840120, 0x37a96d7, 0xe, 0x536b920, 0xc422dc8ce0) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:277 +0x1da 
A  k8s.io/kubernetes/pkg/kubelet/server/stats.(*handler).handleSummary(0xc420972aa0, 0xc422b05d40, 0xc422840120) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:188 +0x79 
A  k8s.io/kubernetes/pkg/kubelet/server/stats.(*handler).(k8s.io/kubernetes/pkg/kubelet/server/stats.handleSummary)-fm(0xc422b05d40, 0xc422840120) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:100 +0x3e 
A  k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch(0xc4206be510, 0x5395de0, 0xc421529570, 0xc421df2600) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:277 +0xb8d 
A  k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).(k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.dispatch)-fm(0x5395de0, 0xc421529570, 0xc421df2600) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:120 +0x48 
A  net/http.HandlerFunc.ServeHTTP(0xc420465940, 0x5395de0, 0xc421529570, 0xc421df2600) 
A   /usr/local/go/src/net/http/server.go:1942 +0x44 
A  net/http.(*ServeMux).ServeHTTP(0xc4203e3020, 0x5395de0, 0xc421529570, 0xc421df2600) 
A   /usr/local/go/src/net/http/server.go:2238 +0x130 
A  k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).ServeHTTP(0xc4206be510, 0x5395de0, 0xc421529570, 0xc421df2600) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:292 +0x4d 
A  k8s.io/kubernetes/pkg/kubelet/server.(*Server).ServeHTTP(0xc4209faaf0, 0x5395de0, 0xc421529570, 0xc421df2600) 
A   /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:778 +0x110 
A  net/http.serverHandler.ServeHTTP(0xc420e764d0, 0x53966a0, 0xc4212efc00, 0xc421df2600) 
A   /usr/local/go/src/net/http/server.go:2568 +0x92 
A  net/http.(*conn).serve(0xc420b401e0, 0x5398c20, 0xc421e8fa80) 
A   /usr/local/go/src/net/http/server.go:1825 +0x612 
A  created by net/http.(*Server).Serve 
A   /usr/local/go/src/net/http/server.go:2668 +0x2ce 
A  logging error output: "Internal Error: failed to get imageFs stats: failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded" 

Also, before that we had been seeing

image_gc_manager.go:285] [imageGCManager]: Disk usage on image filesystem is at 97% which is over the high threshold (85%). Trying to free 33740498534 bytes

and

kubelet.go:1230] Image garbage collection failed multiple times in a row: rpc error: code = Unknown desc = Error response from daemon: layer does not exist

and as a result the node was deleted:

I  Compute Engine delete us-central1-f:gke-prow-pool-n1-highmem-8-81ce4395-p87n {"@type":"type.googleapis.com/google.cloud.audit.AuditLog","authenticationInfo":{},"requestMetadata":{"callerSuppliedUserAgent":"Managed Infrastructure Mixer Client"},"serviceName":"compute.googleapis.com","methodName":"v1.compute.instances.delete","authorizationInfo":[{"permission":"compute.instanc… Compute Engine delete us-central1-f:gke-prow-pool-n1-highmem-8-81ce4395-p87n  
I  compute.instances.delete {"actor":{"user":""},"resource":{"name":"gke-prow-pool-n1-highmem-8-81ce4395-p87n","type":"instance","zone":"us-central1-f","id":"115186519679285199"},"trace_id":"operation-1524591325162-56a9b945d2a10-5b42f21a-593c00f0","event_type":"GCE_API_CALL","operation":{"zone":"us-central1-f","id":"5706342505… compute.instances.delete 
A  I0424 17:35:26.380661   21999 docker_server.go:73] Stop docker server 
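
For reference, the 85% figure in that imageGCManager message maps to the kubelet's image GC flags; a minimal sketch of the relevant settings (the values shown are the upstream defaults, not necessarily what these prow nodes run):

```bash
# Kubelet image GC knobs (upstream flag names; values are the defaults, shown for illustration).
# Once imagefs usage crosses the high threshold, the kubelet deletes unused images until usage
# drops below the low threshold; the repeated deletion failures above are what went wrong here.
kubelet \
  --image-gc-high-threshold=85 \
  --image-gc-low-threshold=80 \
  --minimum-image-ttl-duration=2m0s
```
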
BenTheElder commented 6 years ago

Discussed a bit offline with @mtaufen and @krzyzacy; tentatively this is related to nodefs pressure, and we should create a cronjob with the soft disk eviction hack :/
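
For concreteness, the kubelet side of the "soft disk eviction hack" looks roughly like this (the flag names are the upstream ones; the thresholds, grace periods, and how a cronjob would apply them to the prow nodes are assumptions here):

```bash
# Illustrative soft-eviction settings for nodefs/imagefs pressure.
# The specific percentages and grace periods are assumptions, not what prow actually runs.
kubelet \
  --eviction-soft=nodefs.available<10%,imagefs.available<15% \
  --eviction-soft-grace-period=nodefs.available=2m,imagefs.available=2m \
  --eviction-max-pod-grace-period=60
```

With soft eviction, pods are evicted gracefully once a threshold has been exceeded for its grace period, rather than the node running its disk to 97% and getting recycled.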

krzyzacy commented 6 years ago

Looked at this offline with @dashpole; it seems the scalability pod was doing excessive disk I/O towards the end of the run, which triggered image GC and borked the kubelet.

@shyamjvs I attached a section of the log below; can you let me know what was happening? I see the job has --use-logexporter enabled, so it shouldn't be pulling logs onto the node?

E  I0424 17:05:58.134] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-tfq1

E  W0424 17:05:58.297] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:05:58.298] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:05:58.300] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:05:58.306] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:05:58.626] 

E  W0424 17:05:58.627] Specify --start=48966 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:05:58.702] 

E  W0424 17:05:58.702] Specify --start=47766 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:05:58.741] 

E  W0424 17:05:58.742] Specify --start=63367 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:05:59.015] 

E  W0424 17:05:59.015] Specify --start=46552 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:05:59.130] 

E  W0424 17:05:59.130] Specify --start=49075 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:05:59.338] 

E  W0424 17:05:59.339] Specify --start=48267 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:06:01.266] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:01.267] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:01.268] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:01.278] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:01.380] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:01.381] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:01.381] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:01.385] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:01.456] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:01.457] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:01.457] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:01.462] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:01.776] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:01.777] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:01.777] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:01.781] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:02.025] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:02.033] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:02.034] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:02.034] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:02.100] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:02.100] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:02.101] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:02.114] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:02.312] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:02.314] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:02.315] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:02.318] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:02.423] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:02.424] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:02.425] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:02.432] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:02.564] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:02.565] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:02.565] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:02.568] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:02.571] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:02.571] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:02.572] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:02.578] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:02.860] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:02.861] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:02.861] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:02.867] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:02.946] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:02.946] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:02.946] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:02.952] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:03.103] 

E  W0424 17:06:03.103] Specify --start=48858 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:06:03.722] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:03.722] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:03.723] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:03.727] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:03.786] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:03.787] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:03.787] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:03.791] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:03.846] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:03.847] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:03.848] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:03.856] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:04.212] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:04.213] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:04.214] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:04.219] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:04.434] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:04.434] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:04.434] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:04.439] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:04.953] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:04.953] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:04.954] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:04.961] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:05.007] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:05.007] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:05.008] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:05.013] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:05.122] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:05.123] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:05.123] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:05.126] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:05.207] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:05.208] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:05.208] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:05.212] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:05.412] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:05.412] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:05.412] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:05.416] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:05.706] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:05.706] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:05.707] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:05.711] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  W0424 17:06:06.312] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:06:06.313] scp: /var/log/node-problem-detector.log*: No such file or directory

E  W0424 17:06:06.313] scp: /var/log/startupscript.log*: No such file or directory

E  W0424 17:06:06.316] ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

E  I0424 17:07:12.954] Changing logfiles to be world-readable for download

E  I0424 17:07:13.281] Changing logfiles to be world-readable for download

E  I0424 17:07:13.727] Changing logfiles to be world-readable for download

E  I0424 17:07:13.987] Changing logfiles to be world-readable for download

E  I0424 17:07:14.548] Changing logfiles to be world-readable for download

E  I0424 17:07:14.620] Changing logfiles to be world-readable for download

E  I0424 17:07:14.747] Changing logfiles to be world-readable for download

E  I0424 17:07:15.150] Changing logfiles to be world-readable for download

E  I0424 17:07:15.359] Changing logfiles to be world-readable for download

E  I0424 17:07:15.542] Changing logfiles to be world-readable for download

E  I0424 17:07:16.040] Changing logfiles to be world-readable for download

E  I0424 17:07:16.054] Changing logfiles to be world-readable for download

E  I0424 17:07:16.077] Changing logfiles to be world-readable for download

E  I0424 17:07:16.235] Changing logfiles to be world-readable for download

E  I0424 17:07:16.469] Changing logfiles to be world-readable for download

E  I0424 17:07:17.212] Changing logfiles to be world-readable for download

E  I0424 17:07:17.641] Changing logfiles to be world-readable for download

E  I0424 17:07:18.412] Changing logfiles to be world-readable for download

E  I0424 17:07:20.040] Changing logfiles to be world-readable for download

E  I0424 17:07:21.807] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v53m

E  I0424 17:07:22.633] Changing logfiles to be world-readable for download

E  I0424 17:07:22.815] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v35m

E  I0424 17:07:23.000] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v4gv

E  I0424 17:07:23.273] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v15n

E  I0424 17:07:23.482] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vrr8

E  I0424 17:07:23.734] Changing logfiles to be world-readable for download

E  I0424 17:07:24.421] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v62x

E  I0424 17:07:24.755] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vqqw

E  I0424 17:07:24.833] Changing logfiles to be world-readable for download

E  I0424 17:07:24.882] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v7cs

E  I0424 17:07:24.061] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v3gm

E  I0424 17:07:24.225] Changing logfiles to be world-readable for download

E  I0424 17:07:25.102] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-tz13

E  I0424 17:07:25.179] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v4kx

E  W0424 17:07:25.952] 

E  W0424 17:07:25.952] Specify --start=48994 in the next get-serial-port-output invocation to get only the new output starting from here.

E  I0424 17:07:26.194] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vk2f

E  I0424 17:07:26.265] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v2kc

E  W0424 17:07:26.442] 

E  W0424 17:07:26.443] Specify --start=48753 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:26.525] 

E  W0424 17:07:26.525] Specify --start=47459 in the next get-serial-port-output invocation to get only the new output starting from here.

E  I0424 17:07:26.626] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v421

E  I0424 17:07:26.679] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-v6v0

E  W0424 17:07:26.844] 

E  W0424 17:07:26.845] Specify --start=61849 in the next get-serial-port-output invocation to get only the new output starting from here.

E  I0424 17:07:26.946] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vb8s

E  I0424 17:07:27.228] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vcbr

E  W0424 17:07:27.329] 

E  W0424 17:07:27.330] Specify --start=49300 in the next get-serial-port-output invocation to get only the new output starting from here.

E  I0424 17:07:27.648] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vngc

E  W0424 17:07:27.779] 

E  W0424 17:07:27.779] Specify --start=47253 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:28.335] 

E  W0424 17:07:28.335] Specify --start=47267 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:28.613] 

E  W0424 17:07:28.614] Specify --start=43707 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:28.951] 

E  W0424 17:07:28.951] Specify --start=48558 in the next get-serial-port-output invocation to get only the new output starting from here.

E  I0424 17:07:29.360] Copying 'kube-proxy fluentd node-problem-detector startupscript' from gce-scale-cluster-minion-group-1-vkqw

E  W0424 17:07:29.214] 

E  W0424 17:07:29.214] Specify --start=48341 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:29.259] 

E  W0424 17:07:29.259] Specify --start=47588 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:30.039] 

E  W0424 17:07:30.040] Specify --start=46729 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:30.578] 

E  W0424 17:07:30.578] Specify --start=46668 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:30.993] 

E  W0424 17:07:30.993] Specify --start=65597 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:31.062] 

E  W0424 17:07:31.063] Specify --start=48455 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:31.078] 

E  W0424 17:07:31.079] Specify --start=47968 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:31.116] 

E  W0424 17:07:31.116] Specify --start=48249 in the next get-serial-port-output invocation to get only the new output starting from here.

E  W0424 17:07:31.824] 

E  W0424 17:07:31.824] Specify --start=65007 in the next get-serial-port-output invocation to get only the new output starting from here.

E  I0424 17:07:32.322] Changing logfiles to be world-readable for download

E  W0424 17:07:32.423] scp: /var/log/fluentd.log*: No such file or directory

E  W0424 17:07:32.424] scp: /var/log/node-problem-detector.log*: No such file or directory

krzyzacy commented 6 years ago

@shyamjvs I dug further - from the successful run 139, I found:

I0414 00:17:00.069] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-kqbs. Queuing it for logdump through SSH.
I0414 00:17:00.096] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-kt1t. Queuing it for logdump through SSH.
I0414 00:17:00.117] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-kzrd. Queuing it for logdump through SSH.
I0414 00:17:00.151] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l2s7. Queuing it for logdump through SSH.
I0414 00:17:00.158] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l2w6. Queuing it for logdump through SSH.
I0414 00:17:00.173] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l4gh. Queuing it for logdump through SSH.
I0414 00:17:00.180] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l508. Queuing it for logdump through SSH.
I0414 00:17:00.187] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l7n7. Queuing it for logdump through SSH.
I0414 00:17:00.201] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-l8qg. Queuing it for logdump through SSH.
I0414 00:17:00.215] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lc37. Queuing it for logdump through SSH.
I0414 00:17:00.263] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-ll0j. Queuing it for logdump through SSH.
I0414 00:17:00.284] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lq4f. Queuing it for logdump through SSH.
I0414 00:17:00.291] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lqwr. Queuing it for logdump through SSH.
I0414 00:17:00.312] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lwdq. Queuing it for logdump through SSH.
I0414 00:17:00.326] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-lxs3. Queuing it for logdump through SSH.
I0414 00:17:00.361] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-m2cp. Queuing it for logdump through SSH.
I0414 00:17:00.368] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-m3g8. Queuing it for logdump through SSH.
I0414 00:17:00.389] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-m4d6. Queuing it for logdump through SSH.
I0414 00:17:00.397] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-m5g5. Queuing it for logdump through SSH.
I0414 00:17:00.439] Logexporter didn't succeed on node gce-scale-cluster-minion-group-4-mk54. Queuing it for logdump through SSH.
...
namespace "logexporter" deleted

I think it's because logexporter is not functional and everything falls back to the scp logdump, which fills up the local disk.
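
For context, the SSH fallback amounts to roughly this per queued node, with everything landing on the CI pod's local disk before upload (a simplified sketch; the real log-dump script copies more files and uses different plumbing, and the zone value here is illustrative):

```bash
# Hypothetical per-node fallback dump; the node name is taken from the log above, the zone is assumed.
NODE="gce-scale-cluster-minion-group-4-kqbs"
ZONE="us-east1-a"                              # assumed zone, illustrative only
DEST="_artifacts/${NODE}"

mkdir -p "${DEST}"
gcloud compute scp --zone="${ZONE}" \
  "${NODE}:/var/log/kube-proxy.log*" \
  "${NODE}:/var/log/fluentd.log*" \
  "${NODE}:/var/log/node-problem-detector.log*" \
  "${NODE}:/var/log/startupscript.log*" \
  "${DEST}/" || echo "some logs missing on ${NODE} (matches the scp errors earlier in this issue)"
```

Multiply that by however many nodes end up in the fallback queue and the prow node's filesystem fills up.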

BenTheElder commented 6 years ago

@shyamjvs any update on the logexporter? This kind of failure also interrupts other jobs.

BenTheElder commented 6 years ago

I've since added a bit more disk to the nodes while normalizing the CI machines, which may help, but we should still fix this.

shyamjvs commented 6 years ago

Sorry I missed this thread earlier. I remember speaking about this with @krzyzacy offline.

There isn't really anything unusual with logexporter here, from my understanding. In general, every run has a few tens of nodes (out of the 5k) that fail to run or finish logexporter in time. For those nodes, we fall back to our original way of copying logs over SSH. That shouldn't be (and wasn't) a problem as long as not too many nodes fail (by too many I mean hundreds).

For example, look at this run, which was perfectly fine - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/139/build-log.txt. It still has several tens of nodes where logexporter didn't succeed, but that didn't cause any pod eviction, etc.
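
As a rough back-of-envelope for why tens of fallback nodes are tolerable but hundreds would not be (the per-node log size below is purely an assumed figure for illustration, not a measurement from this job):

```bash
# Illustrative only: the per-node log size is an assumption, not a measured number.
PER_NODE_LOG_MB=200   # hypothetical size of the logs pulled from one node over SSH
for NODES in 30 300; do
  echo "${NODES} fallback nodes -> ~$(( NODES * PER_NODE_LOG_MB / 1024 )) GiB of transient disk on the CI pod"
done
```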

shyamjvs commented 6 years ago

In short, what I'm saying is that those logexporter messages are a red herring and happen quite commonly. They're most likely unrelated to this issue (unless we find evidence otherwise).

BenTheElder commented 6 years ago

Are we deleting these after copying them to the CI machine for upload? Even transient disk space is one of the resources we can't allocate to jobs and need to keep under control; it has been problematic in the past.
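
i.e. something along these lines per node directory, so the local copy never accumulates (a sketch only; the GCS destination layout and directory naming are assumptions, not what the upload tooling actually does):

```bash
# Hypothetical copy-then-delete loop to bound transient disk use on the CI pod.
ARTIFACTS="_artifacts"
GCS_DEST="gs://<bucket>/<job>/<run>/artifacts"   # assumed destination layout
for NODE_DIR in "${ARTIFACTS}"/gce-scale-cluster-minion-group-*; do
  gsutil -m cp -r "${NODE_DIR}" "${GCS_DEST}/" \
    && rm -rf "${NODE_DIR}"    # free the space before dumping the next node's logs
done
```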

krzyzacy commented 6 years ago

@shyamjvs xref https://github.com/kubernetes/kubernetes/issues/63030#issuecomment-394672476 here. The scheduling looks fine now per http://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-performance, but I think the issue is still there.

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 6 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 6 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/7841#issuecomment-435551863):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.