NodeKiller seems to be not working in 100 node 1.17 / master performance tests

mm4tt commented 4 years ago

Original debugging done by @jkaniuk:

In 100 nodes OSS performance tests of 1.16:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-1.16-scalability-100

NodeKiller is consistently failing: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability-stable1/1219228169567997957
W0120 12:25:21.234] I0120 12:25:21.234558   12979 nodes.go:105] NodeKiller: Rebooting "e2e-big-minion-group-tt6r" to repair the node
W0120 12:25:24.556] I0120 12:25:24.555774   12979 ssh.go:38] ssh to "e2e-big-minion-group-tt6r" finished with "External IP address was not found; defaulting to using IAP > tunneling.\npacket_write_wait: Connection to UNKNOWN port 65535: Broken pipe\r\nERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].\n": exit status 255
W0120 12:25:24.556] E0120 12:25:24.555839   12979 nodes.go:108] NodeKiller: Error while rebooting node "e2e-big-minion-group-tt6r": exit status 255

mm4tt commented 4 years ago

Looks like it transiently fails in 1.16, meaning that some of the ssh calls succeed and some not (within a single run), e.g.

OK - W0129 12:36:05.537] I0129 12:36:05.536771 10910 ssh.go:38] ssh to "e2e-big-minion-group-5l39" finished with "External IP address was not found; defaulting to using IAP tunneling.\nWarning: Permanently added 'compute.691072012589517573' (ED25519) to the list of known hosts.\r\nWarning: Stopping docker.service, but it can still be activated by:\n docker.socket\n": <nil>

BAD - W0129 12:46:08.636] I0129 12:46:08.636013 10910 ssh.go:38] ssh to "e2e-big-minion-group-5l39" finished with "External IP address was not found; defaulting to using IAP tunneling.\npacket_write_wait: Connection to UNKNOWN port 65535: Broken pipe\r\nERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].\n": exit status 255

mm4tt commented 4 years ago

I checked 3 runs of 1.17 test and the problem doesn't occur there. Seems to be 1.16 specific thing.

Maybe there is a different gcloud version used in 1.16 and 1.17?

mm4tt commented 4 years ago

I'd try upgrading the gcloud version in 1.16 test to see whether it helps

mm4tt commented 4 years ago

/assign

mm4tt commented 4 years ago

https://github.com/kubernetes/test-infra/pull/16103 doesn't seem to be helping, let's revert it.

I took a deeper look a have a new theory now. It looks like in 1.17 runs there are no logs from chaosmonkey components. I believe that the error we see in 1.16 are actually expected, they are returned for reboot command which terminates the ssh connection. We don't see them in 1.17 because chaosmonkey doesn't work properly there for some reason. The thing that stands out is that in 1.17 we have this commit and we don't have it in 1.16. This commit is also present in master, and there we also don't have any chaosmonkey logs.

I'd suggest adding more logging to nodes.go in master branch to see what is going on with the chasomonkey there.

mm4tt commented 4 years ago

/good-first-issue

k8s-ci-robot commented 4 years ago

@mm4tt: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/kubernetes/perf-tests/issues/1005): >/good-first-issue Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

mm4tt commented 4 years ago

FTR, these are the chaosmonkey files that we could instrument better - https://github.com/kubernetes/perf-tests/tree/eb4fffb50d3caee11a57262b46286f051d9337fb/clusterloader2/pkg/chaos Adding more verbose logging there (e.g. listing all the nodes chaosmonkey is operating on, logging when chaosmonkey attempts to kill a node, etc.) should help us debug this issue.

jprzychodzen commented 4 years ago

/assign

jprzychodzen commented 4 years ago

There are two different issues:

Nodes are not really randomly selected. Current selecting mechanism with current configuration does not select any node to kill if ( failure rate ) * ( number of eligible nodes ) < 1. While filtering out nodes running Prometheus we are getting less than 100 nodes. Current failure rate assume that 0.01 nodes will fail. This multiplies to a number lower than one and never selects any node to kill.
Response 255 to SSH is just a non-gracefully closed connection to a node after reboot is executed.

jprzychodzen commented 4 years ago

Fixed.

jprzychodzen commented 4 years ago

/close

k8s-ci-robot commented 4 years ago

@jprzychodzen: Closing this issue.

In response to [this](https://github.com/kubernetes/perf-tests/issues/1005#issuecomment-597015482): >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

mm4tt commented 4 years ago

/reopen

https://github.com/kubernetes/perf-tests/pull/1140

k8s-ci-robot commented 4 years ago

@mm4tt: Reopened this issue.

In response to [this](https://github.com/kubernetes/perf-tests/issues/1005#issuecomment-605995401): >/reopen > >https://github.com/kubernetes/perf-tests/pull/1140 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

jkaniuk commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

wojtek-t commented 4 years ago

/remove-lifecycle stale /lifecycle frozen

kubernetes / perf-tests

NodeKiller seems to be not working in 100 node 1.17 / master performance tests #1005