Open mm4tt opened 4 years ago
Looks like it transiently fails in 1.16, meaning that some of the ssh calls succeed and some do not (within a single run), e.g.
OK - W0129 12:36:05.537] I0129 12:36:05.536771 10910 ssh.go:38] ssh to "e2e-big-minion-group-5l39" finished with "External IP address was not found; defaulting to using IAP tunneling.\nWarning: Permanently added 'compute.691072012589517573' (ED25519) to the list of known hosts.\r\nWarning: Stopping docker.service, but it can still be activated by:\n docker.socket\n": <nil>
BAD - W0129 12:46:08.636] I0129 12:46:08.636013 10910 ssh.go:38] ssh to "e2e-big-minion-group-5l39" finished with "External IP address was not found; defaulting to using IAP tunneling.\npacket_write_wait: Connection to UNKNOWN port 65535: Broken pipe\r\nERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].\n": exit status 255
I checked 3 runs of the 1.17 test and the problem doesn't occur there. It seems to be specific to 1.16.
Maybe there is a different gcloud version used in 1.16 and 1.17?
I'd try upgrading the gcloud version in the 1.16 test to see whether it helps.
/assign
https://github.com/kubernetes/test-infra/pull/16103 doesn't seem to be helping, let's revert it.
I took a deeper look and have a new theory now. It looks like in 1.17 runs there are no logs from chaosmonkey components. I believe that the errors we see in 1.16 are actually expected: they are returned for the reboot command, which terminates the ssh connection. We don't see them in 1.17 because chaosmonkey doesn't work properly there for some reason. The thing that stands out is that in 1.17 we have this commit and we don't have it in 1.16. This commit is also present in master, and there we also don't have any chaosmonkey logs.
I'd suggest adding more logging to nodes.go in the master branch to see what is going on with chaosmonkey there.
/good-first-issue
@mm4tt: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
FTR, these are the chaosmonkey files that we could instrument better - https://github.com/kubernetes/perf-tests/tree/eb4fffb50d3caee11a57262b46286f051d9337fb/clusterloader2/pkg/chaos Adding more verbose logging there (e.g. listing all the nodes chaosmonkey is operating on, logging when chaosmonkey attempts to kill a node, etc.) should help us debug this issue.
/assign
There are two different issues:
Exit status 255 from SSH is just a non-gracefully closed connection to a node after reboot is executed. Fixed.
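To make the distinction concrete, here is a small Go sketch of how a caller might tell an expected disconnect apart from a real failure. It is only an illustration, not the fix that was applied: `classifySSHResult` and `rebootsNode` are hypothetical helpers, and matching on the word "reboot" in the command is an assumption. The only fact relied on is that ssh itself exits with 255 when an error occurs on the connection, rather than passing through the remote command's status.

```go
package main

import "strings"

// sshError is the status ssh itself exits with when an error occurred
// (for example a dropped connection), as opposed to the remote command's status.
const sshError = 255

// rebootsNode reports whether the remote command is expected to take the node
// down. Matching on the word "reboot" is an assumption made for illustration.
func rebootsNode(command string) bool {
	for _, word := range strings.Fields(command) {
		if word == "reboot" {
			return true
		}
	}
	return false
}

// classifySSHResult labels an ssh invocation as "ok", "expected-disconnect",
// or "failure" based on the command that was sent and the exit code seen.
func classifySSHResult(command string, exitCode int) string {
	switch {
	case exitCode == 0:
		return "ok"
	case exitCode == sshError && rebootsNode(command):
		// The node went down mid-session, so the broken pipe is expected.
		return "expected-disconnect"
	default:
		return "failure"
	}
}
```

With this split, a 255 after a reboot can be logged at a lower severity, while a 255 after any other command still surfaces as an error.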
/close
@jprzychodzen: Closing this issue.
@mm4tt: Reopened this issue.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale /lifecycle frozen
Original debugging done by @jkaniuk: