Closed wallrj closed 6 years ago
/retest
I'm not 100% sure on how this works, so am not able to effectively review it.
I've added a docstring to the function as you suggested.
It'd be great to also see an e2e test added with this that covers the scenario described in #331. I'm reluctant to close the issue without knowing for sure that it's resolved.
I doubt I'm going to be able to write an E2E test for this particular failure. As far as I can tell, the error occurs because when a new Cassandra node starts up and connects to a seed node, the rest of the nodes in the cluster see it as a live node before it has necessarily generated and gossiped its host_id. My original version of the Go nodetool status function assumed that every node whose state and status are known would also be present in the host_id map.
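To make the failure mode concrete, here is a minimal sketch of a status function that tolerates a node which is known to be live but has not yet gossiped its host_id. It is not the code in this PR: the `NodeStatus` type, the `buildStatus` function, and the map shapes are hypothetical, chosen only to mirror the live, unreachable, and host_id maps described above.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeStatus is a hypothetical per-node record; the real Navigator type may differ.
type NodeStatus struct {
	Address string
	HostID  string // empty until the node has gossiped its host_id
	Live    bool
}

// buildStatus joins the live and unreachable sets with the host_id map.
// A node that is missing from hostIDs is reported with an empty HostID
// rather than being treated as an error.
func buildStatus(live, unreachable map[string]struct{}, hostIDs map[string]string) []NodeStatus {
	out := make([]NodeStatus, 0, len(live)+len(unreachable))
	for addr := range live {
		// Indexing a Go map with an absent key yields the zero value, "".
		out = append(out, NodeStatus{Address: addr, HostID: hostIDs[addr], Live: true})
	}
	for addr := range unreachable {
		out = append(out, NodeStatus{Address: addr, HostID: hostIDs[addr], Live: false})
	}
	// Sort by address so the result does not depend on map iteration order.
	sort.Slice(out, func(i, j int) bool { return out[i].Address < out[j].Address })
	return out
}

func main() {
	// Two live nodes, only one of which has a host_id so far.
	live := map[string]struct{}{"172.17.0.10": {}, "172.17.0.11": {}}
	hostIDs := map[string]string{"172.17.0.10": "example-host-id"}
	for _, n := range buildStatus(live, nil, hostIDs) {
		fmt.Printf("%s live=%t hostID=%q\n", n.Address, n.Live, n.HostID)
	}
}
```

Callers that need a host_id would then have to check for the empty string themselves. Whether that is the right trade-off depends on how the probes consume the status.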
/retest
> I doubt I'm going to be able to write an E2E test for this particular failure.
Would simply ensuring we can add a new node to an existing cluster without any of the other nodes failing their readiness probes test this case?
/retest
Ok. Looks like I triggered the failure:
I0427 15:03:40.431] 2018-04-27 15:02:22 +0000 UTC 2018-04-27 15:02:22 +0000 UTC 1 cass-test-np-region-1-zone-a-0.1529531f54f60de4 Pod spec.containers{cassandra} Warning Unhealthy kubelet, 432f5d7f-4a29-11e8-a8d5-0a580a1c0002 Liveness probe failed: HTTP probe failed with statuscode: 500
Unfortunately, the pilot logs aren't available because the container was restarted by the subsequent liveness probe.
Perhaps I'll change the tests to exit after the first failure.
Here we go. The test failed and the logs contain the nodetool error:
W0427 15:58:40.008] + echo 'TEST FAILURE: original pods were unhealthy during the scale out'
W0427 15:58:40.008] + exit 1
W0427 15:58:40.008] + dump_debug_logs /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.008] + local output_dir=/go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.008] + echo 'Dumping cluster state to /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs'
W0427 15:58:40.008] + mkdir -p /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.009] + kubectl cluster-info dump --all-namespaces --output-directory /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
I0427 15:58:40.110] Checking original pods for 'Unhealthy' events during scale out...
I0427 15:58:40.110] 2018-04-27 15:57:17 +0000 UTC 2018-04-27 15:57:17 +0000 UTC 1 cass-test-np-region-1-zone-a-0 Pod spec.containers{cassandra} Warning Unhealthy kubelet, 29966e7a-4a31-11e8-9444-0a580a1c540c Liveness probe failed: HTTP probe failed with statuscode: 500
I0427 15:58:40.110] 2018-04-27 15:57:17 +0000 UTC 2018-04-27 15:57:17 +0000 UTC 1 cass-test-np-region-1-zone-a-0 Pod spec.containers{cassandra} Warning Unhealthy kubelet, 29966e7a-4a31-11e8-9444-0a580a1c540c Readiness probe failed: HTTP probe failed with statuscode: 500
E0427 15:57:17.727230 15 listen.go:21] Error while running Check function for probe on port 12001: mapped nodes must be a superset of Live and Unreachable nodes. Live: map[172.17.0.11:{} 172.17.0.10:{}], Unreachable: map[], Mapped: map[172.17.0.10:{}]
E0427 15:57:17.738374 15 listen.go:21] Error while running Check function for probe on port 12000: mapped nodes must be a superset of Live and Unreachable nodes. Live: map[172.17.0.11:{} 172.17.0.10:{}], Unreachable: map[], Mapped: map[172.17.0.10:{}]
It only failed on the 1.7 cluster.
Now I'll commit the fix and expect the tests to pass consistently.
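For reference, the probe error in the log says that the mapped set lacks an address that appears in the live set. The sketch below reproduces that comparison using the addresses from the log. The function name `missingFromMapped` and the `map[string]struct{}` set representation are assumptions made for illustration, not the actual implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// missingFromMapped returns, in sorted order, every address that appears in
// live or unreachable but has no entry in mapped. An address present in both
// input sets would be listed twice.
func missingFromMapped(live, unreachable, mapped map[string]struct{}) []string {
	var missing []string
	for _, set := range []map[string]struct{}{live, unreachable} {
		for addr := range set {
			if _, ok := mapped[addr]; !ok {
				missing = append(missing, addr)
			}
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Values copied from the logged error message.
	live := map[string]struct{}{"172.17.0.11": {}, "172.17.0.10": {}}
	unreachable := map[string]struct{}{}
	mapped := map[string]struct{}{"172.17.0.10": {}}

	fmt.Println(missingFromMapped(live, unreachable, mapped)) // prints [172.17.0.11]
}
```

Here 172.17.0.11 is the newly added node: it is already counted as live but has no host_id mapping yet, which is the window described earlier in this thread.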
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: munnerz
Fixes: #331
Release note: