Remove broken host map subset assertions from nodetool

wallrj commented 6 years ago

Fixes: #331

Release note:

NONE

wallrj commented 6 years ago

/retest

wallrj commented 6 years ago

> I'm not 100% sure on how this works, so am not able to effectively review it.

I've added a docstring to the function as you suggested.

> It'd be great to also see an e2e test added with this that tests the scenario described in #331. I'm apprehensive to close the issue without knowing for sure it's resolved :smile:

I doubt I'm going to be able to write an E2E test for this particular failure. As far as I can tell, the error occurs because when a new Cassandra node starts up and connects to a seed node, the rest of the nodes in the cluster see it as a live node before it has necessarily generated and gossiped its host_id. My original version of the Go nodetool status function assumed that every node whose state and status are known would also be present in the host_id map.
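To illustrate the behaviour described above, here is a minimal sketch of a status merge that tolerates a node with no host_id yet. The names `NodeStatus` and `BuildStatus` are hypothetical and are not taken from the navigator code; the sketch only assumes that the live set, the unreachable set and the host_id map arrive as separate maps keyed by address.

```go
package main

import "sort"

// NodeStatus is a hypothetical per-node record used only for this sketch.
type NodeStatus struct {
	Address string
	Live    bool
	HostID  string // empty until the node has gossiped its host_id
}

// BuildStatus merges the live and unreachable sets with the host_id map.
// A node that is missing from hostIDs is reported with an empty HostID
// instead of being treated as an error.
func BuildStatus(live, unreachable map[string]struct{}, hostIDs map[string]string) []NodeStatus {
	out := make([]NodeStatus, 0, len(live)+len(unreachable))
	for addr := range live {
		out = append(out, NodeStatus{Address: addr, Live: true, HostID: hostIDs[addr]})
	}
	for addr := range unreachable {
		if _, ok := live[addr]; ok {
			continue // already reported as live
		}
		out = append(out, NodeStatus{Address: addr, Live: false, HostID: hostIDs[addr]})
	}
	// Sort by address so the result is deterministic.
	sort.Slice(out, func(i, j int) bool { return out[i].Address < out[j].Address })
	return out
}
```

With this shape, a caller that needs the host_id can check for the empty string and retry later, while a health probe can ignore it.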

wallrj commented 6 years ago

/retest

munnerz commented 6 years ago

> I doubt I'm going to be able to write an E2E test for this particular failure.

Would simply ensuring we can add a new node to an existing cluster without any of the other nodes failing their readiness probes test this case?
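A rough sketch of the check proposed here, assuming the test has already collected the cluster events into a simple slice. The `Event` type and `UnhealthyEvents` function are hypothetical illustrations, not part of the navigator e2e suite.

```go
package main

// Event is a hypothetical, simplified view of a Kubernetes event.
type Event struct {
	PodName string
	Reason  string
	Message string
}

// UnhealthyEvents returns the events with reason "Unhealthy" that belong
// to pods which existed before the scale out.
func UnhealthyEvents(events []Event, originalPods []string) []Event {
	original := make(map[string]struct{}, len(originalPods))
	for _, name := range originalPods {
		original[name] = struct{}{}
	}
	var failures []Event
	for _, ev := range events {
		if _, ok := original[ev.PodName]; ok && ev.Reason == "Unhealthy" {
			failures = append(failures, ev)
		}
	}
	return failures
}
```

The test would fail if the returned slice is non-empty after the new node has joined.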

wallrj commented 6 years ago

/retest

wallrj commented 6 years ago

Ok. Looks like I triggered the failure:

https://jetstack-build-infra.appspot.com/build/jetstack-logs/pr-logs/pull/jetstack_navigator/333/navigator-e2e-v1-8/1183/

I0427 15:03:40.431] 2018-04-27 15:02:22 +0000 UTC   2018-04-27 15:02:22 +0000 UTC   1         cass-test-np-region-1-zone-a-0.1529531f54f60de4   Pod       spec.containers{cassandra}   Warning   Unhealthy   kubelet, 432f5d7f-4a29-11e8-a8d5-0a580a1c0002   Liveness probe failed: HTTP probe failed with statuscode: 500

Unfortunately, the pilot logs aren't available because the container was restarted by the subsequent liveness probe test.

Perhaps I'll change the tests to exit after the first failure.

wallrj commented 6 years ago

Here we go. The test failed and the logs contain the nodetool error:

https://jetstack-build-infra.appspot.com/build/jetstack-logs/pr-logs/pull/jetstack_navigator/333/navigator-e2e-v1-7/1171/

W0427 15:58:40.008] + echo 'TEST FAILURE: original pods were unhealthy during the scale out'
W0427 15:58:40.008] + exit 1
W0427 15:58:40.008] + dump_debug_logs /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.008] + local output_dir=/go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.008] + echo 'Dumping cluster state to /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs'
W0427 15:58:40.008] + mkdir -p /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.009] + kubectl cluster-info dump --all-namespaces --output-directory /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
I0427 15:58:40.110] Checking original pods for 'Unhealthy' events during scale out...
I0427 15:58:40.110] 2018-04-27 15:57:17 +0000 UTC   2018-04-27 15:57:17 +0000 UTC   1         cass-test-np-region-1-zone-a-0   Pod       spec.containers{cassandra}   Warning   Unhealthy   kubelet, 29966e7a-4a31-11e8-9444-0a580a1c540c   Liveness probe failed: HTTP probe failed with statuscode: 500
I0427 15:58:40.110] 2018-04-27 15:57:17 +0000 UTC   2018-04-27 15:57:17 +0000 UTC   1         cass-test-np-region-1-zone-a-0   Pod       spec.containers{cassandra}   Warning   Unhealthy   kubelet, 29966e7a-4a31-11e8-9444-0a580a1c540c   Readiness probe failed: HTTP probe failed with statuscode: 500

logs.txt

E0427 15:57:17.727230      15 listen.go:21] Error while running Check function for probe on port 12001: mapped nodes must be a superset of Live and Unreachable nodes. Live: map[172.17.0.11:{} 172.17.0.10:{}], Unreachable: map[], Mapped: map[172.17.0.10:{}]
E0427 15:57:17.738374      15 listen.go:21] Error while running Check function for probe on port 12000: mapped nodes must be a superset of Live and Unreachable nodes. Live: map[172.17.0.11:{} 172.17.0.10:{}], Unreachable: map[], Mapped: map[172.17.0.10:{}]
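For reference, a hypothetical reconstruction of the kind of superset assertion that produces an error like the one above. This is a sketch under the assumption that the three sets are plain maps keyed by address; it is not the actual navigator code, and `checkMappedSuperset` is an invented name.

```go
package main

import "fmt"

// checkMappedSuperset returns an error if any live or unreachable node
// is missing from the mapped (host_id) set.
func checkMappedSuperset(live, unreachable, mapped map[string]struct{}) error {
	for _, set := range []map[string]struct{}{live, unreachable} {
		for addr := range set {
			if _, ok := mapped[addr]; !ok {
				return fmt.Errorf("node %s has a known state but no host_id mapping", addr)
			}
		}
	}
	return nil
}
```

Fed the values from the log (Live contains 172.17.0.10 and 172.17.0.11, Mapped contains only 172.17.0.10), a check like this fails for 172.17.0.11, which is the newly started node that has not yet gossiped its host_id.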

It only failed on the 1.7 cluster.

Now I'll commit the fix and expect the tests to pass consistently.

munnerz commented 6 years ago

/lgtm
/approve

jetstack-bot commented 6 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: munnerz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/jetstack/navigator/blob/master/OWNERS)~~ [munnerz]

Approvers can indicate their approval by writing `/approve` in a comment

Approvers can cancel approval by writing `/approve cancel` in a comment
