jetstack / navigator

Managed Database-as-a-Service (DBaaS) on Kubernetes
Apache License 2.0
271 stars 31 forks source link

Report ElasticsearchCluster health status to API #195

Closed munnerz closed 6 years ago

munnerz commented 6 years ago

What this PR does / why we need it:

Adds support for recording the health status of the Elasticsearch cluster back to the API.

This also introduces a 10s collection delay to the collection of node metrics for Elasticsearch.

Which issue this PR fixes: fixes #191

Release note:

Record overall Elasticsearch cluster health as status on the ElasticsearchCluster resource
jetstack-ci-bot commented 6 years ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: We suggest the following additional approver: munnerz

Assign the PR to them by writing /assign @munnerz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files: - **[OWNERS](https://github.com/jetstack/navigator/blob/master/OWNERS)** You can indicate your approval by writing `/approve` in a comment You can cancel your approval by writing `/approve cancel` in a comment
munnerz commented 6 years ago

/test e2e

munnerz commented 6 years ago

1) 👍 2) The 'scheduledWorkQueue' takes care of this. It ensures that 'member pilots' (i.e. pilots that are a member of the same cluster) will be synced every ~10s regardless of resourceVersion etc. 3) Same as above 4) Same as above 5) It would be great if it did have this (a watch mechanism). I think there is something like this in the client library we are using, but under the hood it's just polling afaik. This falls under improvements for the future imo however. 6) How do we justify adding this health status field, while at the same time reducing the API surface area in #205 ? <- not sure what you mean here. We're looking to simplify, not reduce the API surface. By no means does that mean we should halt future development, more just take count of what we have so far. 7) It will be used by Navigator to make decisions around scale up, but the information can also be surfaced to users (e.g. through kubectl get in future):

$ kc get esc
NAME               READY     HEALTH   MASTERS   INGEST   DATA
test-cluster       5/5       Green    3         2        2

8) Yes, it would be used to feed into things such as the update operator, or deciding whether to scale up/down. The delay here is particularly difficult, however I think we're always going to have to suffer some delay unless we fully block/shut down the database. Instead of aiming to get a perfect in time snapshot of the system, I'm trying to go down the route of expecting things to be out of date, but 'tolerating' a certain upper bound of late data (e.g. I will accept this as the truth if it was reported within the last 5s).

9) Regarding how long something has been green/yellow/red for, absolutely agree we need this information and it will be coming in a follow up. This is effectively the heartbeat. I'll be adding this further down the line for ESC :)

Thanks for the review Richard! Hopefully that clears up a few Qs. I'll get an e2e test together today.

munnerz commented 6 years ago

/test e2e

jetstack-ci-bot commented 6 years ago

/test all [submit-queue is verifying that this PR is safe to merge]

munnerz commented 6 years ago

argh. failed on some stupid networking flake...

/test e2e v1.9

jetstack-ci-bot commented 6 years ago

Automatic merge from submit-queue.