Closed munnerz closed 6 years ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: We suggest the following additional approver: munnerz
Assign the PR to them by writing /assign @munnerz
in a comment when ready.
The full list of commands accepted by this bot can be found here.
/test e2e
1) 👍
2) The 'scheduledWorkQueue' takes care of this. It ensures that 'member pilots' (i.e. pilots that are a member of the same cluster) will be synced every ~10s regardless of resourceVersion etc.
3) Same as above
4) Same as above
5) It would be great if it did have this (a watch mechanism). I think there is something like this in the client library we are using, but under the hood it's just polling afaik. This falls under improvements for the future imo however.
6) How do we justify adding this health status field, while at the same time reducing the API surface area in #205 ?
<- not sure what you mean here. We're looking to simplify, not reduce the API surface. By no means does that mean we should halt future development, more just take count of what we have so far.
7) It will be used by Navigator to make decisions around scale up, but the information can also be surfaced to users (e.g. through kubectl get
in future):
$ kc get esc
NAME READY HEALTH MASTERS INGEST DATA
test-cluster 5/5 Green 3 2 2
8) Yes, it would be used to feed into things such as the update operator, or deciding whether to scale up/down. The delay here is particularly difficult, however I think we're always going to have to suffer some delay unless we fully block/shut down the database. Instead of aiming to get a perfect in time snapshot of the system, I'm trying to go down the route of expecting things to be out of date, but 'tolerating' a certain upper bound of late data (e.g. I will accept this as the truth if it was reported within the last 5s).
9) Regarding how long something has been green/yellow/red for, absolutely agree we need this information and it will be coming in a follow up. This is effectively the heartbeat. I'll be adding this further down the line for ESC :)
Thanks for the review Richard! Hopefully that clears up a few Qs. I'll get an e2e test together today.
/test e2e
/test all [submit-queue is verifying that this PR is safe to merge]
argh. failed on some stupid networking flake...
/test e2e v1.9
Automatic merge from submit-queue.
What this PR does / why we need it:
Adds support for recording the health status of the Elasticsearch cluster back to the API.
This also introduces a 10s collection delay to the collection of node metrics for Elasticsearch.
Which issue this PR fixes: fixes #191
Release note: