Status / health of workload clusters

othylmann commented 6 years ago

Epic Story

As a customer, I want to be able to easily and efficiently check the health of my workload clusters so I know whether they require attention or not.

(This Epic is in ideation phase and the following stories below have been created to collect the various sources of information we require in order to build the MVP of this Epic)

Linked UserStories

Competitor Analysis giantswarm/giantswarm#6136
Customer Focus Sessions/ Feedback giantswarm/giantswarm#6137
Available Data giantswarm/giantswarm#6138
Mock ups giantswarm/giantswarm#6139
Architectural implications giantswarm/giantswarm#6312
In scope/ out of scope

User Personas

Linked Stories

Statuspage.io giantswarm/giantswarm#2966
How to Define Cluster Health giantswarm/giantswarm#3221
Overview of Cluster State and Status giantswarm/giantswarm#3223
Transparency regarding cluster state: giantswarm/giantswarm#2209
Real-time event stream: giantswarm/giantswarm#3579
Expose workload cluster metrics: giantswarm/giantswarm#1959
History of a cluster: giantswarm/giantswarm#2197
Other related tickets https://github.com/orgs/giantswarm/projects/41?fullscreen=true&card_filter_query=label%3Aarea%2Fobservability

othylmann commented 6 years ago

Currently it is linked the cluster transparency topic in the teams.

othylmann commented 6 years ago

@giantswarm/sig-product I added that to the product board index, as I am not sure where to ask for a possible meeting to get the transparency, / health / history stuff all in one spec.

teemow commented 6 years ago

The roadmap stories for that are currently:

Transparency regarding cluster state: https://github.com/giantswarm/giantswarm/issues/2209
Realtime event stream: https://github.com/giantswarm/giantswarm/issues/3579
Expose tenant cluster metrics: https://github.com/giantswarm/giantswarm/issues/1959
History of a cluster: https://github.com/giantswarm/giantswarm/issues/2197

See them in the roadmap board: https://github.com/orgs/giantswarm/projects/41?fullscreen=true&card_filter_query=label%3Aarea%2Fobservability

As most of them are kind of technical I am happy to work on a story that ignores the technical solutions and describes the needs of the customers.

othylmann commented 6 years ago

The question is just at the moment how to do something like that. For me the question becomes if we can make this into something that has a good path. I see that in your roadmap. It makes sense doing things in that order. If we take the above 4 plus the statuspage customer request we have a giant mix of things that might have a good path through that is customer focussed.

Is that just somebody sitting down and writing stuff, then I theoretically can possibly do it. Or is it a group in a hangout discussing and writing? what is your experience what works here. without too much overhead.

teemow commented 6 years ago

You can gladly write down input here if you want. I'll work on the final stories for the team. But I'll start working on them once we have time for them. There are other more important stories to work on atm.

othylmann commented 6 years ago

I think I linked the most important points through tickets up there at the top. In general the point is that customers would like to:

Know that the clusters are fine, which is also part of the contracts in terms of us checking a service in the cluster behind our ingress to make sure our requests go through the "whole" cluster.
Have a page whatever they can point internal customers to to check if the cluster is fine as currently it is: developer asks internal team if cluster is fine, internal team asks us, we reply back, they reply back. That is annoying.
Above that they would like to know if we already know something, if we are already working on something, if something will happen in the next 24 hours or so. Again, via a site so the internal teams know that. At best with API to be integrated into their systems, which is probably even better.

marians commented 5 years ago

Minutes from a call with customer 2018-12-12

Rationale:
- have common overview and official info of the status of the infrastructure Giant Swarm provides
- enable engineers at customer to decide whether they need to investigate problems on their side
- self-service, prevent users asking questions which have to be answered individually
Role models
- Statuspage.io
- Cachet - https://demo.cachethq.io/
Status
- of systems/subsystems broken down
- high-level, traffic-light style, red/yellow/green
Events log
- Alerts, incidents could be displayed immediately in this system
- Downtimes
- Status changes
- Changes like upgrades
- Should reach back a couple of days at least, ideally unlimited
System levels
- Tenant cluster level very important
- Control plane less important, but interesting (e. g. Giant Swarm API availability)
Annotations by customer
- customer could imagine adding mark/events/incidents to the system by themselves
Link to Root Cause Analysis (RCA)
- so the RCA for an incident is available from the incident info
Integration
- customer has Atlassian statuspage which would ideally contain all the information described above
- customer will check the license model to see whether external data can be integrated
- Marian will look into Statuspage API, also Cachet, for API design supporting simple integration
Access control
- Not publicly accessible
- IP/network based access control would work
- Ideal: Access only with customer AD account and OIDC auth
- Must not require a Giant Swarm account to access, as only very few users have one
Next steps
- customer, Marian speak again in January. Marian invites.
- Preparation:
- Marian collects example alerts, incident snippets for what could be integrated in the system
- Marian looks at APIs
- customer talks with team regarding more detail requirements for incident and status info

marians commented 5 years ago

Ran into this: http://status.masto.host/ is created by https://nixstats.com/. Might be usable as another example in next round with AA.

J-K-C commented 5 years ago

As a customer, I want to be able to easily and efficiently check the visibility of the state of my clusters so I know whether they require attention or not.

J-K-C commented 5 years ago

What data do we want to display.

marians commented 5 years ago

Next steps:

[x] Collect what info/data we can provide today (Marian)
[ ] Competitor analysis (Jessica)
[x] Who can participate in spec? (product, tech) (Jessica)
[ ] Define which additional data we must make available (TBD)
[ ] Create a vision (TBD)
[ ] Define MVP scope (TBD)

marians commented 5 years ago

I started breaking down what this story may be about, as a basis for upcoming brainstorming sessions. Credits to Marcel for helping me.

On the one hand I would like to keep us open minded about the vision. I know our customers mean a lot of different things at once. We will have to boil it down and draw a meaningful path. So to start somewhere, I started mapping out what could flow into the thing we are talking about here.

https://docs.google.com/presentation/d/1I2_hz-bkOOK2--AqP63cUKREtLNnx6sfZJfJebFeG-Y/edit#slide=id.p

Typical disclaimer: WIP, early stage, nothing set in stone etc. More iterations needed even for the simple stuff in there.

puja108 commented 5 years ago

This is really good, to me all of those (maybe besides customer workload health) make lots of sense and should be there eventually. Thanks for structuring this, really nice

marians commented 5 years ago

We'll discuss tomorrow with more customers and users. Let's see how they see the differentiation between cluster health and workload health.

marians commented 5 years ago

For future reference, here is a very high level plan on how to tackle the topic:

https://docs.google.com/drawings/d/1CNzDAk6HPqeE8qqVtBu29iAZaauvPdP5Z9_XB7Bt4BM/edit

I collect some more input in this slides document: https://docs.google.com/presentation/d/1I2_hz-bkOOK2--AqP63cUKREtLNnx6sfZJfJebFeG-Y/edit#slide=id.p

marians commented 5 years ago

Data/Info collection started here: https://docs.google.com/spreadsheets/d/1nx2VdCwDEP2e5DlobjtFr7ANZTDso6_xDvzrs6sJu0A/edit#gid=0

marians commented 5 years ago

From giantswarm/giantswarm#6511:

As volumes being filled is a constant source of trouble in day to day life, it would be quite valuable to have volume usage date available in the MVP. I suggest that we discuss alternatives to acquire that data (currently provided by the tenant cluster node exporter, fetched by Prometheus) without relying on Prometheus.

puja108 commented 5 years ago

What are the reasons behind not using Prom and what would be the alternative?

JosephSalisbury commented 5 years ago

Yeah, not using Prometheus for this doesn't feel like the right direction

J-K-C commented 5 years ago

Does not scale and is not reliable enough according to @teemow. However, @cornelius-keller will be deep diving into the technical feasibility/ architecture of this feature at some point and then we can get some more data for a more data driven discussion. Please watch this space :)

cornelius-keller commented 5 years ago

As for now Prometheus is to my knowledge the only Source for metrics data I think we can use it. Also even if we are looking for a replacement it seems to me that this would take much longer then we want to wait for this feature. It seems a bad idea to couple Prometheus replacement with this story.

So far far as I understand the architecture by now we anyways need an api endpoint to expose the metrics data to the front end. This can act as a facede to Prometheus for now. Ideally we can replace Prometheus later without changing this api endpoint.

marians commented 5 years ago

Let me reconstruct what I remember from @teemow's statements on Prometheus as we run it currently, with respect to this story, at the risk of getting it wrong:

Prometheus is currently in a state where it doesn't seem suited for higher usage. "The current setup doesn't scale".
We should focus first (=for the MVP) on data we can get without relying on Prometheus.

I think it's up to us to look at this in more detail. Like so:

If we decide, for example, to query tenant cluster volume data from Prometheus, how will this affect the resource usage and stability of Prometheus?
How will the query pattern look like? (Likely: All tenant clusters, all volumes, once a minute)
How can we deal with temporary unavailability of Prometheus data in the health UI? (In short: design for an "unknown" state in all metrics)

puja108 commented 5 years ago

If you come up with such assertions then please state what you are comparing it against. "without relying on Prometheus" implies you rely on something else (at least for the metric that you are using as an example here). What system will you ask? Do you write your own? Will you keep it in a time series DB? Will you write your own DB or use Influx? No matter how you answer these questions, I do not see a solution without a tool "like" Prometheus, so like Cornelius said, why not just rely on what we have instead of building something completely redundant next to it? Also, if you want to put in the work, why not work on making Prometheus better or build workarounds (e.g. a cache) for the cases you mention? Also, keep in mind that currently the context of this is Happa (AFAIK) and it is not a heavily used tool, so any query load you generate is short-lived and usually pertains to single users. Yes, it will need to scale at some point, but so does our Prom setup need to scale, we already know that as we rely heavily on it for our SLA. That all said, if there's data you can get without Prom, please do so, but include the effort of building your tooling in your thoughts on this, otherwise this story might be blown out of proportions.

marians commented 5 years ago

The info and details (only some of them deserve the term "metrics") we want to focus on in the MVP are the ones we can get from

Kubernetes API: e. g. node details such as capacity, requests, limits
Provider resource details coming via our CRs, etc. AutoScalingGroup details coming via the cluster resource or whatever the node pool equivalent will be.

There might be more. These are two examples.

See https://github.com/giantswarm/giantswarm/issues/6139#issuecomment-516847947 for a visual representation of this sort of details on the node level. EDIT: The volume data is currently only available in Prometheus.

puja108 commented 5 years ago

Those look fine to me, and most really do not need Prometheus, which is cool. I am not 100% sure on the volumes stuff, but if that is the only thing in there that is not available through K8s itself, then I would skip it in this phase.

Getting things K8s APIs is definitely ok. The backend for the metrics you will get from the API currently come from metrics-server component, but in the future we might even have local Prom serving that, so as long as you ask K8s API for metrics we do not rely on a single backend.

BUT on K8s API I would also be careful about hammering the APIs and have some caching and not too much "live" data involved.

J-K-C commented 5 years ago

Hi all, suggest we park this discussion for a moment as the Architecture deep dive has been assigned to Cornelius of which it is only his 2nd week plus, Timo is AFK. Tomorrow we have an introductory session to this epic for Cornelius and the Ludacris team where we will also be reviewing the MVP and from there, I suggest we book a couple sessions where we can have some data driven discussions around this.. maybe a session in Rome if the timings right.

puja108 commented 5 years ago

I don't think there's more need for discussion here. All good, move forward.

teemow commented 5 years ago

My considerations were that in this story we are talking about the current state only. This is and should be in the status section of our CRDs. This is a very reliable source.

This story isn't about metrics and timeseries.

Prometheus isn't reliable. We had many flapping prometheus in control planes already. It can be easily wiped. Let's say the data structure is less defined and versioned. It is at its limit. So presenting metrics to the customer needs to wait until we have worked on the prometheus topology and maybe long term storage.

puja108 commented 5 years ago

My considerations were that in this story we are talking about the current state only. This is and should be in the status section of our CRDs. This is a very reliable source.

This story isn't about metrics and timeseries.

Fully agree!

Prometheus isn't reliable. We had many flapping prometheus in control planes already. It can be easily wiped. Let's say the data structure is less defined and versioned. It is at its limit. So presenting metrics to the customer needs to wait until we have worked on the prometheus topology and maybe long term storage.

Ok, this is maybe where the confusion came from. What I read from this is not "we should build sth else for metrics and replace prom" but "we need to improve our prom/metrics setup before we can rely on it for metrics", right?

teemow commented 5 years ago

"we need to improve our prom/metrics setup before we can rely on it for metrics"

yes

cornelius-keller commented 5 years ago

For now me @MarcelMue and @tfussell agreed on serving as a minimal viable product only information that we have either already in the operator or can fetch from k8s api server. That way in the first step Prometheus and other sources are out. This gives us more time to wait for stabilizing the prometheus situation and deliver something to iterate over in the mean time. The information will be gatherd in the CRD, as this is already exposed via the api and we will not need new endpoints etc.

marians commented 5 years ago

@MarcelMue @tfussell I took the results from our on-site session, wrote them into something that should be understandable for our users and made it a docs PR that we can keep open to modify as long as we are working on the first iteration.

When reflecting our outcome, I made a little change I would like to speak to the two of you about. In a nutshell:

We defined a rule like "cluster is only green if N nodes are in Ready state", with N being a slightly bigger formula. This means that on the cluster level we take arbitrary node details into consideration. I'd like it simpler:
"cluster is only green if N nodes are green". This means that on the cluster level, we only look at the health assessment of the nodes, not at the details that led to this assessment. This makes the aggregation concept easier to explain and to understand. Probably it also makes it easier to implement.

PR: https://github.com/giantswarm/docs/pull/344

teemow commented 5 years ago

From my point of view we should not show a green cluster if nodes are down. People using happa want to manage clusters. Not workloads on top of the cluster. Workloads on top of the cluster might still be green yes, but something is going on with the cluster if a node is not ready. And people should see this right away.

So eg:

NotReady node the cluster is yellow or orange
N nodes are NotReady cluster is becoming red
Node is NotReady but the cluster is upgrading maybe "blue"

marians commented 5 years ago

When scaling up, the cluster has several nodes NotReady for a while. Would you want it to be turning from green to yellow until the new nodes are Ready?

Regarding upgrades, I think our goal should be to adjust both the rules and the upgrade logic so that the cluster stays green during the upgrade. Reaching this goal is beyond the scope of this story, as it requires HA masters, but I'd like to keep it in mind. Until then, I'd be fine with the cluster turning yellow during an upgrade. I don't see a need for another color based on the definition of the three categories.

teemow commented 5 years ago

Scaling up is similar to upgrades. We need to distinguish the state of wanted and unwanted failure. We do the same thing within our alerts already.

Yes you should not work on upgrades. But it is the same already. We don't get paged while we perform an upgrade. The system knows that state.

teemow commented 5 years ago

Regarding the other color. How would you determine wanted and unwanted failure then?

marians commented 5 years ago

How would you determine wanted and unwanted failure then?

Ideally we give users access to the reasons why a cluster is considered passable (yellow) or bad (red) via the UI.

teemow commented 5 years ago

I am pretty sure this will cause a lot of confusion. Yellow indicates something is not according to plan. But during an upgrade node status changes according to a plan. There is nothing to worry about. Still people need to have an indicator for this. I don't mind what colors we use for these states. But imo we should distinguish those 4 cases.

All good
Something happening but all according to plan
Something degraded
Something broken

And speaking of it. During an upgrade the cluster isn't even degraded as in most cases we launch a new instance before the old one is teared down.

cornelius-keller commented 4 years ago

I have looked again through all the history of this story and the related work especially from @marians . Initially it seems I have underestimated the technical complexity and the different requirements to this story from customer / ui perspective vs the internal technical challenges like moving towards cluster api, and having an operator readable cluster status where other operators can react on.

After all I would like to suggest a new MVP. I suggest to have a very simple traffic light status per cluster, based in the beginning only on the number of desired nodes and the number of ready nodes.

If there are all desired nodes ready -> cluster is green. If there are between 1 and 20 % of the desired nodes not ready -> cluster is yellow. If there are more then 20 % of the nodes not ready -> cluster is read.

A node that is not there at all because for example it is not created yet by the infrastructure or because the infrastructure failed will be considered as not ready.

In the first iteration I think we can ignore the intermediate state. As @teemow pointed out during upgrades the new nodes are created before old ones are deleted. So the cluster should stay green.

In other cases I think it is consequent and most easily to explain what happens if we keep the status evaluation simple in the beginning. If I look at for example elastic search a cluster becomes yellow if you add a node and it starts to re balance shards. Yellow would just mean "Desired state is not current state, but it is not bad yet". Whereas red means "Desired state is not current state and it is probably bad".

I would like to have this for all clusters, regardless if they are using node pools or not. We could have the same thing per node pool if the cluster uses them.

I think this is the minimal thing that provides user value and does not cause to many technical uncertainties as the information is probably already existing or could be easily added to the current CRs.

Based on customer feedback we can then decide to add more information or states to the traffic light or for example work on having single sign on for grafana so that we can reuse the dashboards that we have there.

WDYT?

teemow commented 4 years ago

Sounds good to me. Small steps will help us to align with cluster-api upstream and the different implementation levels we have in the operators.

Will this distinguish between "node not ready" and "api unavailable"?

Btw on Azure we don't create new instances first and then tear down the old one afaik. Not sure about KVM. On AWS the ASG definitely creates a new instance before the old one is teared down.

cornelius-keller commented 4 years ago

I think extending this in a way that master down would mean also red should be easy. With multi master the semantics would then probably to all masters down.

Regarding the states during creation and upgrading: I still think that this is easy to explain to the users, and I would like to add more states based on user feedback.

For a three node cluster this even means that it turns red during updates if we don't create the new node before adding a new one. But if you run a three node cluster in production this is actually a bad thing as it means you lost 1/3 of your capacity and this will probably affect your workload. So showing it red during the update still seems appropriate to me. On the other hand we could think on having a threshold of 5% before switching from green to yellow, so big clusters don't turn yellow during upgrades. But also this I would like to tweak after customer feedback.

puja108 commented 4 years ago

Going in the right direction. I suggest you evaluate two things in more detail. Both will need a bit of thoughts and maybe even some tryouts and tests either manually or automated:

What is the exact percentages/limits you want status to be changing at?
Should status be separated for masters vs nodes and between node pools?

Point 2 might be something you will just leave out of MVP, but need to be careful that if you decide to not separate anything you will need a bit more detailed communication to not confuse people about their status. In my experience people worry in different ways about API down vs nodes not ready.

snizhana-dynnyk commented 3 years ago

A customer mentioned this in a feedback call about our monitoring - that they would like to have some sort of a traffic light system for workload clusters. This would help them to develop trust in our monitoring and also give an overview of workload clusters' health.

I will add this issue to a Product board so we can discuss it on Monday. We might consider implementing this traffic light system in Management API.

Additionally, defining the concepts of 'red' / 'yellow' clusters might be useful for our internal operations, e.g. prioritizing postmortems or even using the min number of 'red' clusters as an outcome of the reliability goal.

puja108 commented 3 years ago

This still makes sense in Ludacris, but we'd currently not prioritize it very high, as the source of truth for such health being in CRDs will change with CAPI (and upstream CAPI has similar health stories they are thinking about). Thus, I would revisit this once we're further on CAPI unless there's increased priority from the customer side on this.

For now I would also say that giving access to our grafana dashboards should at least start giving a first picture, not in our own interface but at least in a first interface. Ludacris will definitely also look at what dashboards they own and how those will be experienced by customers.

JosephSalisbury commented 1 year ago

@puja108 can we find a home for this? it looks lost somehow

puja108 commented 1 year ago

I'd move this over to one of the KaaS teams, so it can be checked against and maybe merged with the general cluster health within CAPI story. The idea back in Ludacris was mainly, let's for now rely on the health status we get through CAPI. We'd then need to see where we expose it aka dashboard vs/and happa/kgs. I know for example that the azure CLI and in some way also clusterctl show an aggregate cluster health in the command line.

cc @alex-dabija @gawertm @cornelius-keller which team would be closest to this right now?

alex-dabija commented 1 year ago

@puja108 can we find a home for this? it looks lost somehow

The issue is not lost. We agreed in KaaS (some time ago), that Ludacris' backlog we'll stay on the KaaS Sync's board until either Rocket or Hydra has a need to implement the feature.

We (@cornelius-keller, @gawertm and me) discuss it quickly today in the KaaS Product Sync and agreed that it's still best to pull in the story in one of the teams when it's needed.

cc @alex-dabija @gawertm @cornelius-keller which team would be closest to this right now?

Unfortunately, it's difficult to say which team is closest because we are mostly focused on having stable clusters.

teemow commented 4 months ago

@puja108 @marians this is still interesting in terms of fleet management. Especially for an interface like backstage in which you can drill down (health) information about an installation or cluster. Eg seeing the state of applications on a cluster or the current alerts of the cluster itself.

marians commented 3 months ago

I'm putting these three related issues on Honeybadger's board for Backstage.

giantswarm / roadmap