giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

Report about the nginx ingress controller in the SLA reporting dashboard #3229

Open QuentinBisson opened 3 months ago

QuentinBisson commented 3 months ago

Atlas recently added a new dashboard to report about our SLAs in this PR https://github.com/giantswarm/dashboards/pull/433/files.

We have SLAs on nginx and those should be added to this dashboard.

What we would like is for you to create an SLO using sloth that exposes the availability of the public nginx ingress controller that is optionally deployed in the clusters.

Here is an example of an SLO alert: https://github.com/giantswarm/sloth-rules/blob/5a0ee4a6c65dfdd849d197cd3b873d8fa342dec4/areas/kaas/teams/phoenix/apiserver/slos/availability.yaml

Feel free to ask @giantswarm/team-atlas for help creating the SLO rule and alert and integrating the data into the dashboard.
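For illustration, here is a minimal sketch of what such a sloth spec could look like, modeled on the linked apiserver example. The service name, labels, SLI queries, and alert names are assumptions, not the final rule:

```yaml
# Hypothetical sloth spec (upstream prometheus/v1 format).
# Metric choice, labels and alert names are illustrative assumptions.
version: "prometheus/v1"
service: "nginx-ingress-controller"
labels:
  team: cabbage
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "Availability of the public nginx ingress controller."
    sli:
      events:
        # 5xx responses counted as errors; see the discussion below on
        # whether this metric reflects nginx itself or the workload.
        errorQuery: sum(rate(nginx_ingress_controller_requests{status=~"5.."}[{{.window}}]))
        totalQuery: sum(rate(nginx_ingress_controller_requests[{{.window}}]))
    alerting:
      name: NginxIngressControllerAvailabilityLow
      pageAlert:
        labels:
          severity: page
      ticketAlert:
        labels:
          severity: ticket
```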

QuentinBisson commented 1 month ago

@weatherhog when do you expect your team to work on this?

weatherhog commented 1 month ago

@QuentinBisson I understood that this is not urgent and that's why it is living in our backlog at the moment without a clear ETA. Let me know if there is some urgency from atlas to get this done, so that we can rearrange in cabbage.

QuentinBisson commented 1 month ago

Coming from the umbrella issue, I thought it was quite important for customers: https://github.com/giantswarm/roadmap/issues/3228#issuecomment-1960945105 :D

QuentinBisson commented 1 month ago

@LolloneS any thoughts on the matter?

mcharriere commented 1 month ago

> What we would like is for you to create an SLO using sloth that exposes the availability of the public nginx ingress controller that is optionally deployed in the clusters.

I think we need to make this a bit clearer before proceeding: what does availability mean for our customers?

Let's say we used the nginx_ingress_controller_requests metric: that would actually say nothing about nginx itself but rather about the workload behind it. Any other metric showing the "controller" health (like admission) might look OK, but it may not have enough samples to provide useful information, and it doesn't actually show that nginx works.

Then there is the "latency" aspect of availability: what response time is considered fast enough to pass the availability check?

QuentinBisson commented 1 month ago

I think this should be about our availability, not the customer's, since their endpoints have different response times and so on. But then it's up to you to define what you think availability is.

mcharriere commented 1 month ago

> As a customer, I want to be able to see the availability of the important Kubernetes components and check if they are respecting the agreed SLAs.

I'm taking this from the main ticket. I don't think there is a clear definition of what we need to do here.

Maybe @LolloneS can help us here.

Also, the dashboard is SLO Reporting on Grafana, right? Are you going to make it public at some point?

QuentinBisson commented 1 month ago

It is currently private, but yes, it will be public once all cabbage SLOs are there. I agree that this is too broad and you need to define what the availability is. The agreed SLA in the contracts is 99.5% availability, but we should be the ones to decide what we mean by availability.

LolloneS commented 1 month ago

@mcharriere I think that, as always, we should start with something simple and then iterate on it.

mcharriere commented 1 month ago

> Availability in this case refers to the Nginx deployment actually being up & running and serving requests. A ping once a minute to a dummy Ingress is all we need.

I see. I added this to refinement to discuss our options. At first glance, I don't think we have the infrastructure to generate synthetic traffic from outside the cluster, but maybe generating it from inside the cluster is enough and simpler to implement.

QuentinBisson commented 1 month ago

I think we should be able to use the blackbox exporter or a Prometheus Operator probe from the MC though.
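As a minimal sketch, assuming the blackbox exporter runs on the MC, such a Probe could look roughly like this; the namespace, prober address, and target URL are all hypothetical:

```yaml
# Hypothetical Prometheus Operator Probe; the prober service address
# and the target host are assumptions for illustration only.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: nginx-ingress-availability
  namespace: monitoring
spec:
  interval: 60s
  module: http_2xx
  prober:
    # blackbox exporter running on the MC (assumed service address)
    url: blackbox-exporter.monitoring.svc:9115
  targets:
    staticConfig:
      static:
        # endpoint exposed by the WC's public nginx (hypothetical host)
        - https://slo-probe.example-wc.example.com/healthz
```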

mcharriere commented 1 month ago

> I think we should be able to use the blackbox exporter or a prometheus operator probe from the MC though

As we recently learned in SIG arch, that's not always guaranteed. For example, let's take the case of a public installation, where the MC probes both the public and the private ingress controller running on the WC.

The public ingress will work fine, but the private ingress will fail. And vice versa.


Also, a dummy ingress means that the unavailability of its backend counts against the SLA. We could probably use the default-backend instead, so there is no need for a separate App.


To try to avoid all these MC/WC connectivity issues, I'd say we start with in-cluster traffic alone. We could change net-exporter to query some predefined ingresses pointing to the default-backend, as sketched below.
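A minimal sketch of such a probe Ingress, assuming a generic default-backend service name; the host, namespace, and names are placeholders:

```yaml
# Hypothetical probe Ingress routed to the ingress-nginx default backend;
# host, namespace and backend service name are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: slo-probe
  namespace: kube-system
spec:
  ingressClassName: nginx
  rules:
    - host: slo-probe.example.com
      http:
        paths:
          - path: /healthz
            pathType: Prefix
            backend:
              service:
                name: ingress-nginx-default-backend
                port:
                  number: 80
```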


mcharriere commented 1 month ago

For the first iteration, we decided to go with a solution confined to net-exporter: we expose a /healthz endpoint and install N ingresses, which should be checked at a customizable interval.


We'll report the SLO based on those metrics.
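As a sketch, assuming net-exporter ends up exposing a counter per probed Ingress (the metric name below is purely hypothetical), the sloth SLI could then be fed like this:

```yaml
# Purely hypothetical metric name; whatever counter net-exporter
# actually exposes would be substituted here.
sli:
  events:
    errorQuery: sum(rate(net_exporter_ingress_probe_requests_total{code!~"2.."}[{{.window}}]))
    totalQuery: sum(rate(net_exporter_ingress_probe_requests_total[{{.window}}]))
```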

kopiczko commented 1 month ago

Things we agreed on in Slack:

kopiczko commented 4 weeks ago
kopiczko commented 12 hours ago

Feels like everything is done on our side. We have:

The small thing that's left is the ops recipe, but PRs are open.

I'd like to create a small follow-up issue with these tasks:

And close this. @QuentinBisson WDYT?