giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

Report about the nginx ingress controller in the SLA reporting dashboard #3229

Open QuentinBisson opened 3 months ago

QuentinBisson commented 3 months ago

Atlas recently added a new dashboard to report about our SLAs in this PR https://github.com/giantswarm/dashboards/pull/433/files.

We have SLAs on nginx and those should be added to this dashboard.

What we would like is for you to create an SLO using sloth that exposes the availability of the public nginx ingress controller that is optionally deployed in the clusters.

Here is an example of an SLO alert: https://github.com/giantswarm/sloth-rules/blob/5a0ee4a6c65dfdd849d197cd3b873d8fa342dec4/areas/kaas/teams/phoenix/apiserver/slos/availability.yaml

Feel free to ask @giantswarm/team-atlas for help creating the SLO rule and alert and integrating the data into the dashboard.
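For illustration, here is a minimal sketch of what such a sloth spec could look like, modeled on the linked apiserver example. The service name, labels, SLI queries, and alert names are assumptions, not the final rule:

```yaml
# Hypothetical sloth spec (upstream prometheus/v1 format).
# Metric choice, labels and alert names are illustrative assumptions.
version: "prometheus/v1"
service: "nginx-ingress-controller"
labels:
  team: cabbage
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "Availability of the public nginx ingress controller."
    sli:
      events:
        # 5xx responses counted as errors; see the discussion below on
        # whether this metric reflects nginx itself or the workload.
        errorQuery: sum(rate(nginx_ingress_controller_requests{status=~"5.."}[{{.window}}]))
        totalQuery: sum(rate(nginx_ingress_controller_requests[{{.window}}]))
    alerting:
      name: NginxIngressControllerAvailabilityLow
      pageAlert:
        labels:
          severity: page
      ticketAlert:
        labels:
          severity: ticket
```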

QuentinBisson commented 1 month ago

@weatherhog when do you expect your team to work on this?

weatherhog commented 1 month ago

@QuentinBisson I understood that this is not urgent and that's why it is living in our backlog at the moment without a clear ETA. Let me know if there is some urgency from atlas to get this done, so that we can rearrange in cabbage.

QuentinBisson commented 1 month ago

Coming from the umbrella issue, I thought it was quite important for customers: https://github.com/giantswarm/roadmap/issues/3228#issuecomment-1960945105 :D

QuentinBisson commented 1 month ago

@LolloneS any thoughts on the matter?

mcharriere commented 1 month ago

> What we would like is for you to create an SLO using sloth that exposes the availability of the public nginx ingress controller that is optionally deployed in the clusters.

I think we need to make this a bit clearer before proceeding: what does availability mean for our customers?

Let's say we used the nginx_ingress_controller_requests metric: that would actually say nothing about nginx itself but rather about the workload behind it. Any other metric showing the "controller" health (like admission) might look OK, but it may not have enough samples to provide useful information, and it doesn't actually show that nginx works.

Then there is the "latency" aspect of availability: what response time is considered fast enough to pass the availability check?

QuentinBisson commented 1 month ago

I think this should be about our availability, not the customer's, since their endpoints have different response times and so on. But then it's up to you to define what you think availability is.

mcharriere commented 1 month ago

> As a customer, I want to be able to see the availability of the important Kubernetes components and check if they are respecting the agreed SLAs.

I'm taking this from the main ticket. I don't think there is a clear definition of what we need to do here.

Maybe @LolloneS can help us here.

Also, the dashboard is SLO Reporting on Grafana, right? Are you going to make it public at some point?

QuentinBisson commented 1 month ago

It is currently private, but yes, it will be public once all cabbage SLOs are there. I agree that this is too broad and you need to define what the availability is. The agreed SLA in the contracts is 99.5% availability, but we should be the ones to decide what we mean by availability.

LolloneS commented 1 month ago

@mcharriere I think that, as always, we should start with something simple and then iterate on it.

mcharriere commented 1 month ago

> Availability in this case refers to the Nginx deployment actually being up & running and serving requests. A ping once a minute to a dummy Ingress is all we need.

I see. I added this to refinement to discuss our options. At first glance, I don't think we have the infrastructure to generate synthetic traffic from outside the cluster, but maybe generating it from inside the cluster is enough and simpler to implement.

QuentinBisson commented 1 month ago

I think we should be able to use the blackbox exporter or a Prometheus Operator probe from the MC though.
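As a minimal sketch, assuming the blackbox exporter runs on the MC, such a Probe could look roughly like this; the namespace, prober address, and target URL are all hypothetical:

```yaml
# Hypothetical Prometheus Operator Probe; the prober service address
# and the target host are assumptions for illustration only.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: nginx-ingress-availability
  namespace: monitoring
spec:
  interval: 60s
  module: http_2xx
  prober:
    # blackbox exporter running on the MC (assumed service address)
    url: blackbox-exporter.monitoring.svc:9115
  targets:
    staticConfig:
      static:
        # endpoint exposed by the WC's public nginx (hypothetical host)
        - https://slo-probe.example-wc.example.com/healthz
```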

mcharriere commented 1 month ago

> I think we should be able to use the blackbox exporter or a prometheus operator probe from the MC though

As we recently learned in SIG arch, that's not always guaranteed. For example, let's take the case of a public installation, where the MC probes both the public and the private ingress controller running on the WC.

The public ingress will work fine, but the private ingress will fail. And vice versa.


Also, a dummy ingress means that the unavailability of its backend counts against the SLA. We could probably use the default-backend instead, so there is no need for a separate App.


To try to avoid all these MC/WC connectivity issues, I'd say we start with in-cluster traffic alone. We could change net-exporter to query some predefined ingresses pointing to the default-backend, as sketched below.
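A minimal sketch of such a probe Ingress, assuming a generic default-backend service name; the host, namespace, and names are placeholders:

```yaml
# Hypothetical probe Ingress routed to the ingress-nginx default backend;
# host, namespace and backend service name are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: slo-probe
  namespace: kube-system
spec:
  ingressClassName: nginx
  rules:
    - host: slo-probe.example.com
      http:
        paths:
          - path: /healthz
            pathType: Prefix
            backend:
              service:
                name: ingress-nginx-default-backend
                port:
                  number: 80
```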


mcharriere commented 1 month ago

For the first iteration, we decided to go with a solution confined to net-exporter: we expose a /healthz endpoint and install N ingresses, which should be checked at a customizable interval.


We'll report the SLO based on those metrics.
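As a sketch, assuming net-exporter ends up exposing a counter per probed Ingress (the metric name below is purely hypothetical), the sloth SLI could then be fed like this:

```yaml
# Purely hypothetical metric name; whatever counter net-exporter
# actually exposes would be substituted here.
sli:
  events:
    errorQuery: sum(rate(net_exporter_ingress_probe_requests_total{code!~"2.."}[{{.window}}]))
    totalQuery: sum(rate(net_exporter_ingress_probe_requests_total[{{.window}}]))
```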

kopiczko commented 1 month ago

Things we agreed on in Slack:

kopiczko commented 4 weeks ago
kopiczko commented 12 hours ago

Feels like everything is done on our side. We have:

The small thing that's left is the ops recipe, but PRs are open.

I'd like to create a small follow-up issue with these tasks:

And close this. @QuentinBisson WDYT?