Ensure the external connectivity to api servers on the workload clusters

snizhana-dynnyk commented 3 years ago

User Story

- As a Giant Swarm Engineer, I would like to know immediately if there is a problem with connectivity to an API server on a workload cluster so that I can notify the affected customer and resolve the issue.

Rationale

Currently, we only monitor the connection to the API server from the management clusters. However, it may happen that there is a connectivity issue between the API server and e.g. customer CI system. We want to know about such issues immediately. There are potentially other systems besides the API server for which we want to monitor the external connectivity.

TODO

[ ] Use the blackbox-exporter on the MC to check against the APIServer on the WC
[ ] Investigate if the blackbox exporter scales with newly created WCs
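For illustration, here is a minimal sketch of how per-cluster probes could be derived from a list of WCs. It relies on the blackbox exporter's `/probe` endpoint with its `module` and `target` query parameters. The cluster IDs, the hostnames, the exporter address and the choice of `/healthz` as the probed path are assumptions made up for this example, not taken from our setup.

```python
from urllib.parse import urlencode


def build_probe_url(exporter_base: str, api_endpoint: str, module: str = "http_2xx") -> str:
    """Return the blackbox exporter /probe URL for one WC API server."""
    # The exporter takes the probed address as a URL-encoded `target` parameter.
    target = f"{api_endpoint.rstrip('/')}/healthz"
    query = urlencode({"module": module, "target": target})
    return f"{exporter_base.rstrip('/')}/probe?{query}"


# Hypothetical inventory: cluster ID -> API server URL.
workload_clusters = {
    "abc12": "https://api.abc12.example.com",
    "xyz34": "https://api.xyz34.example.com",
}

# One probe URL per cluster; a newly created WC only adds one more entry.
probe_urls = {
    cluster: build_probe_url("http://blackbox-exporter:9115", endpoint)
    for cluster, endpoint in workload_clusters.items()
}
```

In this sketch the number of probes grows linearly with the number of WCs. Whether that is acceptable for a single exporter on the MC is the open question in the second TODO item.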

snizhana-dynnyk commented 3 years ago

Ping @othylmann this is a public issue, I will be posting updates here.

QuentinBisson commented 8 months ago

@Rotfuks I think this should belong to Turtles instead of Atlas. What do you think?

Rotfuks commented 7 months ago

Hey @weseven! Since you're looking into the APIServer alerts and monitoring in giantswarm/giantswarm#29116: is this something we already covered there, or would it be an additional check from outside the cluster?

weseven commented 7 months ago

I think this is something new, if I understood correctly it would require a connection test from something else other than the current monitoring stack, so I think it's unrelated at the moment. I'm not sure if there are already similar checks in place for components.

Rotfuks commented 7 months ago

Alright, thought as much.

So we need some external source that pings the API. Could we maybe use Tekton pipelines for that? What do you think, @giantswarm/team-tinkerers: is this a check we could set up with your pipelines?

AverageMarcus commented 7 months ago

It could be done using Tekton Pipelines, but I don't think that is the appropriate tool for this. If I'm understanding this correctly, we're basically looking for an uptime monitor, right?

I'm having a hard time understanding what we are trying to achieve that the current monitoring stack doesn't provide.

> However, it may happen that there is a connectivity issue between the API server and e.g. customer CI system.

The only way we can really monitor for this is to run something in each customer's own network.

This issue seems to indicate that just because the MC can connect to the WC api-server that isn't enough to say that the customer can. Similarly, just because we're able to connect to the WC from anywhere else in our infra (Tekton) it also isn't enough to say that the customer can.

I'm also concerned about the amount of access the Tekton instance would then have. We'd be creating a single point in our infrastructure that has access to ALL of our customers' clusters (assuming we're testing for an actual successful api-server response).

QuentinBisson commented 7 months ago

I think one of the issues with the current monitoring stack (in-cluster agent) is that we cannot ensure that the api-server is accessible from outside. The only place we can technically act upon is the MC, so I think having something like the blackbox exporter ping the WC api server from the MC (say, once every 5 minutes) would already be a good starting point.
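To make the idea concrete, here is a rough sketch of what a single connectivity probe boils down to: open a TCP connection to the API server, then record success and latency. This is not the blackbox exporter's implementation; `ProbeResult` and `probe_tcp` are hypothetical names for this example, and a real check would probably also validate TLS and the HTTP response.

```python
import socket
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    success: bool
    duration_seconds: float
    error: str = ""


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> ProbeResult:
    """Attempt one TCP connection and report whether it succeeded."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and connects within `timeout`.
        with socket.create_connection((host, port), timeout=timeout):
            return ProbeResult(True, time.monotonic() - start)
    except OSError as exc:
        # Covers refused connections, timeouts and DNS failures.
        return ProbeResult(False, time.monotonic() - start, str(exc))
```

Run from the MC on a fixed interval (for example every 300 seconds), a failing result would only show that the MC cannot reach the WC API server. As noted above, it says nothing about reachability from a customer's network.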

Rotfuks commented 7 months ago

Hmm... Looking at the creation date of this issue, it's not too critical. But I see the appeal of it. I'll add it to our Observability Improvements Epic in the backlog and we can revisit it later. Thanks for bringing it up!

QuentinBisson commented 7 months ago

Thank you, I removed it from the Atlas board :)

QuentinBisson commented 7 months ago

@giantswarm/team-turtles for dashboarding, you might want to use this as a reference: https://github.com/giantswarm/giantswarm/files/14012738/Cluster.Uptime-1705676241066.json, especially since we have 3 dashboards that expose cluster api-server uptime.

Rotfuks commented 7 months ago

This is a related issue setting up infrastructure for synthetic monitoring: https://github.com/giantswarm/roadmap/issues/3225 Maybe we should wait on these results, if the priority here doesn't become more critical.

LolloneS commented 6 months ago

@Rotfuks on which results are you waiting? The other issue points to this one and states that you as Turtles are working on checking if Prometheus Blackbox Exporter is a good solution. To me it reads like this ticket waits on the other and the other waits on this one.

From this ticket: Maybe we should wait on these results, if the priority here doesn't become more critical.

From the other ticket: @giantswarm/team-turtles is already working on using the blackbox exporter to check that the api server is accessible and responding from the MC, let's check if that tooling is enough.

Can we just get started? We have the prometheus blackbox exporter already IIRC, it shouldn't be anything crazy to test out.

weseven commented 6 months ago

As per today's daily, we're waiting for the blackbox exporter to be configured and set up in the MCs.
