Provide a synthetic monitoring pipeline

QuentinBisson commented 5 months ago

Initial discussion coming from https://github.com/giantswarm/giantswarm/issues/22643

Motivation

It is crucial that we can ensure our important components (list to be defined later) are monitored properly and are working as expected. To that end, we need to have tools to be able to do synthetic monitoring for our components.

As a first step, we should provide tooling that can check that the components are accessible from the MC (like the blackbox exporter), but we should also think about providing synthetic monitoring from outside the installation, especially for ingresses.

Details, Background

@giantswarm/team-turtles is already working on using the blackbox exporter to check that the API server is accessible and responding from the MC. Let's check if that tooling is enough.

I'm not sure how such things would work with private clusters. @giantswarm/team-rocket maybe you have some thoughts?

Blocked by / depends on

glitchcrab commented 5 months ago

Where do you imagine external checks for ingresses will be run from?

vxav commented 5 months ago

I'm assuming some service running in the cloud that does an HTTP GET on a URL.

The only thing I can see working for private clusters is an agent-based service that reaches outbound, similar to Teleport in a way.

QuentinBisson commented 5 months ago

I'm not really sure, @glitchcrab; that would be part of the discovery phase. Either one of our WCs, a cloud service, or something else :)

glitchcrab commented 5 months ago

@QuentinBisson sure, my reason for asking is that if it's running somewhere which is under our control then you can do stuff like sending traffic through our VPN etc. That's not going to be possible from a cloud service unless you run it through a proxy which forwards it through our VPN (and then you'd have to do some fudgery with DNS records). Airgapped installations like gerbil have no internet-facing ingress at all.

QuentinBisson commented 5 months ago

For sure, and then I guess this would be a requirement. But isn't the VPN going away eventually? How would that work then?

Also, it's possible airgapped installations will not be able to have this feature 🤔

glitchcrab commented 5 months ago

The VPN can't disappear entirely (AFAIK) because we use it to access resources which cannot be 'teleported', such as on-prem UIs for vSphere, etc.

QuentinBisson commented 5 months ago

Oh, then it would stay for on-prem only?

glitchcrab commented 5 months ago

> Oh, then it would stay for on-prem only?

I couldn't say for certain, but unless there's some Teleport magic which I don't know about then yes, we still need it in some cases.

gawertm commented 5 months ago

We are working on that Teleport magic in BigMac. Also, bootstrapping an MC is still an issue. But generally the goal is to remove the VPN completely. Let's see if we can achieve that 100%.

LolloneS commented 5 months ago

I wouldn't overfocus on the ingress topic. As a starting point, even having a simple ping from inside the cluster that pings $customer-online-shop.com and checks that it returns a 200 would be nice. Then we can improve, iterate, etc., but as of now even that one would be good for ingresses.
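To make the idea concrete, here is a minimal sketch of such a check in Python, using only the standard library. It is an illustration under assumptions, not an agreed design: the names `ProbeResult`, `http_status` and `probe`, the 5 second timeout, and the example URL are all hypothetical, and a real setup might simply use the blackbox exporter instead.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    """Outcome of one synthetic check (hypothetical structure)."""
    url: str
    success: bool
    status: Optional[int]
    duration_seconds: float
    error: Optional[str] = None


def http_status(url: str, timeout: float) -> int:
    """Issue a GET and return the HTTP status code of the final response."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers still carry a status code worth reporting.
        return err.code


def probe(
    url: str,
    timeout: float = 5.0,          # assumed default, tune per target
    expected_status: int = 200,
    fetch: Callable[[str, float], int] = http_status,
) -> ProbeResult:
    """Check that `url` answers with `expected_status` within `timeout`."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout)
    except OSError as err:
        # Covers DNS failures, refused connections and timeouts
        # (urllib.error.URLError is a subclass of OSError).
        return ProbeResult(url, False, None, time.monotonic() - start, str(err))
    return ProbeResult(url, status == expected_status, status, time.monotonic() - start)


if __name__ == "__main__":
    # Placeholder target; replace with the customer-facing URL to check.
    print(probe("https://example.com"))
```

The `fetch` parameter is only there so the check can be exercised without network access. The success flag and duration could later be exported as metrics and feed the SLA reporting.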

LolloneS commented 3 months ago

@QuentinBisson @Rotfuks can we ensure this is worked on in the next 2-3 months at most so that then we have another 2-3 months to put the SLA reporting based on it together? Thanks.

Rotfuks commented 3 months ago

For now we have put this issue on ice, as it is an extension of the existing sources of data. We are currently focusing mainly on stabilizing and unifying the existing sources of data through the introduction of Mimir, and on making sure Loki runs as stably as possible everywhere. Adding synthetic monitoring is definitely a high-value opportunity, but we currently see some other topics as higher priority, so I can't promise anything on the roadmap yet. Is there a definite, time-critical need for this feature, or would it just be really nice to have it soon?

LolloneS commented 3 months ago

My opinion: we need it within 3-4 months so that teams then have 1-2 months to implement proper SLA reporting.

Rotfuks commented 2 months ago

Hmm... I'm not sure that fits into our current roadmap, as we want to focus on getting metrics and logs into a clean state for everyone in the next months. So no promises. But I'll keep it on our radar for when we have some more wiggle room to add new stuff to the roadmap.
