QuentinBisson opened 5 months ago
Where do you imagine external checks for ingresses will be run from?
I'm assuming some service running in the cloud that does an HTTP GET on a URL.
The only thing I can see working for private clusters is an agent-based service that reaches out over an outbound connection, similar to Teleport in a way.
I'm not really sure, @glitchcrab; that would be part of the discovery phase. Either one of our WCs, a cloud service, or something else :)
@QuentinBisson sure, my reason for asking is that if it's running somewhere under our control, then you can do things like sending traffic through our VPN. That's not going to be possible from a cloud service unless you run it through a proxy which forwards it through our VPN (and then you'd have to do some fudgery with DNS records). Airgapped installations like gerbil have no internet-facing ingress at all.
For sure, and then I guess this would be a requirement. But isn't the VPN going away eventually? How would that work then?
Also, it's possible airgapped installations will not be able to have this feature 🤔
The VPN can't disappear entirely (AFAIK) because we use it to access resources which cannot be 'teleported' - such as onprem UIs for vsphere etc
Oh, then it would stay for onprem only?
I couldn't say for certain, but unless there's some Teleport magic which I don't know about then yes, we still need it in some cases.
We are working on that Teleport magic in BigMac. Also, bootstrapping an MC is still an issue, but generally the goal is to remove the VPN completely. Let's see if we can achieve that 100%.
I wouldn't overfocus on the ingress topic. As a starting point, even having a simple ping from inside the cluster that pings $customer-online-shop.com and checks that it returns a 200 would be nice. Then we can improve, iterate, etc., but as of now even that one would be good for ingresses.
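As a rough sketch of that starting point, the in-cluster check could be as simple as an HTTP GET that verifies a 200 response (the URL here is a hypothetical placeholder for the customer's actual ingress hostname, not anything from this thread):

```python
import urllib.error
import urllib.request

# Hypothetical target; replace with the customer's real ingress URL.
URL = "https://example.com/"

def check_ingress(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("up" if check_ingress(URL) else "down")
```

Something like this could run as a CronJob inside the cluster and export a metric; iterating toward the blackbox exporter would then be a natural next step.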
@QuentinBisson @Rotfuks can we ensure this is worked on in the next 2-3 months at most so that then we have another 2-3 months to put the SLA reporting based on it together? Thanks.
For now we have put this issue on ice, as it is an extension of the existing sources of data. We are currently focusing mainly on stabilizing and unifying the existing sources of data through the Mimir introduction and on making sure Loki runs everywhere as stably as possible. Synthetic monitoring is definitely a high-value opportunity to add, but we currently see some other topics as higher priority, so I can't promise anything on the roadmap yet. Is there a definitive, time-critical need for this feature, or would it just be really nice to have it soon?
My opinion: we need it within 3-4 months so that teams then have 1-2 months to implement proper SLA reporting.
Hmm... I'm not sure that fits into our current roadmap, as we want to focus on getting metrics and logs into a clean state for everyone in the next months. So no promises. But I'll keep it on our radar for when we have some more wiggle room to add new things to the roadmap.
Initial discussion coming from https://github.com/giantswarm/giantswarm/issues/22643
Motivation
It is crucial that we can ensure our important components (list to be defined later) are monitored properly and are working as expected. To that end, we need tooling for synthetic monitoring of our components.
As a first step, we should provide tooling that can check that components are accessible from the MC (like the blackbox exporter), but we should also think about providing synthetic monitoring from outside the installation, especially for ingresses.
Details, Background
@giantswarm/team-turtles is already working on using the blackbox exporter to check that the API server is accessible and responding from the MC; let's check whether that tooling is enough.
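For reference, a minimal blackbox exporter module for that kind of HTTP check could look like the sketch below. This is illustrative only, not the actual Turtles configuration; the module name and timeout are assumptions.

```yaml
modules:
  http_2xx:                  # illustrative module name
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []   # defaults to 2xx
```

Prometheus would then scrape the exporter's `/probe` endpoint with the target URL and module as query parameters.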
I'm not sure how such things would work with private clusters. @giantswarm/team-rocket maybe you have some thoughts?
Blocked by / depends on