Discussion: network security of installation-internal services

piontec commented 5 months ago

We want to run an installation-wide service on an MC. This service, to be more specific, is zot, a container images cache. We want this service to be accessible from the MC and from all WCs of this installation, but not elsewhere. One possible solution is to use our public Ingress and expose it there. Still, by making it public, we're risking the service and its resources being DoS'ed. To avoid that, we can enable auth on ingress, but we also want to avoid it like hellfire, as it will make us configure all the containerds of all nodes to use this auth token - the token distribution itself is a pain, but even worse, a potential secret leak would mean we have to reconfigure all the nodes of all clusters.

As such, our idea is to make this service available without any auth, but limited on the networking level to the installation itself only.

(ping @stone-z and @allanger for visibility and shared interest)

Now, here are some of the questions that come to mind :

I already learned we have a feature called "private ingress" - an ingress, that is internal to a VPC the installation runs in. It is available for AWS and should be soon available for Azure. @kopiczko @erkanerol, please correct me if I'm talking bullshit or this feature doesn't work like this at all and won't help us in our use case.
I think this use case is much wider. Atlas could use something like that, but currently they have some workarounds that distribute access credentials (CC @QuentinBisson). As such, it seems we have more and more use cases that need such functionality. @alex-dabija @T-Kukawka is this somewhere on your radar as a general feature? Does it make sense to make a private ingress a default feature of each MC?
If 1 is yes (it works this way) and 2 is yes (we want it as default MC feature), what's the status on it? Is it already available somewhere? Will it be? We need something pretty soon, even if that's not the final solution. And yes, this feature might be needed to run zot for WEPA on the factories locations. We still don't know what WEPA's setup will be like, so we're not sure if we want per-factory-location there or maybe we want to run zot on every WC node. Again, CC @alex-dabija and @gawertm.
How complicated is this use case for using on-prem? How can we get it there? Again, think WEPA and the on-edge deployments (but again, maybe for WEPA we want to the cache to run on every edge node? Will WEPA run multiple clusters in a single edge location?) Still, for "reliably connected VMWare clusters, the same setup as for cloud would make a lot of sense - CC @gawertm and @vxav .

kopiczko commented 5 months ago

I'm with @stone-z here about per-WC registries.

When it comes to storage:

for daemonstats we pull all the images in all the nodes anyway so it's like one more node when it comes to storage
for images that are running only on some nodes - they probably won't be that useful (or reused) in other clusters anyway (e.g. when a different version of an image is used, or the app is not running in another cluster at all)
and there shouldn't be a lot of images in general per single WC, at least I think so, how much do we give for the docker volume? 50GB (and that includes empty-dir)?

Also when it comes to cost:

traffic between on-prem and azure (or any public trafic) costs money
loadbalancers cost money (in local registry we could use a service)

When it comes to operational costs:

yes, there is more registries to operate but:
there are no LBs
no ingresses
no connectivity issues between provides, networks etc.
smaller latency
no auth
no DDoS issues
maybe running dozen of smaller registries is more reliable than running a single big one
blast radius is smaller when per-WC registry goes down

Also, there a natural evolution path if per-WC registry solution is not enough which is moving to shared registry. But if it is enough, then it's way simpler. Also it will be easier to prove if it makes sense at all to go with shared registry given the complications.

LolloneS commented 5 months ago

+1 to what Pawel suggested.

Thinking about other big customers and the fact that we might/will want the cache to work for their components as well in the long term for cost savings, SHA-matching, etc., I'd strongly suggest taking the following into account:

multiple low environments running "ephemeral" nodes (spot instances) constantly pulling from the MC, hammering it
multiple environments, including low ones, running a ton of stuff, which is all cached in a central place

As far as Lukasz's points are concerned:

operational load, where we have N services on N WCs instead of 1 on MC
- comment: N services might be "easy" to maintain whereas a single, huge service might be a PITA (think about prometheus now). Number of services is not the only metric here. Also, if the single big service hammers the MC, we are just moving the operational load elsewhere.
operational cost, same as above: N times disk, cpu and mem, for very little, if any, gain
- comment: meh. Especially if caching takes layers into account, it's not like storing our images is going to cost a lot. also, "N times" only holds for our components, whereas customers run different images in different environments on the same MC anyway.
much lower cache efficiency, or even efficiency, as all the images have to be fetched from upstream anyway for each and every cluster
- I can't foresee any cache efficacy if a single instance of zot needs to store images from $list-of-big-WCs-running-in-a-very-big-MC, to be honest

alex-dabija commented 5 months ago

@piontec Are you assuming that the MC and its workload clusters are in the same region? This assumption is no longer valid because:

CAPI can manage clusters across multiple region;
we want to support multi provider MCs as part of our product;

To avoid that, we can enable auth on ingress, but we also want to avoid it like hellfire, as it will make us configure all the containerds of all nodes to use this auth token - the token distribution itself is a pain, but even worse, a potential secret leak would mean we have to reconfigure all the nodes of all clusters.

Our CAPI clusters can be configured with additional registries and their access tokens. We already do it (or did it) for docker.io. Rolling the nodes of all clusters can be mitigated in a few ways:

each cluster could get it's own token (only one cluster would need to be rolled);
we could add some support to our OS to replace the token without restarting the node (similar to what the kubelet is doing when it refreshes certificates). Personally, I'm in favor of using credentials anywhere they are supported.

Does it make sense to make a private ingress a default feature of each MC?

We don't have any plans at the moment to add something like VPC peering or a transit gateway in order to make a private ingress work for public CAPA clusters. Private clusters already mechanisms for communication between MC and WCs. CAPA uses a transit gateway and CAPZ uses endpoints.

alex-dabija commented 5 months ago

We still don't know what WEPA's setup will be like

They want the cluster to still be functional (I assume to at least be able to create pods) in case it loses connectivity to the Internet (and the MC). For me, this means that the caching would need to happen within the local environment and not on the remote Azure environment.

piontec commented 5 months ago

OK, I know that right now it's not doable, and won't be in the close future - maybe even never. Assuming we don't have this functionality and closing.

giantswarm / roadmap

Discussion: network security of installation-internal services #3369