
Discussion: network security of installation-internal services #3369

Closed - piontec closed this issue 5 months ago

piontec commented 5 months ago

We want to run an installation-wide service on an MC. This service, to be specific, is zot, a container image cache. We want this service to be accessible from the MC and from all WCs of this installation, but not from anywhere else. One possible solution is to expose it on our public Ingress. Still, by making it public, we risk the service and its resources being DoS'ed. To avoid that, we could enable auth on the ingress, but we want to avoid that like hellfire, as it would force us to configure containerd on every node to use the auth token - the token distribution itself is a pain, but even worse, a potential secret leak would mean reconfiguring all the nodes of all clusters.

As such, our idea is to make this service available without any auth, but restricted at the networking level to the installation itself. A rough sketch of what that could look like is below.
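
As a minimal sketch of "restricted at the networking level" on AWS (assuming the AWS Load Balancer Controller is available; the namespace, labels, and CIDR are placeholders, not our actual setup):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: zot
  namespace: zot   # hypothetical
  annotations:
    # provision an internal, VPC-only NLB instead of a public one
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
spec:
  type: LoadBalancer
  selector:
    app: zot
  ports:
    - port: 443
      targetPort: 5000   # zot's default listen port
  # additionally restrict which source CIDRs may connect at all
  loadBalancerSourceRanges:
    - 10.0.0.0/8   # placeholder for the MC and WC VPC/node ranges
```

The same effect could also be achieved with security groups instead of `loadBalancerSourceRanges`; the point is that nothing needs to be enforced by the application itself.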

(ping @stone-z and @allanger for visibility and shared interest)

Now, here are some of the questions that come to mind:

  1. I already learned we have a feature called "private ingress" - an ingress that is internal to the VPC the installation runs in (see the sketch after this list). It is available for AWS and should soon be available for Azure. @kopiczko @erkanerol, please correct me if I'm talking bullshit or if this feature doesn't work like this at all and won't help our use case.

  2. I think this use case is much broader. Atlas could use something like this, but currently they have workarounds that distribute access credentials (CC @QuentinBisson). As such, it seems we have more and more use cases that need such functionality. @alex-dabija @T-Kukawka, is this somewhere on your radar as a general feature? Does it make sense to make a private ingress a default feature of each MC?

  3. If 1 is yes (it works this way) and 2 is yes (we want it as a default MC feature), what's the status? Is it already available somewhere? Will it be? We need something pretty soon, even if it's not the final solution. And yes, this feature might be needed to run zot for WEPA at the factory locations. We still don't know what WEPA's setup will be like, so we're not sure whether we want a per-factory-location cache there or whether we want to run zot on every WC node. Again, CC @alex-dabija and @gawertm.

  4. How complicated does this use case get on-prem? How can we get it there? Again, think WEPA and the on-edge deployments (but again, maybe for WEPA we want the cache to run on every edge node? Will WEPA run multiple clusters in a single edge location?). Still, for "reliably connected" VMware clusters, the same setup as for the cloud would make a lot of sense - CC @gawertm and @vxav.
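
Regarding 1, a hedged sketch of what I assume a "private ingress" amounts to on AWS (not necessarily how our feature is actually implemented; class names, hostnames, and namespaces below are made up): a second ingress controller whose Service gets an internal load balancer, so everything behind it is reachable only from inside the VPC.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-internal-controller
  namespace: ingress-internal   # hypothetical
  annotations:
    # legacy in-tree AWS annotation; with the AWS Load Balancer
    # Controller this would be ...aws-load-balancer-scheme: internal
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx-internal
  ports:
    - name: https
      port: 443
      targetPort: https
---
# zot is then published through the internal class only
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: zot
  namespace: zot
spec:
  ingressClassName: nginx-internal   # hypothetical internal class
  rules:
    - host: zot.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: zot
                port:
                  number: 5000
```

Note that an internal load balancer is only reachable from inside the same VPC (or networks peered with it), which is exactly what makes questions 3 and 4 interesting.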

kopiczko commented 5 months ago

I'm with @stone-z here about per-WC registries.

When it comes to storage:

Also when it comes to cost:

When it comes to operational costs:

Also, there is a natural evolution path if the per-WC registry solution turns out not to be enough: moving to a shared registry. But if it is enough, it's way simpler. It will also be easier to prove whether going with a shared registry makes sense at all, given the complications.

LolloneS commented 5 months ago

+1 to what Pawel suggested.

Thinking about other big customers, and the fact that in the long term we might (or will) want the cache to work for their components as well (cost savings, SHA matching, etc.), I'd strongly suggest taking the following into account:

As far as Lukasz's points are concerned:

alex-dabija commented 5 months ago

@piontec Are you assuming that the MC and its workload clusters are in the same region? This assumption is no longer valid because:

> To avoid that, we could enable auth on the ingress, but we want to avoid that like hellfire, as it would force us to configure containerd on every node to use the auth token - the token distribution itself is a pain, but even worse, a potential secret leak would mean reconfiguring all the nodes of all clusters.

Our CAPI clusters can be configured with additional registries and their access tokens (a sketch of what that looks like is at the end of this comment). We already do it (or did it) for docker.io. Rolling the nodes of all clusters can be mitigated in a few ways:

> Does it make sense to make a private ingress a default feature of each MC?

We don't have any plans at the moment to add something like VPC peering or a transit gateway in order to make a private ingress work for public CAPA clusters. Private clusters already have mechanisms for communication between the MC and WCs: CAPA uses a transit gateway and CAPZ uses endpoints.
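
For reference, a hedged sketch of what "additional registries and their access tokens" can look like on a kubeadm-based CAPI cluster: the containerd mirror config is shipped to every node as a file in the bootstrap config. Hostnames, paths, and the token below are illustrative placeholders, not our actual configuration, and this assumes containerd is set up with `config_path = "/etc/containerd/certs.d"`.

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: workers-with-zot-mirror   # hypothetical name
spec:
  template:
    spec:
      files:
        # containerd consults this file for pulls of docker.io images
        - path: /etc/containerd/certs.d/docker.io/hosts.toml
          permissions: "0600"
          content: |
            # fall back to the upstream registry if the mirror is down
            server = "https://registry-1.docker.io"

            [host."https://zot.example.internal"]
              capabilities = ["pull", "resolve"]
              # the painful part when auth is enabled: a shared token
              # baked into every node of every cluster
              [host."https://zot.example.internal".header]
                Authorization = ["Bearer <token>"]
```

Since CAPI delivers these files through bootstrap data, changing the token on existing nodes means rolling them, which is the leak scenario described in the issue.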

alex-dabija commented 5 months ago

> We still don't know what WEPA's setup will be like

They want the cluster to still be functional (I assume at least able to create pods) in case it loses connectivity to the Internet (and the MC). For me, this means that the caching would need to happen within the local environment and not in the remote Azure environment.

piontec commented 5 months ago

OK, I know that right now this is not doable, and it won't be in the near future - maybe never. Assuming we don't get this functionality and closing.