projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.
https://projectcontour.io

First-write-wins conflict resolution for root hostnames #2189

Closed: rosenhouse closed this issue 6 months ago

rosenhouse commented 4 years ago

problem statement

As a cluster operator, I want to enable namespace admins to "claim" a root hostname for themselves in a fully self-service way, without my involvement. I also want to prevent them from creating conflicting claims. Instead, I'd prefer that the first HTTPProxy to "claim" a particular hostname "wins" and no others may be created for that root hostname.

Earlier Slack conversation about this

current behavior

possible solution

Admission controller?

stevesloka commented 4 years ago

OPA might also be a way to solve this. If something only allowed a single root for a given hostname to exist within the cluster, there wouldn't be any changes to make to Contour.

rosenhouse commented 4 years ago

Wow, good pointer! There is, in fact, an example in their documentation which points at this policy.

Would it make sense then for Contour to document an equivalent policy for the HTTPProxy resource?

rosenhouse commented 4 years ago

Ok, so after looking at this a bit, and a pointer from @cppforlife, I'm concerned that this usage of the OPA Admission Controller (as documented in my previous comment) would suffer from a race condition, in which two conflicting ingress resources submitted simultaneously could both be accepted.

The OPA maintainers seem to acknowledge this risk in a brief Slack chat we had. I opened a PR to the OPA docs to call this out.

I don't know how likely this issue is to show up in practice.

jchesterpivotal commented 4 years ago

I don't know how likely this issue is to show up in practice.

In a game against nature, it will show up rarely. In a game against motivated attackers, it will show up pretty quickly.

jpeach commented 4 years ago

Additional info from external email thread:

In OPA Slack, I (@rosenhouse) asked:

I'm looking at the docs where there's an example about preventing two Ingress resources from having the same hostname... ... It seems that this depends on the OPA agent maintaining a fully-consistent, global view of all Ingress resources across the cluster, and reading from that view synchronously when enforcing the policy. What happens if two resources with the same hostname are submitted to the apiserver simultaneously? Is the OPA policy agent a singleton that serializes the requests somehow?

Patrick East replied:

OPA will process requests in parallel. The data it uses to evaluate is, as @Ash mentioned, eventually consistent. So in the case you outlined I believe it could allow in the second, invalid ingress. I'd be curious to know if Gatekeeper has similar issues with policies that depend on external data.

So the race happens when 2 Ingress objects are posted and OPA evaluates them on different threads. Neither object is visible to the other's evaluation, so both are accepted. To me, this seems like a congenital problem for admission controllers that can only be solved by serializing the evaluation (and accepting the associated drawbacks).

rosenhouse commented 4 years ago

seems like a congenital problem for admission controllers

for admission controllers that rely on global state. For basic controllers that are simply validating contents of the single resource being admitted, it shouldn't be a problem.

that can only be solved by serializing the evaluation

yes, and by that singleton maintaining its own cache of the global state, or by fetching all the global state on every request it is validating. Maintaining the cache sounds better for performance. But deletes don't go through admission controllers, so it needs to be able to sync and invalidate in the background. Ugh.
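To make the idea concrete, here is a rough sketch in Go of the serialized, cache-backed check being described. Everything in it is hypothetical (it is not part of Contour, OPA, or any existing webhook), and a real implementation would wrap it in an AdmissionReview handler and still need the background resync for deletes mentioned above.

```go
// Sketch of a first-writer-wins check for root hostnames, assuming a single
// (singleton) validating component that keeps its own cache of claims.
package main

import (
	"fmt"
	"sync"
)

// claimTable is the singleton's own cache of the global state: which
// namespace currently owns each root FQDN.
type claimTable struct {
	mu     sync.Mutex
	owners map[string]string // fqdn -> owning namespace
}

// Admit serializes evaluation with a mutex, so two simultaneous requests for
// the same FQDN cannot both be accepted: the first writer wins.
func (c *claimTable) Admit(fqdn, namespace string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if owner, ok := c.owners[fqdn]; ok {
		return owner == namespace // already claimed; only the owner may reuse it
	}
	c.owners[fqdn] = namespace
	return true
}

// Release would be driven by a background watch on deletes, since the
// admission path cannot be relied on to see them.
func (c *claimTable) Release(fqdn string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.owners, fqdn)
}

func main() {
	t := &claimTable{owners: map[string]string{}}
	fmt.Println(t.Admit("app.example.com", "team-a")) // true: first writer wins
	fmt.Println(t.Admit("app.example.com", "team-b")) // false: conflicting claim rejected
}
```

The mutex is what buys first-writer-wins; the cost is that every admission request for a root hostname serializes through this one component, which is exactly the drawback jpeach flags above.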

youngnick commented 4 years ago

I think there's a tension between wanting to ensure that people can't clobber other people's config (which was one of the design pillars for HTTPProxy), and wanting the system to be entirely self-service.

When we designed HTTPProxy, the thing we were worried about was, effectively, security: security of the TLS keypair for a domain name, which we assumed would need to be controlled to some extent (hence the TLSDelegation resource), and security of the domain name itself, to prevent accidental configuration clobbering.

One of the problems with the Ingress object is about ordering - it's not really defined which order an ingress controller will process resources, so you can get odd behaviour when there is overlapping configuration.

Part of HTTPProxy's design was about avoiding that, which is why we ended up with a difference between the root HTTPProxy and child HTTPProxies, and the --root-namespaces flag.

I think that the problem is solvable, but needs careful thought because of the distributed, eventually-consistent nature of the Kubernetes API. Implementing first-writer-wins semantics for FQDNs will require, effectively, a distributed lock on that domain name to avoid the problems listed above.

It's possible that could be done with an annotation (like in leader election), or with the new Lease object if it's available (only since 1.16, I think). However, that violates one of the internal rules we have when building Contour's data model: not changing the Kubernetes cache. So, again, careful thought required.
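For the Lease idea specifically, a minimal sketch of a first-writer-wins claim, assuming client-go and the coordination.k8s.io/v1 API. The namespace and function names are illustrative, this is not something Contour implements, and it ignores Lease renewal, expiry, and the cache-mutation rule mentioned above.

```go
// Sketch: claim a root FQDN by creating a Lease named after it.
package main

import (
	"context"
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// claimFQDN tries to create a Lease named after the FQDN in a well-known
// namespace. Create is atomic on the apiserver side, so the first writer
// wins and every later claimant gets an AlreadyExists error.
// (Assumes the FQDN is a valid object name, e.g. no wildcards.)
func claimFQDN(ctx context.Context, client kubernetes.Interface, fqdn, holder string) (bool, error) {
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fqdn,              // e.g. "app.example.com"
			Namespace: "hostname-claims", // hypothetical namespace for claims
		},
		Spec: coordinationv1.LeaseSpec{HolderIdentity: &holder},
	}
	_, err := client.CoordinationV1().Leases(lease.Namespace).Create(ctx, lease, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return false, nil // someone else claimed this hostname first
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	// Wiring up a real clientset is omitted; claimFQDN is the interesting part.
	fmt.Println("claimFQDN returns true only for the first writer of a given FQDN")
}
```

Because Create is atomic on the apiserver, this sidesteps the admission-controller race above, but it turns hostname ownership into extra state that something still has to garbage-collect when an HTTPProxy goes away.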

davecheney commented 4 years ago

To echo what Nick wrote, the design of the --root-namespaces feature is intended to give an administrator control over which HTTPProxy document gloms onto which vhost.

The pattern whereby Contour is restricted to looking for roots in a specific set of namespaces separates the properties of the vhost (which is really just the hostname and TLS details) from the route and tcpproxy parameters: someone (I'll come back to this) places a stub HTTPProxy document in the root namespace and delegates all routes to a document in a different namespace. This gives Contour a property which k8s cannot guarantee by itself: that a document at a specific name/namespace has sole ownership of the routes for a vhost.

Rather than OPA or writing an admission controller, I think it would be more straightforward, given that in your environment you know the ownership of each vhost, to write a pair of HTTPProxy records: one in the root namespace delegating to the user's namespace, the other in the user's namespace. Perhaps the second is optional; you might just tell users that they need to create an HTTPProxy document at a specific name/namespace and it will be connected to a specific vhost.
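A minimal sketch of that pair, expressed with the projectcontour.io/v1 Go API types; the names, namespaces, and service below are illustrative only.

```go
// Sketch only: the root/child HTTPProxy pair described above.
package main

import (
	"fmt"

	contourv1 "github.com/projectcontour/contour/apis/projectcontour/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// rootProxy lives in a namespace covered by --root-namespaces. It claims the
// vhost (fqdn) and delegates all routes to the team's namespace.
var rootProxy = contourv1.HTTPProxy{
	ObjectMeta: metav1.ObjectMeta{Name: "app-example-com", Namespace: "ingress-roots"},
	Spec: contourv1.HTTPProxySpec{
		VirtualHost: &contourv1.VirtualHost{Fqdn: "app.example.com"},
		Includes:    []contourv1.Include{{Name: "app", Namespace: "team-a"}},
	},
}

// childProxy lives in the team's namespace and carries only routes; without a
// virtualhost it cannot claim a hostname on its own.
var childProxy = contourv1.HTTPProxy{
	ObjectMeta: metav1.ObjectMeta{Name: "app", Namespace: "team-a"},
	Spec: contourv1.HTTPProxySpec{
		Routes: []contourv1.Route{
			{Services: []contourv1.Service{{Name: "app-svc", Port: 80}}},
		},
	},
}

func main() {
	fmt.Printf("%s is served by %s/%s via root %s/%s\n",
		rootProxy.Spec.VirtualHost.Fqdn,
		childProxy.Namespace, childProxy.Name,
		rootProxy.Namespace, rootProxy.Name)
}
```

In YAML form this is the usual root/child pair: the root carries the virtualhost and an includes entry, while the child carries only routes.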

Thanks

Dave


rosenhouse commented 4 years ago

given that in your environment you know the ownership of each vhost

But I don't know.

The cluster operator doesn't know, and doesn't want to be bothered with the details of which vhosts are claimed by which namespace users. They just want to set up the shared ingress proxy once, and get out of the way.

davecheney commented 4 years ago

They just want to set up the shared ingress proxy once, and get out of the way.

Well, if they want to set up ingress as a global resource and let their users fight it out in the cluster, maybe that'll work, maybe it won't. What is clear in the market is that k8s clusters are getting smaller, so maybe this approach will work. If the cluster operator is trying to offer a k8s cluster as a shared heterogeneous resource, then probably some extra tooling on top of Contour will be required.

github-actions[bot] commented 7 months ago

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

You can:

Please send feedback to the #contour channel in the Kubernetes Slack
