Traefik Mesh - Simpler Service Mesh
https://traefik.io/traefik-mesh
Apache License 2.0

Controller should check for orphan shadow services #563

Open kevinpollet opened 4 years ago

kevinpollet commented 4 years ago

Feature Request

Proposal

To make sure that shadow services exist for services created before Maesh installation, or while the controller was down, the controller should, at startup, list the existing services and create any missing shadow services.

A nice addition would be to verify that every shadow service still has a corresponding user service, so that there are no orphans, e.g. when a service is deleted during controller downtime.
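The two checks above (create missing shadow services, flag orphans) can be sketched as a single reconciliation pass. This is an illustrative sketch over plain string sets, not Maesh's actual code; `shadowName` and `reconcile` are hypothetical helpers, and the `maesh-` prefix is an assumption.

```go
package main

import "fmt"

// shadowName derives a shadow service name from a user service name.
// The "maesh-" prefix is an assumption for this sketch.
func shadowName(svc string) string {
	return "maesh-" + svc
}

// reconcile compares user services against existing shadow services and
// returns the shadow services to create and the orphans to delete.
func reconcile(userSvcs, shadowSvcs map[string]bool) (create, orphans []string) {
	wanted := make(map[string]bool, len(userSvcs))
	for svc := range userSvcs {
		wanted[shadowName(svc)] = true
		if !shadowSvcs[shadowName(svc)] {
			create = append(create, shadowName(svc))
		}
	}
	for shadow := range shadowSvcs {
		if !wanted[shadow] {
			orphans = append(orphans, shadow)
		}
	}
	return create, orphans
}

func main() {
	users := map[string]bool{"whoami": true}
	shadows := map[string]bool{"maesh-old": true}
	create, orphans := reconcile(users, shadows)
	fmt.Println("create:", create, "orphans:", orphans)
	// create: [maesh-whoami] orphans: [maesh-old]
}
```

Running this once at startup would cover both the "service created before installation" and the "service deleted during downtime" cases discussed above.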

dtomcej commented 4 years ago

In the past there was a poor man's "audit" that ran on a timer to verify that all services were matched up and, if a shadow service was missing, triggered a new create event for it.

Did we want to go this route?

kevinpollet commented 4 years ago

I think doing this check at startup is enough, and it may not even be necessary if the shadow service manager is idempotent (#562). This will ensure that orphan services do not disrupt the system.

One problem with orphan shadow services is that they hold entry-point ports for nothing.

kevinpollet commented 4 years ago

Just a thought: maybe we could use ownerReferences to make sure that shadow services are properly deleted, wdyt?
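For reference, an ownerReference on a shadow service would look roughly like the manifest below. All names and the UID are placeholders, and note that, per the Kubernetes garbage-collection rules, the owner would have to live in the same namespace as the dependent:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: maesh-whoami          # hypothetical shadow service name
  namespace: whoami           # owner and dependent must share a namespace
  ownerReferences:
    - apiVersion: v1
      kind: Service
      name: whoami            # the user service that owns this shadow
      uid: 00000000-0000-0000-0000-000000000000   # placeholder UID
spec:
  ports:
    - port: 80
```

With such a reference in place, deleting the `whoami` service would cause the garbage collector to delete `maesh-whoami` automatically.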

dtomcej commented 4 years ago

oooo That is a neat idea. Letting kubernetes handle it automatically? I like it.

kevinpollet commented 4 years ago

Yep, that's the idea. If a shadow service lists a user service in its ownerReferences, it will be garbage collected when the user service is removed. But we still have to keep the TCP and UDP tables in sync to avoid port leaks. Maybe it will help once the TCP and UDP tables are computed at startup.

Btw, I think we should try to handle orphan shadow services gracefully to avoid port leaks, which would disrupt the system.
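The port-leak concern above can be made concrete with a tiny sketch: shadow services hold entry-point ports taken from a TCP/UDP table, and those ports must be released when the shadow service goes away. `PortTable` and its methods are made-up names for illustration, not Maesh's implementation.

```go
package main

import "fmt"

// PortTable hands out entry-point ports to shadow services.
type PortTable struct {
	next  int32
	ports map[string]int32 // shadow service name -> allocated port
}

func NewPortTable(first int32) *PortTable {
	return &PortTable{next: first, ports: make(map[string]int32)}
}

// Acquire returns the port mapped to svc, allocating a new one if needed.
func (t *PortTable) Acquire(svc string) int32 {
	if p, ok := t.ports[svc]; ok {
		return p
	}
	p := t.next
	t.next++
	t.ports[svc] = p
	return p
}

// Release frees svc's port. If a shadow service is garbage collected
// without anyone calling Release, the port leaks: this is the bookkeeping
// that ownerReferences alone cannot take care of.
func (t *PortTable) Release(svc string) bool {
	if _, ok := t.ports[svc]; !ok {
		return false
	}
	delete(t.ports, svc)
	return true
}

func main() {
	table := NewPortTable(5000)
	fmt.Println(table.Acquire("maesh-whoami")) // 5000
	table.Release("maesh-whoami")
	fmt.Println(table.Acquire("maesh-api")) // 5001 (this sketch never reuses freed ports)
}
```

The controller sits between Kubernetes events and this table, which is why deletion events arriving after the shadow service is already gone (as described below) are a problem.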

jspdown commented 4 years ago

@kevinpollet I did a test with the ownerReferences property. It works great, but there's a side effect: we rely on service deletion events to update the state tables, but by the time we receive such an event the shadow service may already have been deleted, so we don't know which mapping should be removed. There are solutions though:

The first solution is the easiest to implement and the least disruptive. The second adds one more moving piece to the system and introduces some delay. The third simplifies port mapping management but has a cost: we would have to build this table before each build of the topology.

I tend to prefer the 3rd option, or possibly the 1st. @kevinpollet, @dtomcej, any opinion on this?

dtomcej commented 4 years ago

@jspdown I think that if we use owner references, we lose a bit of control over the deletion events. We have to just "trust" that things will be handled properly by k8s, and we would have to run an "audit" loop to verify that k8s has, in fact, removed the objects it was supposed to.

However, isn't the goal to remove the persistent state table entirely, and use the shadow services as the source of truth?

jspdown commented 4 years ago

Using ownerReferences is not an option for us: shadow services and user services live in different namespaces. The Kubernetes documentation on garbage collection says:

Cross-namespace owner references are disallowed by design. This means:

  1. Namespace-scoped dependents can only specify owners in the same namespace, and owners that are cluster-scoped.
  2. Cluster-scoped dependents can only specify cluster-scoped owners, but not namespace-scoped owners.

https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/
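The two rules quoted above can be encoded as a small predicate, which makes it clear why the Maesh case (a namespaced shadow service owned by a user service in another namespace) falls outside them. `Ref` and `ValidOwner` are illustrative names for this sketch, not client-go API.

```go
package main

import "fmt"

// Ref identifies a Kubernetes object for the purpose of the GC rules.
type Ref struct {
	Namespace     string // ignored when ClusterScoped is true
	ClusterScoped bool
}

// ValidOwner reports whether dependent may list owner in its
// ownerReferences under the garbage-collection rules quoted above.
func ValidOwner(dependent, owner Ref) bool {
	if dependent.ClusterScoped {
		// Rule 2: cluster-scoped dependents can only have cluster-scoped owners.
		return owner.ClusterScoped
	}
	// Rule 1: namespace-scoped dependents can have cluster-scoped owners
	// or owners in their own namespace.
	return owner.ClusterScoped || owner.Namespace == dependent.Namespace
}

func main() {
	shadow := Ref{Namespace: "maesh"}  // hypothetical shadow service namespace
	user := Ref{Namespace: "whoami"}   // user service in its own namespace
	fmt.Println(ValidOwner(shadow, user)) // false: cross-namespace owners are disallowed
}
```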