sablierapp / sablier

Start your containers on demand, shut them down automatically when there's no activity. Docker, Docker Swarm Mode and Kubernetes compatible.
https://sablierapp.dev/
GNU Affero General Public License v3.0

Kubernetes provider with blocking strategy fails first request #131

Closed abatilo closed 1 year ago

abatilo commented 1 year ago

Describe the bug

If I set my Traefik middleware to use Sablier with the blocking strategy, the first request to my application returns a 503 because Traefik doesn't see any available endpoints for the application.

Context

Expected behavior

I expect Sablier to wait until my pods are actually ready before it forwards requests to the downstream containers.

Additional context

Here's what my middleware definition looks like.

---
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: friendlyfaces-sablier-blocking
spec:
  plugin:
    sablier:
      names: deployment_default_friendlyfaces_1
      sablierUrl: 'http://sablier.kube-system:10000'
      sessionDuration: 15m
      blocking:
        default-timeout: 1m

abatilo commented 1 year ago

I'm 99% sure I understand what's happening here. You're marking the session as ready as soon as the number of Replicas == DesiredReplicas.

The problem is that there's a race condition between the number of ready replicas and the Kubernetes endpoint controller registration.

When a pod is marked as ready, the endpoint controller is notified asynchronously that there's a pod to register. Sablier reports the session as ready and Traefik continues forwarding the request, but in some cases the endpoint controller has not yet registered the new pod IP addresses. When Traefik then looks up the endpoints of the Kubernetes service, there aren't any, so Traefik returns the 503.

abatilo commented 1 year ago

Okay, this doesn't seem to be correct after all. I did a lot of experimenting, including setting publishNotReadyAddresses to true on the service so that IP addresses were available immediately. It almost seems like Traefik snapshots the available endpoints when the request is first made and then never updates them. If I look at the Traefik debug logs or send requests to the Traefik management API, there are definitely "servers" available while the requests are still returning a 503, so I'm not sure what that's all about.
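
For reference, the publishNotReadyAddresses experiment described above could look roughly like the following Service. This is only a sketch: the service name and namespace are inferred from the `deployment_default_friendlyfaces_1` entry in the middleware, and the selector label and port numbers are hypothetical. As the comment above notes, this setting alone did not make the 503 go away.

```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: friendlyfaces          # assumed: matches the deployment named in the middleware
  namespace: default           # assumed: taken from deployment_default_friendlyfaces_1
spec:
  # Publish pod IPs in the service endpoints even before the pods pass readiness checks.
  publishNotReadyAddresses: true
  selector:
    app: friendlyfaces         # hypothetical label
  ports:
    - port: 80                 # hypothetical service port
      targetPort: 8080         # hypothetical container port
```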

acouvreur commented 1 year ago

Hi @abatilo, this is a known issue; check out #62. I've previously tried waiting for an endpoint with an IP but reached the same conclusion as you did.

You can try implementing the suggestions proposed in the comments.

abatilo commented 1 year ago

Oh, so sorry about that @acouvreur. Let me close this since it's a dupe of #62.