sablierapp / sablier

Start your containers on demand, shut them down automatically when there's no activity. Docker, Docker Swarm Mode and Kubernetes compatible.
https://sablierapp.dev/
GNU Affero General Public License v3.0

Possible issues with Kubernetes and Traefik Proxy v3 #399

Open jmuleiro opened 1 month ago

jmuleiro commented 1 month ago

Describe the bug

First of all, I appreciate the effort you've put into this project. It's helped a lot; I consider it to be a mini Knative.

With that out of the way, here's my issue: I recently upgraded Traefik to version v3.0.4, and since then we've been noticing weird behavior with Sablier. Essentially, sometimes the middleware just doesn't work. Instead of forwarding the client to the loading screen (I'm using the dynamic strategy), it forwards the request directly to the Service. Then, because there are no available replicas and Traefik's allowEmptyServices is set to true, Traefik just returns 503 Service Unavailable with a "no service available" body.
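For reference, allowEmptyServices is set in Traefik's static configuration; here's a minimal sketch for the Kubernetes CRD provider (the exact provider block depends on how Traefik is installed):

providers:
  kubernetesCRD:
    # Keep routers whose Service has zero ready endpoints instead of dropping
    # them from the dynamic configuration; Traefik then answers 503
    # "no service available" until an endpoint becomes ready.
    allowEmptyServices: true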

I've been debugging for a short while, and one thought I had was that maybe Sablier was failing to update the sessions in memory. So I tried restarting the deployment and, surprisingly, this made everything go back to normal for a while. But then the same thing happened again.

Another weird thing I noticed is that even though I'm using a PersistentVolume to store Sablier's state, the container only reads the state file on startup and never touches it again. I was able to cat the file and noticed that even active deployments/sessions had incorrect expiresAt timestamps.

I've set the log level to trace, but I didn't see any helpful logs either. Here's my configuration:

# ...
      containers:
      - name: main
        image: acouvreur/sablier:1.8.0-beta.21
        args:
          - start
          - --provider.name=kubernetes
          - --storage.file=/data/sablier.json
          - --sessions.default-duration=30m
          - --sessions.expiration-interval=5m
          - --strategy.dynamic.custom-themes-path=/themes
          - --strategy.dynamic.default-theme=psh
          - --server.port=10000
          #* Logging levels: panic, fatal, error, warn, info, debug, trace
          - --logging.level=trace
        volumeMounts:
          - name: data
            mountPath: /data
          - name: themes
            mountPath: /themes
        readinessProbe:
          httpGet:
            path: /health
            port: 10000
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 1
          successThreshold: 1
          failureThreshold: 5
        resources:
          requests:
            memory: 64Mi
            cpu: "50m"
          limits:
            memory: "128Mi"
        ports:
        - containerPort: 10000
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: sablier-data
        - name: themes
          configMap:
            name: sablier-custom-themes
            items:
              - key: psh.html
                path: psh.html

Sablier Service (in Traefik's namespace):

apiVersion: v1
kind: Service
metadata:
  name: sablier
  namespace: traefik
  labels:
    app.kubernetes.io/component: sablier
    app.kubernetes.io/instance: sablier-0
    app.kubernetes.io/version: 1.7.0
    app.kubernetes.io/name: sablier
    app.kubernetes.io/part-of: traefik
spec:
  selector:
    app.kubernetes.io/instance: sablier-0
  ports:
  - port: 10000
    targetPort: 10000
    protocol: TCP

And here's a sample Traefik Middleware CRD:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: dummy-project-depl-sablier
  namespace: test-devops
spec:
  plugin:
    sablier:
      dynamic:
        displayName: dummy-project-depl
        refreshFrequency: 10s
        showDetails: true
        theme: psh
      names: deployment_test-devops_dummy-project-depl-node-deployment_1
      sablierUrl: http://sablier:10000
      sessionDuration: 15m
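
For reference (not my actual manifest), the IngressRoute that attaches this middleware looks roughly like the sketch below; the entry point, hostname, and backend Service names here are placeholders:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: dummy-project-depl
  namespace: test-devops
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`dummy-project.example.com`)   # placeholder hostname
      kind: Rule
      middlewares:
        - name: dummy-project-depl-sablier
          namespace: test-devops
      services:
        - name: dummy-project-depl-node          # placeholder ClusterIP Service
          port: 80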

Traefik logs do not show any clear errors - I only see normal request logs that return HTTP 200.


Expected behavior

Being forwarded to Sablier's waiting screen by the dynamic strategy.


acouvreur commented 1 month ago

The state file is only used to pick sessions back up on restart, so that they can persist across restarts.

However, the behavior you are describing looks like there's a race between sablier and traefik.

Sablier checks the readiness of your configured deployments, just as Traefik does before serving them. If Sablier picks up the readiness before Traefik, it will consider them healthy before Traefik does, which leads to the 503.

If your deployment does not have any healthcheck, that could also cause such an issue.
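
For illustration, something along these lines on the target deployment's container gives both Traefik and Sablier the same readiness signal to act on (the endpoint and port are hypothetical):

# sketch only: readinessProbe on the target deployment's container
readinessProbe:
  httpGet:
    path: /healthz   # hypothetical health endpoint
    port: 3000       # hypothetical container port
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3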

I think we should align the check frequency with the traefik refresh rate.

jmuleiro commented 1 month ago

Thank you for the quick response.

The deployment I'm having issues with does have a readiness probe. But perhaps I haven't been clear - when this intermittent issue happens, the deployment isn't scaled up at all. You'll constantly get 503 Service Unavailable until you restart Sablier. Only then does Sablier detect that the deployment has zero replicas and redirect you to the loading page.

When I was first informed of this issue, I refreshed the page many times before restarting Sablier. Once I restarted it, the middleware went back to normal.

acouvreur commented 1 month ago

You'll constantly get 503 Service Unavailable until you restart Sablier

That makes sense; it means sablier stored the session as active while in fact it is not.

Normally Sablier responds to external events shutting down the targeted deployment, meaning that if you shut down the target deployment yourself, sablier will no longer consider the session active.

So maybe the tracked deployment is not the one you are currently being routed to?

Can you provide more information on the actual deployment_test-devops_dummy-project-depl-node-deployment_1 instance?

How is it routed through Traefik, and how is it configured with Sablier?
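
Assuming Sablier's Kubernetes naming convention of kind_namespace_name_replicas, that value should correspond to a deployment roughly like the sketch below (labels and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-project-depl-node-deployment
  namespace: test-devops
spec:
  replicas: 1                        # scaled between 0 and 1 by Sablier
  selector:
    matchLabels:
      app: dummy-project-depl        # placeholder label
  template:
    metadata:
      labels:
        app: dummy-project-depl      # placeholder label
    spec:
      containers:
        - name: app
          image: example/dummy-project:latest   # placeholder image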

jmuleiro commented 1 month ago

Traefik is exposed publicly via a Google Compute Engine load balancer. Two Traefik IngressRoute CRDs exist for each deployment: an HTTP one with a redirectScheme Traefik middleware (alongside two others for rate limiting), and an HTTPS one with rate limiting, an OAuth2 proxy middleware, and, last in the middleware chain, Sablier.
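
So the HTTPS route's middleware list is ordered roughly like this (the names other than the Sablier one are placeholders):

middlewares:
  - name: rate-limit                   # placeholder
  - name: oauth2-proxy                 # placeholder
  - name: dummy-project-depl-sablier   # Sablier last in the chain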

None of these middlewares should be the issue here, since I know everything works properly up until Sablier comes in - when I tested the endpoint in an incognito session (without the OAuth2 cookie), I was prompted to log in. The IngressRoute points to a ClusterIP service. I checked Traefik's dashboard and every router and middleware looks OK.

I can't think of any external agents messing with the deployment replicas either. No CI deployments were made today (CI scales replicas to 0, but normally Sablier has no issue with this), the daily CronJob that scales pods to zero runs at night... We use Compute Engine spot instances for the deployments, which sometimes causes deployments to halt suddenly, but I reviewed metrics, and the node wasn't shut down when this first happened today. It may be worth noting that when the node was shut down by GCP a few hours ago, I got the same no service available message until the pod started running on the new spot node. Traefik and Sablier run on regular virtual machines though, so not preemptible.

On a side note, while debugging I tried the latest beta version with the trace log level, and noticed the "expiring" log is printed multiple times in a row. Maybe it's something, maybe not. I saw a similar behavior with Status:ready too:

ime="2024-09-12T21:25:54Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
time="2024-09-12T21:26:04Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
time="2024-09-12T21:26:05Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
time="2024-09-12T21:26:14Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
time="2024-09-12T21:26:15Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"

acouvreur commented 1 month ago

On a side note, while debugging I tried the latest beta version with the trace log level, and noticed the "expiring" log is printed multiple times in a row. Maybe it's something, maybe not. I saw a similar behavior with Status:ready too:

When the debug log level is activated, this is printed every time a request is properly received; it helps you understand when a session's duration was last updated.


By any chance, how many replicas of Sablier do you have? It should not really matter, but it could lead to weird behavior, as this is not yet ready for clustering.

jmuleiro commented 1 month ago

Oh alright, I didn't know about that.

Sablier is running a single replica, always has been. I'm signing off for today, but let me know if you come up with any ideas. I'll post here if the issue happens again. Another thought - could this have to do with Traefik's ServersTransport CRD, or whatever a servers transport is? I just updated the CRD to the v3.0 version.