jmuleiro opened this issue 2 months ago
The state is only read on startup, so that sessions can persist across restarts.
However, the behavior you are describing looks like there's a race between sablier and traefik.
Sablier checks the readiness of your configured deployments, as does Traefik before serving them. If Sablier picks up readiness before Traefik does, it will consider them healthy first, which leads to a 503.
If your deployment does not have any healthcheck, that could also cause such an issue.
I think we should align the check frequency with the Traefik refresh rate.
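For reference, here is a minimal sketch of a Deployment with the kind of readiness probe both Sablier and Traefik rely on; every name, image, and path below is hypothetical and not taken from this issue:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: nginx:1.27    # placeholder image
          ports:
            - containerPort: 80
          readinessProbe:      # both Sablier and Traefik wait on pod readiness
            httpGet:
              path: /          # placeholder path
              port: 80
            initialDelaySeconds: 2
            periodSeconds: 5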
Thank you for the quick response.
The deployment I'm having issues with does have a readiness probe. But perhaps I haven't been clear: when this intermittent issue happens, the deployment isn't updated at all. You'll constantly get 503 Service Unavailable until you restart Sablier. Only then does Sablier detect that the deployment has zero replicas and redirect you to the loading page.
When I was first informed of this issue, I refreshed the page a lot before restarting Sablier. Once I restarted it, the middleware went back to normal.
You'll constantly get 503 Service Unavailable until you restart Sablier
That makes sense; it means Sablier stored the session as active while in fact it is not.
Normally Sablier responds to external events that shut down the targeted deployment, meaning that if you shut down the target deployment yourself, Sablier will no longer consider the session active.
So maybe the tracked deployment is not the one you are currently being routed to?
Can you provide more information on the actual deployment_test-devops_dummy-project-depl-node-deployment_1 instance? How is it routed through Traefik, and how is it configured with Sablier?
Traefik is exposed publicly via a Google Compute Engine load balancer. Two Traefik IngressRoute CRDs exist for each deployment: an HTTP one with a redirectScheme middleware (alongside two others for rate limiting), and an HTTPS one with rate limiting, an OAuth2 proxy middleware, and Sablier last in the middleware chain (sketched below).
None of these middlewares should be the issue here, since I know everything works properly up until Sablier comes in: when I tested the endpoint in an incognito session (without the OAuth2 cookie), I was prompted to log in. The IngressRoute points towards a ClusterIP service. I checked Traefik's dashboard and every router and middleware looks OK.
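A rough sketch of what such an HTTPS IngressRoute could look like; the resource names, host, and port below are hypothetical, and only the middleware ordering reflects the description above:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: dummy-project-https            # hypothetical name
  namespace: test-devops
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`dummy.example.com`) # placeholder host
      kind: Rule
      middlewares:
        - name: rate-limit             # hypothetical middleware names
        - name: oauth2-proxy
        - name: sablier                # Sablier last in the chain
      services:
        - name: dummy-project-svc      # the ClusterIP Service
          port: 80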
I can't think of any external agents messing with the deployment replicas either. No CI deployments were made today (CI scales replicas to 0, but normally Sablier has no issue with this), and the daily CronJob that scales pods to zero runs at night.

We use Compute Engine spot instances for the deployments, which sometimes causes deployments to halt suddenly, but I reviewed the metrics and the node wasn't shut down when this first happened today. It may be worth noting that when the node was shut down by GCP a few hours ago, I got the same "no service available" message until the pod started running on the new spot node. Traefik and Sablier run on regular (non-preemptible) virtual machines, though.
On a side note, while debugging I tried the latest beta version with the trace log level, and noticed the "expiring" log is printed multiple times in a row. Maybe it's something, maybe not. I saw a similar behavior with Status:ready too:
ime="2024-09-12T21:25:54Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
time="2024-09-12T21:26:04Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
time="2024-09-12T21:26:05Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
time="2024-09-12T21:26:14Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
time="2024-09-12T21:26:15Z" level=debug msg="expiring {Name:deployment_test-devops_dummy-project-depl-node-deployment_1 CurrentReplicas:0 DesiredReplicas:1 Status:not-ready Message:} in 15m0s"
On a side note, while debugging I tried the latest beta version with the trace log level, and noticed the "expiring" log is printed multiple times in a row. Maybe it's something, maybe not. I saw a similar behavior with Status:ready too:
When the debug log level is activated, this is printed every time a request is properly received; it helps you understand when a session's duration was last updated.
By any chance, how many replicas of Sablier do you have? It should not really matter, but it could lead to weird behavior, as this is not yet ready for clustering.
Oh alright, I didn't know about that.
Sablier is running a single replica, always has been.
I'm signing off for today, but let me know if you come up with any ideas. I'll post here if the issue happens again.
Another thought - could this have to do with Traefik's ServersTransport CRD, or whatever a servers transport is? I just updated the CRD to the v3.0 version.
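For what it's worth, a ServersTransport configures how Traefik connects to the backend servers behind a router (TLS verification, timeouts, and so on). A minimal sketch for Traefik v3, with purely illustrative values:

apiVersion: traefik.io/v1alpha1
kind: ServersTransport
metadata:
  name: default-transport      # hypothetical name
  namespace: test-devops
spec:
  # Controls how Traefik dials the backend pods; values are illustrative
  insecureSkipVerify: false
  forwardingTimeouts:
    dialTimeout: 30s
    responseHeaderTimeout: 30s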
Describe the bug
First of all, I appreciate the effort you've put into this project. It's helped a lot; I consider it to be a mini Knative.
With that out of the way, I'll share my issue. Recently, I upgraded Traefik to version v3.0.4. Since then, we've been noticing weird behaviors with Sablier. Essentially, sometimes the middleware just doesn't work. Instead of forwarding the client to the loading screen (I'm using the dynamic strategy), it forwards it directly to the Service. Then, because there are no available replicas and Traefik's allowEmptyServices is set to true, Traefik just returns 503 Service Unavailable with a "no service available" body.

I've been debugging for a short while, and one thought I had is that maybe Sablier was failing to update the sessions in memory. So I tried restarting the deployment and, surprisingly, this made everything go back to normal for a while. But then the same thing happened again.
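For context, allowEmptyServices is an option of Traefik's Kubernetes providers: when it is enabled, a router whose Service has zero ready endpoints is kept and answers 503 instead of disappearing. A sketch of where it sits in the static configuration, assuming the kubernetesCRD provider:

# traefik.yml (static configuration); values are illustrative
providers:
  kubernetesCRD:
    # Keep routers whose Service has zero ready endpoints, so requests
    # still reach the middleware chain and get a 503 rather than a 404
    allowEmptyServices: true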
Another weird thing I noticed is that even though I'm using a PersistentVolume to store Sablier's state, the container only accesses the file on start, and never again. I was able to cat the state file and noticed that even active deployments/sessions had incorrect expiresAt timestamps. For some reason, Sablier doesn't access the file after the server starts.

I've set the log level to trace, but I didn't see any helpful logs either. Here's my configuration.

Sablier Service (on Traefik's namespace):
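(The exact manifest isn't reproduced in this thread; the following is a minimal sketch of such a Service, assuming Sablier's default listen port 10000 and hypothetical names.)

apiVersion: v1
kind: Service
metadata:
  name: sablier                # hypothetical name
  namespace: traefik           # "Traefik's namespace"
spec:
  type: ClusterIP
  selector:
    app: sablier               # must match the Sablier pod labels
  ports:
    - name: http
      port: 10000              # Sablier's default listen port (assumed)
      targetPort: 10000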
And here's a sample Traefik Middleware CRD:
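(Likewise a sketch rather than the actual resource; the plugin field names below are as I understand them from the Sablier Traefik plugin documentation and should be checked against the version in use.)

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: sablier-dummy-project  # hypothetical name
  namespace: test-devops
spec:
  plugin:
    sablier:
      sablierUrl: http://sablier.traefik:10000   # the Service sketched above
      names: deployment_test-devops_dummy-project-depl-node-deployment_1
      sessionDuration: 15m     # matches the "in 15m0s" seen in the logs
      dynamic:
        displayName: dummy-project               # hypothetical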
Traefik logs do not show any clear errors; I only see normal request logs that return HTTP 200.

Context
Google Kubernetes Engine (v1.29.7-gke.1104000)

Expected behavior
Being forwarded to Sablier's wait screen from the dynamic strategy.