Closed tomaszduda23 closed 10 months ago
This is also probably related to #62; increasing the refresh_frequency should reduce this kind of issue. But yes, it means I detect the service as available faster than Traefik does.
I could patch this to add an integration with the Traefik API so I could redirect only when the service is available from within Traefik, but I'm not sure that's a good idea...
I can see a few options to solve it:

1. Ask the client to refresh again if the pod has been ready for less than x seconds.

I'm not sure here, but it seems to be some kind of "wait" hack, which means it's unpredictable, right?

2. Use the Traefik API for validation. It can be done only once, when the pod scales from 0 to 1.

This is probably the best solution, but also the worst. Having reverse-proxy configuration done by sablier is very bad design IMO.

3. Check the status code in the Traefik middleware. If 503, refresh again. It could also be limited to startup only.

In the middleware, I cannot act on the response from the forwarded service; I can only act "before" and either answer directly or pass the request down to the service.

Still, if I could catch the 503, I'm not sure how to tell whether the 503 is really triggered by Traefik or by the service itself.
You could try putting a Retry Traefik middleware as the frontman of the middleware chain for your services, but I think the result would be the same: you'd end up retrying failures from your service to patch the race condition in Traefik.
What do you think?
> I'm not sure here, but it seems to be some kind of "wait" hack, which means it's unpredictable, right?
You could make it configurable, e.g. a warm-up-period with a default value of 5 seconds. The user can make the warm-up-period big enough. Not perfect, but it would do the job.
> This is probably the best solution, but also the worst. Having reverse-proxy configuration done by sablier is very bad design IMO.
sablier would just have to check the readiness of the reverse proxy before handling requests. It is a little complex, since you need to expose the Traefik API. Complexity is the main concern here; less configuration is usually better.
> In the middleware, I cannot act on the response from the forwarded service; I can only act "before" and either answer directly or pass the request down to the service.
Why? This is how https://github.com/traefik/traefik/blob/master/pkg/middlewares/retry/retry.go works.
> Still, if I could catch the 503, I'm not sure how to tell whether the 503 is really triggered by Traefik or by the service itself.
E.g. you could have 3 states:
If you get a 503 while the pod is in the starting state, you can assume that it comes from Traefik or anything else in the pipeline.
You could try putting a Retry Traefik Middleware as the frontman of the middleware chain for your services. But the result would be the same I think.
You'd end up retrying failures from your service to patch the race condition on Traefik.
I guess it won't work, since the retry middleware will be attached to the old router. The old router never gets the service added. After a configuration change, Traefik creates a new router and lets current connections be handled by the old one. This is why it does not work for https://github.com/acouvreur/sablier/issues/62
Or ideas 1 and 3 could be combined: if a 503 occurs during x seconds after the pod started, assume that it comes from Traefik.
> Why? This is how https://github.com/traefik/traefik/blob/master/pkg/middlewares/retry/retry.go works.
Oh wow, you're absolutely right. I might try to fix this in the plugin only.
> I guess it won't work, since the retry middleware will be attached to the old router. The old router never gets the service added. After a configuration change, Traefik creates a new router and lets current connections be handled by the old one. This is why it does not work for https://github.com/acouvreur/sablier/issues/62
But then, how will retrying affect this issue?
Because if I retry, it will always be on the same router, right?
You should ask the client to retry.
Well, I see they are very aware of their limitations.
The retry middleware will fail immediately anyway.
But I might just add this piece of code to the Traefik plugin in order to redirect to this issue.
> Still, if I could catch the 503, I'm not sure how to tell whether the 503 is really triggered by Traefik or by the service itself.
So this is a clever way to distinguish whether the 503 is triggered by Traefik or by the service itself: https://github.com/traefik/traefik/blob/8174860770e536b4afb541e0ab13b3611a101430/pkg/middlewares/retry/retry.go#L86
Describe the bug
Service Unavailable happens when the page is refreshed manually during scale-up. It seems there is a race condition between sablier and the reloading of the Traefik configuration: sablier checks whether the pod is ready, and that check happens independently from the Traefik configuration reload.
Context
Expected behavior
Service Unavailable should not happen.