sablierapp / sablier

Start your containers on demand, shut them down automatically when there's no activity. Docker, Docker Swarm Mode and Kubernetes compatible.
https://sablierapp.dev/
GNU Affero General Public License v3.0

Service Unavailable with strategy dynamic #143

Closed · tomaszduda23 closed this issue 10 months ago

tomaszduda23 commented 1 year ago

Describe the bug
Service Unavailable happens when the page is refreshed manually during scale-up. It seems there is a race condition between Sablier and Traefik's configuration reload: Sablier checks whether the pod is ready, which happens independently of Traefik reloading its configuration.

Expected behavior
Service Unavailable should not happen.

acouvreur commented 1 year ago

This is also probably related to #62; increasing the refresh_frequency should reduce this kind of issue. But yeah, it means I detect the service as available faster than Traefik does.

I could patch this by adding an integration with the Traefik API, so I would redirect only when the service is available from within Traefik, but I'm not sure that's a good idea...

tomaszduda23 commented 1 year ago

I can see a few options to solve it:

  1. Ask the client to refresh again if the pod has been ready for less than x seconds.
  2. Use the Traefik API for validation. It only has to be done once, when the pod scales from 0 to 1.
  3. Check the status code in a Traefik middleware. If it is a 503, refresh again. This could also be limited to startup only (see the sketch below).
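
For illustration, here is a minimal Go sketch of option 3, written as a generic net/http middleware rather than an actual Traefik plugin; the startedAt lookup, the window size, and all names are assumptions, not Sablier's or Traefik's API:

```go
package middleware

import (
	"bytes"
	"net/http"
	"time"
)

// retryOn503 buffers the downstream response; a 503 that arrives shortly
// after the instance started is treated as Traefik routing lag, and the
// client is asked to refresh instead of seeing the error.
type retryOn503 struct {
	next      http.Handler
	startedAt func() time.Time // hypothetical lookup: when the instance last started
	window    time.Duration    // e.g. 5s: 503s inside this window count as startup noise
}

// bufferingWriter records status, headers and body so the middleware can
// decide whether to forward the response or replace it.
type bufferingWriter struct {
	header http.Header
	status int
	body   bytes.Buffer
}

func (w *bufferingWriter) Header() http.Header         { return w.header }
func (w *bufferingWriter) WriteHeader(code int)        { w.status = code }
func (w *bufferingWriter) Write(p []byte) (int, error) { return w.body.Write(p) }

func (m *retryOn503) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
	buf := &bufferingWriter{header: http.Header{}, status: http.StatusOK}
	m.next.ServeHTTP(buf, req)

	if buf.status == http.StatusServiceUnavailable && time.Since(m.startedAt()) < m.window {
		// Likely Traefik has not picked up the new configuration yet:
		// ask the browser to retry in a second instead of showing the error.
		rw.Header().Set("Refresh", "1")
		rw.WriteHeader(http.StatusServiceUnavailable)
		_, _ = rw.Write([]byte("service is starting, retrying..."))
		return
	}

	// Otherwise forward the buffered response unchanged.
	for k, vs := range buf.header {
		for _, v := range vs {
			rw.Header().Add(k, v)
		}
	}
	rw.WriteHeader(buf.status)
	_, _ = rw.Write(buf.body.Bytes())
}
```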
acouvreur commented 1 year ago

Ask the client to refresh again if the pod has been ready for less than x seconds.

I'm not sure here, but it seems to be some kind of wait hack, which means it's unpredictable, right?

Use the Traefik API for validation. It only has to be done once, when the pod scales from 0 to 1.

This is probably the best solution, but also the worst. Having reverse-proxy configuration handled by Sablier is very bad design IMO.

Check the status code in a Traefik middleware. If it is a 503, refresh again. This could also be limited to startup only.

In the middleware, I cannot act on the response from the forwarded service; I can only act "before", and either answer directly or pass the request down to the service.

Still, if I could catch the 503, I'm not sure how to tell whether the 503 was really triggered by Traefik or by the service itself.

You could try putting a Retry Traefik middleware at the front of the middleware chain for your services. But the result would be the same, I think.

You'd end up retrying failures from your service just to patch the race condition in Traefik.

What do you think?

tomaszduda23 commented 1 year ago

I'm not sure here, but it seems to be some kind of wait hack, which means it's unpredictable, right?

You could make it configurable, e.g. a warm-up-period with a default value of 5 seconds. The user can make warm-up-period big enough. Not perfect, but it would do the job.
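
A minimal sketch of that idea, assuming a hypothetical Instance type with a ReadyAt timestamp (none of these names are Sablier's actual API):

```go
package sablier

import "time"

// Instance is a hypothetical stand-in for a scaled workload.
type Instance struct {
	Ready   bool
	ReadyAt time.Time // when the pod became ready
}

const defaultWarmUpPeriod = 5 * time.Second

// ReadyForTraffic only reports true once the instance has been ready for
// at least warmUpPeriod, giving Traefik time to reload its configuration.
// Until then, Sablier would keep serving the waiting page.
func (i Instance) ReadyForTraffic(warmUpPeriod time.Duration) bool {
	if warmUpPeriod <= 0 {
		warmUpPeriod = defaultWarmUpPeriod
	}
	return i.Ready && time.Since(i.ReadyAt) >= warmUpPeriod
}
```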

This is probably the best solution, but also the worst. Having reverse-proxy configuration handled by Sablier is very bad design IMO.

Sablier would just have to check that the reverse proxy is ready to handle requests. It is a little complex, since you need to expose the Traefik API. Complexity is the main concern here. Less configuration is usually better.

In the middleware, I cannot act on the response from the forwarded service; I can only act "before", and either answer directly or pass the request down to the service.

Why? This is how https://github.com/traefik/traefik/blob/master/pkg/middlewares/retry/retry.go works.
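
For reference, the pattern retry.go relies on can be boiled down to this (a simplified sketch, not Traefik's actual code): the middleware hands the next handler a wrapped ResponseWriter and inspects what was written once the call returns, so it can act "after" the service as well.

```go
package middleware

import "net/http"

// statusRecorder wraps the real ResponseWriter and remembers the status
// code written by the downstream handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// withStatus lets a middleware observe the downstream response status
// after next.ServeHTTP returns.
func withStatus(next http.Handler, onStatus func(int)) http.Handler {
	return http.HandlerFunc(func(rw http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: rw, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		onStatus(rec.status)
	})
}
```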

Still, if I could catch the 503, I'm not sure how to tell whether the 503 was really triggered by Traefik or by the service itself.

E.g. you could have 3 states:

  1. ready
  2. not-ready
  3. starting <- you can keep this state between the pod starting and the first request made to the endpoint

If you get a 503 during starting, you can assume it comes from Traefik or anything else in the pipeline.
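
A minimal sketch of that state model (all names hypothetical, not Sablier's API):

```go
package sablier

// State is a hypothetical lifecycle state for an instance.
type State int

const (
	NotReady State = iota
	Starting // between the pod becoming ready and the first request reaching it
	Ready
)

// shouldRetryOn503 assumes a 503 seen while Starting comes from Traefik
// (or anything else in the pipeline), not from the service itself.
func shouldRetryOn503(s State) bool {
	return s == Starting
}

// advance promotes Starting to Ready once a request has actually
// reached the endpoint successfully.
func advance(s State, requestSucceeded bool) State {
	if s == Starting && requestSucceeded {
		return Ready
	}
	return s
}
```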

You could try putting a Retry Traefik middleware at the front of the middleware chain for your services. But the result would be the same, I think.

You'd end up retrying failures from your service just to patch the race condition in Traefik.

I guess it won't work, since the retry middleware would be attached to the old router. The old router never gets the service added. After a configuration change, Traefik creates a new router and lets current connections be handled by the old one. This is why it does not work for https://github.com/acouvreur/sablier/issues/62.

tomaszduda23 commented 1 year ago

Or ideas 1 and 3 could be combined: if a 503 happens during the x seconds after the pod started, let's assume it comes from Traefik.

acouvreur commented 1 year ago

Why? This is how https://github.com/traefik/traefik/blob/master/pkg/middlewares/retry/retry.go works.

Oh wow, you're absolutely right. I might try to fix this in the plugin only.

acouvreur commented 1 year ago

I guess it won't work, since the retry middleware would be attached to the old router. The old router never gets the service added. After a configuration change, Traefik creates a new router and lets current connections be handled by the old one. This is why it does not work for https://github.com/acouvreur/sablier/issues/62.

But then, how would retrying affect this issue?

Because if I retry, it will always be on the same router, right?

tomaszduda23 commented 1 year ago

You should ask the client to retry.

acouvreur commented 1 year ago

Well, I see they are very aware of their limitations.

https://github.com/traefik/traefik/blob/8174860770e536b4afb541e0ab13b3611a101430/pkg/middlewares/retry/retry.go#L187-L194

The retry middleware will fail immediately anyway.

But I might just add this piece of code to the Traefik plugin in order to address this issue.

tomaszduda23 commented 1 year ago

Still, if I could catch the 503, I'm not sure how to tell whether the 503 was really triggered by Traefik or by the service itself.

So this is a clever way to distinguish whether a 503 is triggered by Traefik or by the service itself: https://github.com/traefik/traefik/blob/8174860770e536b4afb541e0ab13b3611a101430/pkg/middlewares/retry/retry.go#L86
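
The gist, as far as I can tell (a simplified sketch of the pattern, not Traefik's actual code): while another attempt is still allowed, the retry middleware suppresses writes to the client, so a failed attempt (including a 503 generated inside Traefik) leaves the connection untouched and can be replayed; once no retry remains, writes pass straight through.

```go
package middleware

import "net/http"

// retryWriter suppresses writes while another attempt is still possible,
// so a failed attempt never reaches the client and can be replayed.
type retryWriter struct {
	rw          http.ResponseWriter
	shouldRetry bool // true while at least one more attempt remains
}

func (r *retryWriter) Header() http.Header { return r.rw.Header() }

func (r *retryWriter) WriteHeader(code int) {
	if r.shouldRetry {
		return // swallow the status: this attempt may still be retried
	}
	r.rw.WriteHeader(code)
}

func (r *retryWriter) Write(p []byte) (int, error) {
	if r.shouldRetry {
		return len(p), nil // pretend success; nothing reaches the client
	}
	return r.rw.Write(p)
}
```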