Helm Chart changes recreate gateway/ingress resources causing downtime

slim-bean commented 5 months ago

          For the loki chart we unfortunately had to face some downtime.

This changed https://github.com/grafana/loki/commit/79b876b65d55c54f4d532e98dc24743dea8bedec#diff-89f4fd98934eb0f277b921d45e4c223e168490c44604e454a2192d28dab1c3e2R4 forced the recreation of all the gateway resources: Deployment, Service, PodDisruptionBudget and most critical Ingress.

This is problematic for 2 reasons:

The deployment and service will immediately get traffic even though the pods are literally starting and most likely still in the ImagePull phase.
Replacing an ingress with the exact same hostname and path combination is problematic if you are running nginx ingress, as it is the case for a really good chunk of the community. This is in part because of their strict validating webhook that doesn't allow duplicate ingresses of that type. The only solution was to delete the ingress and quickly sync it, causing some downtime. Unfortunately promtail wasn't able to recover and send the accumulated log data. This is because it doesn't retry on 404 errors that happen if the ingress is deleted.

Originally posted by @tete17 in https://github.com/grafana/loki/issues/12506#issuecomment-2047017057

tete17 commented 5 months ago

Hi @slim-bean thanks for the interest.

I wonder if it is worth doing something here since if we roll it back again I would have to face some small downtime again but I am open to discussion since I can be quick with the upgrade and a few seconds of downtime are all good for me.

slim-bean commented 5 months ago

I'm trying to understand why it recreated the ingress, there wasn't actually any recent changes to that template.

What version of the chart were you upgrading from?

slim-bean commented 5 months ago

oh you explained why.... sorry still drinking coffee

slim-bean commented 5 months ago

@tete17 what is your Release.Name in helm? helm list I think shows this.

I'm wondering what we should do here, on the one hand I think it's more consistent/correct for us to use this loki.fullname

However, I don't know if there is any graceful way to make a change like this 😬

tete17 commented 5 months ago

I render out the yaml using helm template loki-simple-scalable only to be applied with argocd so my best guess is loki-simple-scalable.

The ingresses name switched from loki-gateway to loki-simple-scalable-gateway. Similar story for all the other resources.

I kind of agree that the consistency is a bonus and is good to have it.

For me the bare minimum would be to amend the release notes/upgrade guide to reference this change so people can expect it, but I understand writing into a guide that this upgrade requires downtime on the most popular ingress controller out there is though. The problem to me here is that even if you disable the validation webhook for a moment for this upgrade it is still not a 0 downtime as you need to get at least one deployment pod ready before applying the ingress or there won't be any pod serving traffic to the new service.

Alternatively if you are like me and use ArgoCD you can selectively sync/apply only the new deployment and then perform the full sync btu that is a complicated option and not one many people will be able to as selectively syncing helm pieces is probably complicated and you still need to instruct people to momentarily disable the admission webhook.

Not a single easy choice if you ask me.

slim-bean commented 5 months ago

@tete17 how did it originally fail, you applied the chart and got an error applying the ingress? were things still working at that point.

Then the problem is you have to do a delete and recreate which when you do a delete causes 404's which are very bad.

But there is a way to disable the validation webhook which would allow for the change without a delete?

Working on some updates to the upgrade guide now.

JohnFrampton commented 3 months ago

As far as I understand there could probably be a solution like this:

introduce a waitperiod for all components to use the ingress after a helm upgrade started. So all components do not communicate for e.g. 30 seconds.
sending out a message to promtail to notify promtail to buffer communication to the ingress for the amount of seconds of the waitperiod.
let the old ingress be removed by the upgrade and then introduce the new one, if they use identical/colliding configs

Yes, fore sure this is a big issue and needs introducing the "waitperiod" thing for all components which is (i suppose) big. But upgrading an ingress with its identical config must be possible automatically somehow because it will be the upgrade situation for most of the upgrades. And with this approach the loss of data may be reducible to zero.

This is just an idea but perhaps the experts might be interested :-)

grafana / loki

Helm Chart changes recreate gateway/ingress resources causing downtime #12554