Open slim-bean opened 5 months ago
Hi @slim-bean thanks for the interest.
I wonder if it is worth doing something here since if we roll it back again I would have to face some small downtime again but I am open to discussion since I can be quick with the upgrade and a few seconds of downtime are all good for me.
I'm trying to understand why it recreated the ingress, there wasn't actually any recent changes to that template.
What version of the chart were you upgrading from?
oh you explained why.... sorry still drinking coffee
@tete17 what is your Release.Name in helm? helm list
I think shows this.
I'm wondering what we should do here, on the one hand I think it's more consistent/correct for us to use this loki.fullname
However, I don't know if there is any graceful way to make a change like this 😬
I render out the yaml using helm template loki-simple-scalable
only to be applied with argocd so my best guess is loki-simple-scalable
.
The ingresses name switched from loki-gateway
to loki-simple-scalable-gateway
. Similar story for all the other resources.
I kind of agree that the consistency is a bonus and is good to have it.
For me the bare minimum would be to amend the release notes/upgrade guide to reference this change so people can expect it, but I understand writing into a guide that this upgrade requires downtime on the most popular ingress controller out there is though. The problem to me here is that even if you disable the validation webhook for a moment for this upgrade it is still not a 0 downtime as you need to get at least one deployment pod ready before applying the ingress or there won't be any pod serving traffic to the new service.
Alternatively if you are like me and use ArgoCD you can selectively sync/apply only the new deployment and then perform the full sync btu that is a complicated option and not one many people will be able to as selectively syncing helm pieces is probably complicated and you still need to instruct people to momentarily disable the admission webhook.
Not a single easy choice if you ask me.
@tete17 how did it originally fail, you applied the chart and got an error applying the ingress? were things still working at that point.
Then the problem is you have to do a delete and recreate which when you do a delete causes 404's which are very bad.
But there is a way to disable the validation webhook which would allow for the change without a delete?
Working on some updates to the upgrade guide now.
As far as I understand there could probably be a solution like this:
Yes, fore sure this is a big issue and needs introducing the "waitperiod" thing for all components which is (i suppose) big. But upgrading an ingress with its identical config must be possible automatically somehow because it will be the upgrade situation for most of the upgrades. And with this approach the loss of data may be reducible to zero.
This is just an idea but perhaps the experts might be interested :-)
This changed https://github.com/grafana/loki/commit/79b876b65d55c54f4d532e98dc24743dea8bedec#diff-89f4fd98934eb0f277b921d45e4c223e168490c44604e454a2192d28dab1c3e2R4 forced the recreation of all the gateway resources:
Deployment
,Service
,PodDisruptionBudget
and most criticalIngress
.This is problematic for 2 reasons:
Originally posted by @tete17 in https://github.com/grafana/loki/issues/12506#issuecomment-2047017057