helm / community

Helm community content
https://helm.sh

feat: autorecover from stuck situations #354

Open gerrnot opened 2 months ago

gerrnot commented 2 months ago

Since there was no veto on the Helm mailing list and the only explicit feedback was positive, coming from a Rancher representative (see references), besides the many likes in various related bug threads, I am submitting this new HIP...

lindhe commented 2 months ago

Hey, I just want to say that you are awesome for driving this HIP! 💪 I think there is a dire need for this. I've seen many people lose trust in Helm because of things getting stuck irrecoverably; it would be so nice to redeem Helm. Also, this will be of great value for all the tools that depend on Helm (like Rancher).

I'll read your document and put down some comments. On a first read, it's mostly formatting things or typos, I'm guessing. I might contribute something better once I've digested your suggestion.

Tell me if I can help!

lindhe commented 2 months ago

> The --timeout parameter gets a deeper meaning. Previously the --timeout parameter only had an effect on the helm process running on the respective client. After implementation, the --timeout parameter will be stored in the helm release object (secret) in k8s and have an indirect impact on possible parallel processes.

I think it is really clever reusing the --timeout flag for this purpose! Nice.
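For illustration, here is a minimal sketch of how a lock derived from `--timeout` could be persisted next to the release; the `LockInfo` struct and its fields are hypothetical and not taken from the HIP, they only show the idea of storing the client's timeout so that parallel processes can reason about expiry:

```go
package main

import (
	"fmt"
	"time"
)

// LockInfo is a hypothetical record persisted with the release so that other
// helm processes can tell when a stuck operation may be taken over. The
// struct and field names are illustrative, not taken from the HIP.
type LockInfo struct {
	Holder     string        `json:"holder"`     // e.g. hostname/pid of the helm client
	AcquiredAt time.Time     `json:"acquiredAt"` // when the operation started
	Timeout    time.Duration `json:"timeout"`    // the client's --timeout, stored with the release
}

// Expired reports whether another client may assume the previous operation
// is stuck and recover the release.
func (l LockInfo) Expired(now time.Time) bool {
	return now.After(l.AcquiredAt.Add(l.Timeout))
}

func main() {
	lock := LockInfo{
		Holder:     "ci-runner-42",
		AcquiredAt: time.Now().Add(-10 * time.Minute),
		Timeout:    5 * time.Minute,
	}
	fmt.Println("lock expired:", lock.Expired(time.Now()))
}
```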

lindhe commented 2 months ago

I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully.

lindhe commented 2 months ago

I think you are on the right track with your implementation, trying to keep this lock inside the Helm release object. That way, we do not introduce new objects to the Helm ecosystem, which is a great thing.

That said, I just want to mention that there is a Lease object in Kubernetes. If we were looking for alternatives, that might be a sensible thing to use.
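For comparison, a minimal sketch of what a Lease-based lock could look like with client-go; the lease name, holder identity, and staleness rule are assumptions for illustration, not something the HIP specifies:

```go
package helmlock

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// acquireLease tries to take a per-release lock by creating a Lease object.
// Returning false means another process currently holds the lock; the caller
// could then inspect renewTime + leaseDurationSeconds on the existing Lease
// to decide whether it is stale and may be taken over.
func acquireLease(ctx context.Context, client kubernetes.Interface, ns, releaseName string, timeout time.Duration) (bool, error) {
	holder := "helm-client" // hypothetical identity, e.g. hostname/pid
	duration := int32(timeout.Seconds())
	now := metav1.NewMicroTime(time.Now())

	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{Name: "helm-lock-" + releaseName, Namespace: ns},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       &holder,
			LeaseDurationSeconds: &duration,
			AcquireTime:          &now,
			RenewTime:            &now,
		},
	}

	_, err := client.CoordinationV1().Leases(ns).Create(ctx, lease, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return false, nil
	}
	return err == nil, err
}
```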

gerrnot commented 2 months ago

> I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully.

There has already been a mechanism to detect concurrency since a certain Helm 3 version. It is based on the "pending" states in the release object. While implemented with good intentions, it has the consequence that one cannot recover anymore in bad cases. So I would not mention this as an additional benefit. Nevertheless, if you feel that additional benefits should be mentioned, feel free to suggest an edit again; I will happily merge it if it increases the chances of the HIP.
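For context, that existing guard works roughly like the following paraphrase (not the exact upstream code): if the latest revision is still in a pending state, the new operation is rejected, and nothing ever clears that state if the client that set it died:

```go
package helmlock

import (
	"errors"

	"helm.sh/helm/v3/pkg/release"
)

var errPending = errors.New("another operation (install/upgrade/rollback) is in progress")

// checkPending mirrors (approximately) the guard Helm 3 applies before an
// install/upgrade/rollback: if the latest revision is stuck in a pending
// state, the new operation is refused. If the client that set the pending
// state died, nothing ever clears it, which is the stuck situation this HIP
// targets.
func checkPending(last *release.Release) error {
	if last != nil && last.Info.Status.IsPending() {
		return errPending
	}
	return nil
}
```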


One more thing to discuss that I realized during the implementation: Helm always uses client time, e.g. for the creation of the UPDATED field.

If I were now to introduce the k8s server time, it could lead to weird timestamp deviations between the other Helm timestamp fields and the ones used for locking. While I am still convinced that server time is better, it would make the most sense only if this were refactored across the entire codebase, which I guess is out of scope and would require an independent HIP. What do you think?
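To make the client-time vs. server-time point concrete, a minimal sketch of how only the lock could use a server-assigned timestamp (the creationTimestamp the API server stamps on a freshly created lock object) while everything else keeps using the client clock; the helper and its name are hypothetical:

```go
package helmlock

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// lockTimestamp is a hypothetical helper: it prefers the server-assigned
// creationTimestamp of a just-created lock object (API-server clock) and
// falls back to the client clock. All other release fields (e.g. UPDATED)
// would still carry client time, which is the deviation discussed above.
func lockTimestamp(created *metav1.ObjectMeta) time.Time {
	if created != nil && !created.CreationTimestamp.IsZero() {
		return created.CreationTimestamp.Time
	}
	return time.Now()
}
```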

kfox1111 commented 2 months ago

Kubernetes has distributed locking primitives, I think? Couldn't one of those be used?

gerrnot commented 2 months ago

@kfox1111 I assume you mean the Lease object as suggested by @lindhe.

Yes, that could be used; it is just that the release state would then be distributed across:

  1. the traditional helm release secret
  2. the lock state in k8s

Since I consider the locked state an attribute of the release state, the question is whether it is worth storing it somewhere else...

Nevertheless, it could improve backwards compatibility if we treat it as something that is honored when present and ignored otherwise.
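A minimal sketch of that honored-when-present behaviour, assuming the hypothetical Lease-based lock sketched above; treating NotFound or Forbidden as "no lock" keeps older clients and restricted users working as today:

```go
package helmlock

import (
	"context"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// currentLock looks up the hypothetical per-release Lease. A missing Lease,
// or a user who is not allowed to read Leases, is treated as "no lock", so
// older clients and restricted environments keep behaving as today.
func currentLock(ctx context.Context, client kubernetes.Interface, ns, releaseName string) (*coordinationv1.Lease, error) {
	lease, err := client.CoordinationV1().Leases(ns).Get(ctx, "helm-lock-"+releaseName, metav1.GetOptions{})
	switch {
	case err == nil:
		return lease, nil // lock present: honor it
	case apierrors.IsNotFound(err) || apierrors.IsForbidden(err):
		return nil, nil // no lock visible: fall back to current behavior
	default:
		return nil, err
	}
}
```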

lindhe commented 2 months ago

> One more thing to discuss that I realized during the implementation: Helm always uses client time, e.g. for the creation of the UPDATED field.

> If I were now to introduce the k8s server time, it could lead to weird timestamp deviations between the other Helm timestamp fields and the ones used for locking. While I am still convinced that server time is better, it would make the most sense only if this were refactored across the entire codebase, which I guess is out of scope and would require an independent HIP. What do you think?

Sounds like a bummer. I agree 1000% that server time is superior. I would go so far as to say that it is the only sensible thing to use. But I also agree that it probably belongs to another PR (unfortunately).

gerrnot commented 1 month ago

Original HIP draft: storing the lock in the k8s secret


Possible change: storing the lock as a native k8s lock (Lease)

If agreement is reached that such a solution is better overall, I can invest the effort of rewriting the HIP and the draft code.

Benefits

Drawbacks

Based on this, @kfox1111 do you still think the k8s lock (Lease) should be used? I think it might be OK because I estimate that most people run the helm client with a user that has namespace-admin permissions.

So I'm only asking for a final OK from a maintainer on the Lease solution before starting further work; I personally think the benefits of the Lease slightly outweigh those of the alternative solution.