Open CarlosDominguezBecerril opened 1 year ago
Kubernetes is designed to work asynchronously so I don't think we should kill the pods straight away. For example, if some other resource takes time to update, or if there's a temporary glitch with the container registry, retrying is the correct behaviour.
I agree that, ideally, we shouldn't kill the pods straight away. However, in the rare cases where retrying cannot help regardless of the underlying issue, I believe we could kill the pod directly to save time and resources. I'm probably noticing this more because my timeout is 20 minutes.
I'll +1 this, because even with a 5-minute timeout it gets a little cumbersome if you accidentally start a server with an incorrect configuration, e.g. a misspelled image name or a wrong config value.
Maybe another solution is to introduce an alternative "max-retries" instead of "timeout"?
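To make the "max-retries" idea concrete, here is a minimal sketch of a retry budget. It is only an illustration: `RetryBudget` and `record_failure` are hypothetical names, not part of kubespawner, and what counts as one "retry" (a failed image pull, a failed probe, and so on) is an assumption the caller would have to decide.

```python
class RetryBudget:
    """Count failures and report when a configured limit is exceeded.

    Hypothetical helper for illustration; not a kubespawner API.
    """

    def __init__(self, max_retries):
        if max_retries < 0:
            raise ValueError("max_retries must be >= 0")
        self.max_retries = max_retries
        self.failures = 0

    def record_failure(self):
        """Register one failure; return True once the budget is exhausted."""
        self.failures += 1
        return self.failures > self.max_retries


# Usage: allow two retries, give up on the third failure.
budget = RetryBudget(max_retries=2)
results = [budget.record_failure() for _ in range(3)]
```

Unlike a timeout, a budget like this would not depend on how long each attempt takes, which may matter when a large image pull is slow but otherwise healthy.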
Proposed change
In our current setup, when we create a server, I can encounter any of the following errors:
"probe failed", "ErrImagePull", "ImagePullBackOff"
Example:
2023-09-21T14:50:07Z [Warning] Readiness probe failed: Get "http://{my_ip}:8658/health/ready": dial tcp {my_ip}:8658: connect: connection refused
When this occurs, I have to wait for kubespawner to time out. In my case, the timeout is set to 20 minutes because occasionally I need to pull a large Docker image.
It would be great to have a feature that automatically deletes the pod when certain Kubernetes errors are detected (matched against error messages supplied by the user), to avoid unnecessary waiting.
Alternative options
Who would use this feature?
Anyone who wants to kill their pods earlier when certain error messages appear.
(Optional): Suggest a solution
My simple patch is for version 4.3.0 (I can't update to a newer version because of our k8s version). I think the code explains it better, but the idea is to have a variable like
self.kill_messages = ["probe failed", "ErrImagePull", "ImagePullBackOff"]
with the error messages that must kill the pod when found. If any of these messages is found, the patch makes exponential_backoff raise an error (the code can definitely be improved).
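A minimal, self-contained sketch of that idea follows. It is not the actual patch and not kubespawner's code: `FatalPodError`, `find_kill_message`, `wait_until_ready`, and the `is_ready` / `get_event_messages` callables are all hypothetical names, and the backoff loop is a simplified stand-in written only to show where the check would go. Plain case-sensitive substring matching is also an assumption.

```python
import asyncio

# Substrings that mark a pod event as fatal (the list from this issue).
KILL_MESSAGES = ["probe failed", "ErrImagePull", "ImagePullBackOff"]


class FatalPodError(Exception):
    """Raised when a pod event matches a kill message (hypothetical)."""


def find_kill_message(event_messages, kill_messages=KILL_MESSAGES):
    """Return the first kill message contained in any event text, else None."""
    for text in event_messages:
        for needle in kill_messages:
            if needle in text:
                return needle
    return None


async def wait_until_ready(is_ready, get_event_messages, timeout=1200.0,
                           start_wait=0.2, scale_factor=2.0, max_wait=5.0):
    """Poll with exponential backoff, failing fast on a kill message.

    is_ready and get_event_messages are caller-supplied callables
    (assumptions for this sketch, not kubespawner functions).
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    wait = start_wait
    while loop.time() < deadline:
        # Check events first so a fatal error ends the wait immediately.
        hit = find_kill_message(get_event_messages())
        if hit is not None:
            raise FatalPodError(f"pod event matched kill message: {hit!r}")
        if is_ready():
            return
        await asyncio.sleep(wait)
        wait = min(wait * scale_factor, max_wait)
    raise TimeoutError("pod did not become ready before the timeout")
```

With the readiness-probe event from the example above, `find_kill_message` would return "probe failed" and the wait would end on the first poll instead of after the full 20-minute timeout. The caller would still be responsible for deleting the pod once the error is raised.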