jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Properly propagate errors on Spark Operator CRD failures #1271

Closed. lresende closed this 1 year ago

lresende commented 1 year ago

Properly propagate errors on Spark Operator CRD failures

Fixes #1266 Should also address #1258

lresende commented 1 year ago

@kevin-bates, a couple of questions here:

lresende commented 1 year ago

It looks like it's mostly working now: it can properly identify the failure and fail quickly when a CRD fails, but there seems to be a bug somewhere that, in the success case, is causing the kernel to restart.

kevin-bates commented 1 year ago

> What might be the better strategy here? Mainly, should failures be detected as early as possible in get_container_status, or should we use detect_launch_failure?

The purpose of detect_launch_failure is purely to check the status of the process that was invoked via the argv stanza; it is a check for whether the launch even got off the ground. It is also essentially resource-manager-independent, so I think get_container_status is probably what you want.
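To make the distinction concrete, here is a minimal sketch of how a get_container_status-style check could surface a CRD failure quickly instead of waiting for a timeout. The field names mirror the Spark Operator's SparkApplication status layout, but the function signature, the terminal-state set, and the RuntimeError message are assumptions for illustration, not Enterprise Gateway's actual API:

```python
# Hypothetical sketch: inspect a SparkApplication custom resource dict and
# fail fast when the operator reports a terminal failure state.
TERMINAL_FAILURE_STATES = {"FAILED", "SUBMISSION_FAILED", "UNKNOWN"}


def get_container_status(custom_resource: dict):
    """Return the operator-reported application state, raising on terminal failure."""
    status = custom_resource.get("status", {})
    app_state = status.get("applicationState", {}).get("state", "")
    if app_state in TERMINAL_FAILURE_STATES:
        # Propagate the operator's own error message to the caller so the
        # launch fails immediately rather than timing out.
        detail = status.get("applicationState", {}).get("errorMessage", "no detail")
        raise RuntimeError(f"Spark Operator reported state '{app_state}': {detail}")
    # Otherwise report the current state (e.g. SUBMITTED/RUNNING), if known.
    return app_state or None
```

A launch loop polling this function would keep waiting while the state is empty or transient, and abort with the operator's error message the moment a terminal failure appears.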

lresende commented 1 year ago

But what are your thoughts on what might be causing the restarts?

kevin-bates commented 1 year ago

> But what are your thoughts on what might be causing the restarts?

No idea. Where is the information needed to analyze that? (A dozen lines of nudge logging isn't sufficient.)

If the restarts are automatic, that implies the process proxy is unable to determine that the pod is alive, which is likely due to k8s resource constraints, etc. (since this isn't generally reproducible).

If we're going purely by the information in #1266, I can't tell that this is even a restart; it could be the initial startup. Also, since there is no Pod IP or Status, I think that implies k8s has yet to schedule the pod. I would try issuing some describe commands against the pod, assuming you see it running. There's just not enough information here for me to answer your question.
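For reference, the kind of inspection suggested above can be done with commands like the following; the namespace and resource names are placeholders, not values from this issue:

```shell
# List pods in the kernel's namespace, including node and IP assignment
kubectl get pods -n my-kernel-namespace -o wide

# Show the driver pod's events and conditions (scheduling failures,
# image pull errors, etc. appear in the Events section)
kubectl describe pod my-kernel-driver -n my-kernel-namespace

# The SparkApplication custom resource carries the operator's own status
kubectl describe sparkapplication my-kernel -n my-kernel-namespace
```

The Events section of the describe output is usually where a pod that is stuck in Pending (no Pod IP or Status yet) explains itself.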

lresende commented 1 year ago

All is good after the latest refactoring for the logs. Thank you @kevin-bates