Closed — lresende closed this 1 year ago
@kevin-bates couple of questions here: what might be the better strategy? Mostly, should failures be detected as quickly as in `get_container_status`, or should we use `detect_launch_failure`? My understanding is that, even if `detect_launch_failure` notifies there is a failure, it keeps retrying until timeout, which I believe it should not in this case. It looks like it's mostly working now: it can properly identify and fail quickly when a CRD fails, but it seems there is a bug somewhere that, in the success case, is causing the kernel to restart.
> What might be the better strategy here? Mostly if failures should be detected as quickly as in `get_container_status` or use `detect_launch_failure`.
The purpose of `detect_launch_failure` is purely to check the status of the process that was invoked via the `argv` stanza; it is a check for whether the launch even got off the ground. In addition, it is essentially resource-manager-independent, so I think `get_container_status` is probably what you want.
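For context, the fail-fast behavior under discussion can be sketched as a simple polling loop. This is a minimal, self-contained illustration only; the function names, phase strings, and parameters here are assumptions for the sketch, not Enterprise Gateway's actual API:

```python
import time

# Hypothetical terminal phases a container/CRD status check might report.
TERMINAL_FAILURE_PHASES = {"Failed", "Error"}


def confirm_remote_startup(get_container_status, timeout=30.0, poll_interval=1.0):
    """Poll a status callable until the container is running, raising
    immediately on a terminal failure instead of retrying until timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_container_status()
        if status == "Running":
            return status
        if status in TERMINAL_FAILURE_PHASES:
            # Fail fast: a terminal phase will not recover, so waiting
            # for the full timeout only delays the error to the user.
            raise RuntimeError(f"Launch failed with terminal status: {status}")
        time.sleep(poll_interval)
    raise TimeoutError("Container did not reach Running state in time")
```

The design point is the middle branch: once the status check reports a terminal phase, there is nothing left to retry, so the loop surfaces the error immediately rather than burning the remainder of the timeout.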
But what are your thoughts on what might be causing the restarts?
> But what are your thoughts on what might be causing the restarts?
No idea. Where is the information needed to analyze that? (A dozen lines of nudge logging isn't sufficient.)

If the restarts are automatic, that implies the process proxy is unable to determine that the pod is alive, which is likely due to k8s resources, etc. (since this isn't generally reproducible).

If we're going purely by the information in #1266, I can't tell that this is even a restart; it could be the initial startup. Also, since there is no Pod IP or Status, I think that implies k8s has yet to schedule the pod. I would try to issue some `describe` commands against the pod, assuming you see it running. There's just not enough information here for me to answer your question.
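For reference, the kind of `describe` commands suggested above might look like the following. The pod name and namespace are placeholders, not values from this issue:

```shell
# Show scheduling events, container states, and any image-pull or
# resource errors for the kernel pod (placeholder name/namespace):
kubectl describe pod <kernel-pod-name> -n <namespace>

# Quick check for Pod IP and Status across pods in the namespace:
kubectl get pods -n <namespace> -o wide
```

The `Events` section at the bottom of the `describe` output is usually where an unscheduled pod (no Pod IP, no Status) explains itself, e.g. insufficient resources or a failed image pull.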
All is good after the latest refactoring of the logs. Thank you, @kevin-bates.
Properly propagate errors on Spark Operator CRD failures
Fixes #1266. Should also address #1258.