Open RamakrishnaHande opened 1 year ago
I have faced the same issue. After uninstalling and reinstalling the Spark application, the status went to RUNNING. But there is no way to identify such issues in real time, so a Prometheus metric would help.
I am using apiVersion: "sparkoperator.k8s.io/v1beta2" with kind: SparkApplication, under which there are many Spark jobs. However, one of the SparkApplications has been in the PENDING_RERUN state for more than 35 days. We want to detect this problem and raise a Prometheus alert. Is there any Prometheus query that catches this condition? My kubectl query gives this result:
kubectl get sparkapplication -n <>
NAME   STATUS          ATTEMPTS   START   FINISH   AGE
A      PENDING_RERUN                               35d
I had a look at the documentation and tried all of these metrics:
"spark_app_count" "spark_app_submit_count" "spark_app_success_count" "spark_app_failure_count" "spark_app_running_count"
However, none of these catches the PENDING_RERUN state.
Thanks in advance.
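Until a metric covers this state, one possible stopgap is to poll the API and flag applications stuck in PENDING_RERUN. The sketch below is only an illustration, not something the operator provides: `find_stuck_apps` and `parse_k8s_time` are hypothetical helper names, the one-hour threshold is arbitrary, and it assumes the state is exposed at `.status.applicationState.state` in the JSON returned by `kubectl get sparkapplication -o json`. It measures age from `metadata.creationTimestamp` (the AGE column above), not the time spent in the state.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes UTC timestamp such as '2024-01-01T00:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_stuck_apps(app_list, now, state="PENDING_RERUN", max_age=timedelta(hours=1)):
    """Return (name, age_in_days) for apps in `state` that are older than `max_age`.

    `app_list` is the parsed JSON from `kubectl get sparkapplication -o json`.
    Assumption: the state lives at .status.applicationState.state.
    """
    stuck = []
    for item in app_list.get("items", []):
        status = item.get("status") or {}
        current = (status.get("applicationState") or {}).get("state")
        if current != state:
            continue
        # Age since creation, which is what the AGE column of kubectl reports.
        age = now - parse_k8s_time(item["metadata"]["creationTimestamp"])
        if age > max_age:
            stuck.append((item["metadata"]["name"], age.days))
    return stuck
```

To use it, the output of `kubectl get sparkapplication -n <> -o json` could be loaded with `json.load` and passed in together with `datetime.now(timezone.utc)`; a non-empty result could then be exported as a custom gauge or turned into an alert by whatever mechanism is already in place.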