Open potlurip opened 8 months ago
We are experiencing this as well. If this is related to https://github.com/kubeflow/spark-operator/issues/1920, does that mean https://github.com/kubeflow/spark-operator/pull/2083 might resolve it?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am experiencing an intermittent issue with the Spark Operator's monitoring feature, specifically when it's configured to expose metrics to Prometheus. Occasionally, the Spark driver fails to start due to a FileNotFoundException related to the Prometheus configuration file. The error indicates that the /etc/metrics/conf/prometheus.yaml file is not found, even though the relevant ConfigMap (spark-pi-test-prom-conf) exists within the cluster. This issue does not occur on every application deployment but happens occasionally.
Environment
Configuration
Monitoring is enabled with the following configuration in the Spark Operator, aiming to expose both driver and executor metrics to Prometheus:
Error Observed
Steps to Reproduce
FileNotFoundException related to the Prometheus configuration.
Expected Behavior
The Spark driver and executor pods should successfully mount the Prometheus configuration from the spark-pi-test-prom-conf ConfigMap, start without errors, and expose metrics to Prometheus on port 8090.
Actual Behavior
Intermittently, the Spark driver pod fails to start due to a FileNotFoundException for /etc/metrics/conf/prometheus.yaml. This suggests that the spark-pi-test-prom-conf ConfigMap is not being consistently mounted to the /etc/metrics/conf directory in the driver pod.
Troubleshooting Steps Undertaken
spark-pi-test-prom-conf ConfigMap in the Kubernetes cluster.
I am seeking insights into resolving this intermittent issue with the Prometheus configuration file not being found in the Spark driver pods. Any guidance on further troubleshooting steps, potential causes, or known solutions to ensure consistent mounting of the Prometheus configuration would be greatly appreciated.
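One further check that might help narrow this down is to inspect the manifest of a failing driver pod and see whether the ConfigMap volume and its mount at /etc/metrics/conf were actually added to the pod spec. The Python sketch below is only an illustration, not part of the operator: the helper name find_configmap_mount and the volume name prometheus-config in the sample are hypothetical, and it assumes the manifest was saved with something like kubectl get pod <driver-pod> -o json and loaded into a dict.

```python
import json

# Names taken from this issue; adjust for other applications.
CONFIGMAP_NAME = "spark-pi-test-prom-conf"
MOUNT_PATH = "/etc/metrics/conf"


def find_configmap_mount(pod, configmap_name=CONFIGMAP_NAME, mount_path=MOUNT_PATH):
    """Return (volume_present, mounted_containers) for a pod manifest dict.

    volume_present is True when the pod spec has a volume backed by the
    ConfigMap; mounted_containers lists containers mounting it at mount_path.
    """
    spec = pod.get("spec", {})
    # Volumes in the pod spec that are backed by the ConfigMap.
    volume_names = {
        v["name"]
        for v in spec.get("volumes", [])
        if (v.get("configMap") or {}).get("name") == configmap_name
    }
    # Containers that mount one of those volumes at the expected path.
    mounted = [
        c["name"]
        for c in spec.get("containers", [])
        if any(
            m.get("name") in volume_names and m.get("mountPath") == mount_path
            for m in c.get("volumeMounts", [])
        )
    ]
    return bool(volume_names), mounted


# Hypothetical, trimmed-down driver pod manifest for illustration only.
sample_pod = json.loads("""
{
  "spec": {
    "volumes": [
      {"name": "prometheus-config",
       "configMap": {"name": "spark-pi-test-prom-conf"}}
    ],
    "containers": [
      {"name": "spark-kubernetes-driver",
       "volumeMounts": [
         {"name": "prometheus-config", "mountPath": "/etc/metrics/conf"}
       ]}
    ]
  }
}
""")

print(find_configmap_mount(sample_pod))
```

If a failing pod reports no volume or an empty container list while a healthy pod reports both, the pod spec was probably created without the mount, rather than the file going missing inside the container at runtime.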
Additional Information
Issue link - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1920