kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Intermittent `FileNotFoundException` for Prometheus Configuration in Spark Driver Pods #1921

Open potlurip opened 8 months ago

potlurip commented 8 months ago

I am experiencing an intermittent issue with the Spark Operator's monitoring feature when it is configured to expose metrics to Prometheus. Occasionally the Spark driver fails to start with a FileNotFoundException for the Prometheus configuration file: /etc/metrics/conf/prometheus.yaml is reported as missing even though the corresponding ConfigMap (spark-pi-test-prom-conf) exists in the cluster. The failure does not happen on every application deployment, only occasionally.

Environment

Configuration

Monitoring is enabled with the following configuration in the Spark Operator, aiming to expose both driver and executor metrics to Prometheus:

monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/prometheus/jmx_prometheus_javaagent.jar"
    port: 8090
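
For context, this monitoring block lives in the SparkApplication spec rather than in the operator's own deployment. A minimal sketch of the manifest we apply is below; the application name, image, main class, and resource settings are illustrative placeholders, and only the monitoring section matches the real configuration:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi-test                # illustrative
  namespace: default                 # illustrative
spec:
  type: Scala
  mode: cluster
  image: <our-spark-image>           # illustrative placeholder
  mainClass: org.apache.spark.examples.SparkPi                               # illustrative
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar   # illustrative
  sparkVersion: "3.1.1"              # illustrative
  driver:
    cores: 1
    memory: 512m
  executor:
    cores: 1
    instances: 1
    memory: 512m
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent.jar"
      port: 8090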

Error Observed

Caused by: java.io.FileNotFoundException: /etc/metrics/conf/prometheus.yaml (No such file or directory)
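
For what it's worth, the exception appears to come from the JMX Prometheus javaagent attached to the driver JVM. As far as I understand the operator's behaviour, enabling monitoring results in a driver JVM option roughly equivalent to the sketch below (the exact property name is my assumption), which is why the driver aborts at startup whenever the config file is not present on disk:

sparkConf:
  # Assumption: illustrative only, showing the effective javaagent argument
  # shape (<jar>=<port>:<config>), not the operator's literal implementation.
  "spark.driver.extraJavaOptions": "-javaagent:/prometheus/jmx_prometheus_javaagent.jar=8090:/etc/metrics/conf/prometheus.yaml"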

Steps to Reproduce

  1. Enable monitoring in the Spark Operator with Prometheus metrics exposure as described above.
  2. Deploy a Spark application managed by the Spark Operator.
  3. Observe the startup logs of the Spark driver pod for the intermittent FileNotFoundException related to the Prometheus configuration.

Expected Behavior

The Spark driver and executor pods should successfully mount the Prometheus configuration from the spark-pi-test-prom-conf ConfigMap, start without errors, and expose metrics to Prometheus on port 8090.

Actual Behavior

Intermittently, the Spark driver pod fails to start due to a FileNotFoundException for /etc/metrics/conf/prometheus.yaml. This suggests that the spark-pi-test-prom-conf ConfigMap is not being consistently mounted to the /etc/metrics/conf directory in the driver pod.
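
For comparison, this is the shape of the volume and mount I would expect to see injected into the driver pod spec when mounting works. The volume and container names are illustrative; the ConfigMap name and mount path are the ones from the failing runs:

# Assumed/expected fragment of the driver pod spec (volume and container
# names are illustrative).
volumes:
  - name: prom-conf-vol
    configMap:
      name: spark-pi-test-prom-conf
containers:
  - name: spark-kubernetes-driver
    volumeMounts:
      - name: prom-conf-vol
        mountPath: /etc/metrics/conf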

Troubleshooting Steps Undertaken

I am seeking help in resolving this intermittent issue where the Prometheus configuration file is not found in the Spark driver pods. Any guidance on further troubleshooting steps, potential causes, or known fixes that would ensure the Prometheus configuration is mounted consistently would be greatly appreciated.

Additional Information

Related issue - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1920

YanivKunda commented 3 months ago

We are experiencing this as well. If this is related to https://github.com/kubeflow/spark-operator/issues/1920, does that mean https://github.com/kubeflow/spark-operator/pull/2083 might resolve it?

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.