infa-madhanb opened this issue 2 years ago
Is there a solution?
Any update?
FailedMount: `MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-a4a28f849e410e3b-conf-map" not found`
This happens with the default settings.
Seeing the exact error mentioned by @Liftingthedata as well, even after attaching the cluster-admin ClusterRole to the ServiceAccount created by the chart installation via a new ClusterRoleBinding and specifying that ServiceAccount in examples/spark-pi.yaml.
Executing `kubectl describe sparkapplication spark-pi --namespace <your ns>` reveals that it is the spark-pi-driver that is failing. Inspecting the spark-pi-driver pod with `kubectl describe pod spark-pi-driver --namespace <your ns>` shows the kubelet MountVolume.SetUp failure message directly after pulling the image `gcr.io/spark-operator/spark:v3.1.1`. Is this perhaps an error within the image being pulled?
Please help!
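For reference, the ClusterRoleBinding described above would look roughly like this; a sketch only, with an illustrative binding name, ServiceAccount name, and namespace (use whatever the chart actually created in your cluster):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-sa-cluster-admin      # illustrative name
subjects:
  - kind: ServiceAccount
    name: spark-operator-spark      # assumption: the ServiceAccount created by the chart
    namespace: default              # assumption: the namespace it was created in
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
```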
I encountered a similar issue. The problem frequently happens when the Spark operator is under-provisioned or under high load.
@pradithya could you please share your node configuration?
Is there any solution for this? I am also facing a similar issue. I am not sure, but I think the driver pod is trying to mount the ConfigMap before it has been created, which is why it reports that the ConfigMap is not found.
Warning FailedMount 66s kubelet MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-f42632859b918eee-conf-map" not found
I began to encounter it constantly once I introduced environment variables in my Scala script (and modified the K8s manifest accordingly). Haven't found out how to solve it yet.
EDIT: quite a late update, but this was mainly memory leaks (insufficient resources) on my side :)
Hi guys, any solution for this issue?
Hi All,
Experienced this when we migrated to the Helm chart installation of the Spark operator; our volumes were mounted correctly via ConfigMaps, but Kubernetes was erroring out.
Make sure you have the following settings enabled in the Helm chart values:

```yaml
webhook:
  # -- Enable webhook server
  enable: true
  namespaceSelector: "spark-webhook-enabled=true"
```
Then label the spark namespace (or the target namespace for your Spark jobs) with `spark-webhook-enabled=true`.
We found that was enough to get it working.
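Labeling the namespace is a one-liner; a sketch assuming the target namespace is `spark`:

```sh
# Add the label that the webhook's namespaceSelector matches on
kubectl label namespace spark spark-webhook-enabled=true
```

Here is an example ScheduledSparkApplication that works for us: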
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
name: example-test
namespace: spark
spec:
schedule: "31 12 * * *"
concurrencyPolicy: Allow
template:
timeToLiveSeconds: 1200
type: Python
arguments:
- --config-file-name=/opt/spark/work-dir/config/config.ini
sparkConf:
spark.kubernetes.decommission.script: "/opt/decom.sh"
. . .
hadoopConf:
fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
. . .
mode: cluster
imagePullPolicy: Always
mainApplicationFile: local:///opt/spark/work-dir/run.py
sparkVersion: "3.2.1"
restartPolicy:
type: Never
driver:
cores: 1
coreLimit: "500m"
memory: "500m"
labels:
version: 3.2.1
serviceAccount: job-tole
volumeMounts:
- name: "config"
mountPath: "/opt/spark/work-dir/config"
executor:
cores: 1
instances: 1
memory: "500m"
labels:
version: 3.2.1
volumeMounts:
- name: "config"
mountPath: "/opt/spark/work-dir/config"
volumes:
- name: "config"
configMap:
name: "config"
items:
- key: "config.ini"
path: "config.ini"
Please note: if you don't use the Helm chart you still need to enable the webhook, otherwise the spark-operator won't be able to create the right ConfigMaps and volume mounts on the driver and executor pods when they are spawned.
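With the chart, that just means enabling it in the values (as shown above) or at install time; a sketch assuming the release is named `spark-operator` (the chart reference and namespace are whatever you use):

```sh
helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --set webhook.enable=true
```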
I have enabled the webhook, using the namespaceSelector with the correct selectors, and I'm still having the issue. Any idea? I am using the latest version. I have also tried enabling the webhook on all namespaces, but I still face the same issue. I am also unable to use tolerations.
Does not work for me either. I followed the steps mentioned above but still get the same error.
> I have enabled the webhook, using the namespaceSelector with the correct selectors, and still having the issue. [...] I am unable to use tolerations as well.

Did you manage to solve this issue?
The spark-pi job just hangs, either before the driver is initialized or after the driver starts running. I do see the ConfigMap mount error in the driver's events, but the ConfigMap does get created afterwards. Is this a resource problem? I'm running this on minikube with 4 CPUs and 8 GB of memory!
My issue was the webhook port. For some reason it no longer runs on the default port, so I had to update the port to 443 based on the docs here, even though I'm on EKS instead of GKE.
> My issue was the webhook port. For some reason it no longer runs on the default port, so I had to update the port to 443 [...]
Thank you. I am using AKS and it worked for me as well.
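For reference, the port change mentioned above is just another chart value; a sketch matching the webhook block of the values.yaml shown further down this thread:

```yaml
webhook:
  enable: true
  port: 443
```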
I've done everything mentioned here with no success
The webhook only adds volumes if the driver/executor has a volumeMount for them: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/pkg/webhook/patch.go#L138-L143. The same goes for configMaps: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/pkg/webhook/patch.go#L335C1-L339
The code doesn't check whether a driver/executor initContainer or sidecar mounts the volumes. As a workaround, you just have to add the volumeMounts directly to the driver/executor spec as well, as in the sketch below.
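A minimal sketch of that workaround, assuming the volume is only consumed by an initContainer; all names here are illustrative:

```yaml
spec:
  volumes:
    - name: shared-config              # illustrative volume backed by a ConfigMap
      configMap:
        name: my-app-config            # illustrative ConfigMap name
  driver:
    initContainers:
      - name: fetch-config             # the container that actually needs the volume
        image: busybox:1.36
        command: ["sh", "-c", "cp /config/* /tmp/"]
        volumeMounts:
          - name: shared-config
            mountPath: /config
    volumeMounts:                      # workaround: also declare the mount on the driver spec itself
      - name: shared-config
        mountPath: /config
```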
Bump, is there any other possible solution? I have tried all of the above with no success. Sharing my configuration in case it helps. Using Helm chart 1.1.27 and v1beta2-1.3.8-3.1.1.
values.yaml
```yaml
# https://github.com/kubeflow/spark-operator/tree/master/charts/spark-operator-chart
nameOverride: spark-operator
fullnameOverride: spark-operator
image:
  # -- Image repository
  repository: ghcr.io/googlecloudplatform/spark-operator
  # -- Image pull policy
  pullPolicy: IfNotPresent
  # -- if set, override the image tag whose default is the chart appVersion.
  tag: "v1beta2-1.3.8-3.1.1"
imagePullSecrets:
  - name: regcred
sparkJobNamespace: spark-operator
resources:
  limits:
    cpu: 1
    memory: 512Mi
  requests:
    cpu: 1
    memory: 512Mi
webhook:
  enable: true
  port: 443
  namespaceSelector: "spark-webhook-enabled=true"
```
SparkApplication manifest
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-pi2
namespace: spark-operator
spec:
type: Scala
mode: cluster
image: "apache/spark:3.4.2"
imagePullPolicy: IfNotPresent
imagePullSecrets:
- regcred
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.4.2.jar"
sparkVersion: "3.4.2"
timeToLiveSeconds: 600
restartPolicy:
type: Never
volumes:
- name: config-vol
configMap:
name: cm-spark-extra
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 3.4.2
serviceAccount: airflow-next
volumeMounts:
- name: config-vol
mountPath: /mnt/cm-spark-extra
executor:
cores: 1
instances: 1
memory: "512m"
labels:
version: 3.4.2
volumeMounts:
- name: config-vol
mountPath: /mnt/cm-spark-extra
Here is the container and volume spec of the pod being spun up
```yaml
spec:
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
        defaultMode: 420
    - name: spark-local-dir-1
      emptyDir: {}
    - name: spark-conf-volume-driver
      configMap:
        name: spark-drv-b14d8f8f2b497a58-conf-map
        items:
          - key: spark.properties
            path: spark.properties
            mode: 420
        defaultMode: 420
    - name: kube-api-access-tf4pb
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: spark-kubernetes-driver
      image: apache/spark:3.4.2
      args:
        - driver
        - '--properties-file'
        - /opt/spark/conf/spark.properties
        - '--class'
        - org.apache.spark.examples.SparkPi
        - local:///opt/spark/examples/jars/spark-examples_2.12-3.4.2.jar
      ports:
        - name: driver-rpc-port
          containerPort: 7078
          protocol: TCP
        - name: blockmanager
          containerPort: 7079
          protocol: TCP
        - name: spark-ui
          containerPort: 4040
          protocol: TCP
      env:
        - name: SPARK_USER
          value: root
        - name: SPARK_APPLICATION_ID
          value: spark-2d80cebdab33400b83cbfe61fd09faee
        - name: SPARK_DRIVER_BIND_ADDRESS
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: SPARK_LOCAL_DIRS
          value: /var/data/spark-1505b0f4-c95a-4bba-aa76-9451b34af9ea
        - name: SPARK_CONF_DIR
          value: /opt/spark/conf
        - name: AWS_STS_REGIONAL_ENDPOINTS
          value: regional
        - name: AWS_DEFAULT_REGION
          value: us-east-1
        - name: AWS_REGION
          value: us-east-1
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::123456:role/my-sa
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      resources:
        limits:
          cpu: 1200m
          memory: 896Mi
        requests:
          cpu: '1'
          memory: 896Mi
      volumeMounts:
        - name: spark-local-dir-1
          mountPath: /var/data/spark-1505b0f4-c95a-4bba-aa76-9451b34af9ea
        - name: spark-conf-volume-driver
          mountPath: /opt/spark/conf
        - name: kube-api-access-tf4pb
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        - name: aws-iam-token
          readOnly: true
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
```
I am at a loss here. `spark-conf-volume-driver` is being set up as `spark-drv-b14d8f8f2b497a58-conf-map`, and I can see that ConfigMap in my cluster. I am on EKS and set `webhook.enable` to true and the port to 443. I also applied the workaround and configured a volumeMount under both the driver and the executor, although I do not see it on the pod; I don't think that matters, though? As far as I know, everything is configured correctly. Can somebody help?
> Bump, is there any other possible solution? I have tried all the above with no success. [...] As far as I know, everything is configured correctly. Is there somebody that can help?
I'm having the same issue on OCI. I've followed the same steps.
I also noticed this problem with a very limited CPU (`resources.limits.cpu: "100m"`) and two concurrent Spark apps. It was very consistent, and in that case the ConfigMap for the driver was created for only one of the apps. After updating the resources (requests to 1 CPU, and no limit) this odd behavior disappeared.
This still happens; when submitting a high number of applications via spark-submit it sometimes occurs.
@dannyeuu Try increasing the Spark operator's CPU request/limit (see the sketch below). I encountered this issue when the operator was experiencing high utilization/throttling.
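Something along these lines in the chart's `resources` values; a sketch only, the exact numbers depend on your load:

```yaml
resources:
  requests:
    cpu: 1
    memory: 1Gi
  limits:
    memory: 1Gi   # leaving the CPU limit off avoids throttling the operator
```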
Curious if this shares the root cause with another issue I saw. Does anyone see client-side throttling logs for the operator? Should look something like this:
```
Waited for ... due to client-side throttling, not priority and fairness ...
```
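A quick way to check, assuming the operator runs as a Deployment named `spark-operator` in the `spark-operator` namespace (adjust to your install):

```sh
kubectl logs deploy/spark-operator -n spark-operator | grep -i "client-side throttling"
```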
I faced this issue when using an initContainer. If I don't use an initContainer, there is no error.
In my case the ConfigMap was created, but the pod was crashing with `MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-###############-conf-map" not found`.
I looked at the pod's logs before it failed and saw that it was crashing for another, unrelated reason.
After fixing that issue, the ConfigMap error went away.
I'm facing a similar issue. The ConfigMap is created, but I get this error: `-poc--bbe7876b-cbcu 3m8s Warning FailedMount pod/structured-streaming-313-1727908779-driver MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-7b6887924f646fe0-conf-map" no`
I'm on version 1.1.27 (Spark 3.1.1). Any solutions for this?
Any update?
What I did: we launch all our tasks via the Airflow SparkKubernetesOperator. I created a pool for all Spark-on-K8s tasks with 2-3 slots and added an extra sleep of 40 seconds after the SparkApplication resource is created. That helped; still no error after 2 weeks.
Our GKE cluster is running Kubernetes v1.21.14. Pods were running fine until yesterday; now ConfigMaps and volumes are not getting mounted.
Deployment Mode: Helm Chart
Helm Chart Version: 1.1.0
Image: v1beta2-1.2.3-3.1.1
Kubernetes Version: 1.21.14
Helm command to install: `helm install spark-operator --namespace *** --set image.tag=v1beta2-1.2.3-3.1.1 --set webhook.enable=true -f values.yaml`
The Spark operator pod starts successfully after the webhook-init pod completes, but my application pod is unable to come up due to the error below:
```
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    49m                default-scheduler  Successfully assigned pod/re-driver to gke...w--taints-6656b326-49of
  Warning  FailedMount  49m                kubelet            MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-8c0f12839ca69805-conf-map" not found
  Warning  FailedMount  27m (x3 over 40m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[re-checkpoint], unattached volumes=[spark-conf-volume-driver kube-api-access-lsflz re-checkpoint app-conf-vol cert-secret-volume spark-local-dir-1]: timed out waiting for the condition
  Warning  FailedMount  20m (x2 over 45m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[re-checkpoint], unattached volumes=[kube-api-access-lsflz re-checkpoint app-conf-vol cert-secret-volume spark-local-dir-1 spark-conf-volume-driver]: timed out waiting for the condition
```