kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Toleration is not passing to Driver and Executor pods #1879

Closed: amiyajena1993 closed this issue 1 month ago

amiyajena1993 commented 12 months ago

Spark operator image version: v1beta2-1.3.7-3.1.1
Kubernetes version: 1.26

Webhook section of values.yaml:

  # -- Enable webhook server
  enable: true
  # -- Webhook service port
  port: 443
  # -- The webhook server will only operate on namespaces with this label, specified in the form key1=value1,key2=value2.
  # Empty string (default) will operate on all namespaces
  namespaceSelector: app=ns-rm-dp-spark-operator ,app=ns-rm-dp-spark-jobs
    #namespaceSelector: "spark-webhook-enabled=true"
  # -- The annotations applied to the cleanup job, required for helm lifecycle hooks
  cleanupAnnotations:
    "helm.sh/hook": pre-delete, pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded

Hello team, I have enabled the webhook in my Spark operator deployment, but the tolerations are still not applied to my driver and executor pods.

Spark job YAML:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: sparkpi-test9227
  namespace: ns-rm-dp-spark-jobs
spec:
  type: Scala
  mode: cluster
  image: " "
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  arguments:
  - "30000"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "4000m"
    memory: "4g"
    labels:
      version: 3.1.1
    serviceAccount: svc-rm-dp-spark
    tolerations:
    - effect: NoSchedule
      key: pfamily
      operator: Equal
      value: "aiplatform"
  executor:
    tolerations:
        - key: "pfamily"
          operator: "Equal"
          value: "aiplatform"
          effect: "NoSchedule"
    cores: 1
    coreLimit: "4000m"
    instances: 3
    memory: "4g"
    labels:
      version: 3.1.1
  sparkConf:
   "spark.executor.extraJavaOptions": "-Djava.net.preferIPv6Addresses=true -Dlog4j.debug=true -Dcom.amazonaws.sdk.disableCertChecking=true"
   "spark.driver.extraJavaOptions": "-Djava.net.preferIPv6Addresses=true -Dlog4j.debug=true -Dcom.amazonaws.sdk.disableCertChecking=true"

Issue: The driver pod stays in the Pending state because the tolerations are missing.

Error: 0/50 nodes are available: 1 node(s) were unschedulable, 10 node(s) had untolerated taint {pfamily: symops}, 13 node(s) had untolerated taint {pfamily: aiplatform}, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 6 node(s) had untolerated taint {pfamily: penableric}, 9 node(s) had untolerated taint {pfamily: symplanbizsec}, 9 node(s) had untolerated taint {pfamily: symplatform}. preemption: 0/50 nodes are available: 50 Preemption is not helpful for scheduling..

piermotte commented 10 months ago

I have the same problem: no tolerations on the driver pod.

zy-wiser commented 10 months ago

I see the same issue: the tolerations show up when I describe the SparkApplication, but the spark-submit command ignores them, so they are not attached to the driver or executor pods.

ahululu commented 8 months ago

I have the same problem too.

biljicmarko commented 7 months ago

Did you find any workaround for this problem?

pasdoy commented 6 months ago

I ran into the same issue and enabling the webhooks fixed the problem.

values: {
  webhook: {
    enable: true,
  },
}

CHART spark-operator-1.2.14
APP VERSION v1beta2-1.4.5-3.5.0

imtzer commented 5 months ago

As @pasdoy mentioned, tolerations are patched onto the pods by the webhook, so enable it in your values.yaml. This issue can be closed.
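
For anyone landing here later, a minimal sketch of the relevant Helm values, using only the keys quoted earlier in this thread (`webhook.enable`, `webhook.port`, `webhook.namespaceSelector`). The label `spark-webhook-enabled=true` comes from the commented-out line in the original post and is an assumption about how the job namespace is labeled. Key names may differ in other chart versions.

```yaml
# Sketch of values.yaml for the spark-operator chart (keys as quoted in this thread).
webhook:
  # The webhook has to be enabled for tolerations to be patched onto the pods.
  enable: true
  port: 443
  # Assumption: the namespace that holds the SparkApplication carries this label.
  # An empty string (the default) makes the webhook operate on all namespaces.
  namespaceSelector: "spark-webhook-enabled=true"
```

If a selector is set, the namespace that holds the SparkApplication (ns-rm-dp-spark-jobs in the original post) needs to carry the matching label, since the chart comment says the webhook only operates on namespaces with that label.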

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

justas200 commented 2 months ago

Jesus, why is this information not in the documentation??

ChenYi015 commented 1 month ago

> Jesus, why is this information not in the documentation??

@justas200 Sorry about the docs. As I recall, though, they do include a note reminding users to enable the webhook when using tolerations.