Executor pods not starting while submitting spark application from operator

sp-matrix commented 2 years ago

Hello,

The operator was installed in our OpenShift cluster (organization). When we submitted the example Spark application (spark-examples_2.11-2.4.5.jar) through the operator, the submitter pod and the driver pod were created, but the executor pod was not created and failed with the error below.

INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/nm/pods. Message: Pod "my-spark-app-1653416287945-exec-4" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.containers[0].resources.requests, message=Invalid value: "1": must be less than or equal to cpu limit, reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=my-spark-app-1653416287945-exec-4, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "my-spark-app-1653416287945-exec-4" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure

After going through the documentation, we tried limiting the CPU and cores, but that did not resolve the issue.

SparkApplication yaml:

    ...
    ...
    spec:
      driver:
        coreLimit: 500m
        cores: 0.2
      executor:
        coreLimit: 1000m
        coreRequest: 0.5
        cores: 1  # we can't give below 1 (float values), when we skip this parameter the default value 1 is assigned.
        cpuLimit: 1000m
        instances: 1
    ...

In configmap of the driver:

    spark.executor.memory=512m
    spark.driver.blockManager.port=7079
    spark.ui.reverseProxy=true
    spark.executorEnv.APPLICATION_NAME=my-spark-app
    spark.kubernetes.container.image=quay.io/radanalyticsio/openshift-spark\:2.4-latest
    spark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar
    spark.ui.reverseProxyUrl=/
    spark.kubernetes.driver.limit.cores=500m
    spark.kubernetes.submitInDriver=true
    spark.driver.memory=512m
    spark.submit.deployMode=cluster
    spark.kubernetes.driverEnv.APPLICATION_NAME=my-spark-app
    spark.kubernetes.executor.label.radanalytics.io/SparkApplication=my-spark-app
    spark.kubernetes.driver.label.radanalytics.io/SparkApplication=my-spark-app
    spark.executor.cores=1
    spark.kubernetes.authenticate.driver.serviceAccountName=spark-operator
    spark.jars.ivy=/tmp/.ivy2
    spark.kubernetes.driver.pod.name=my-spark-app-1653416287945-driver
    spark.executor.instances=1
    spark.kubernetes.namespace=nm-np
    spark.app.id=spark-d7cd179d47a047dd9d11811a99e1060c
    spark.app.name=my-spark-app
    spark.kubernetes.driver.label.version=2.3.0
    spark.driver.cores=0.2
    spark.driver.port=7078

Note: we tried starting the service 'cluster-limreq' with a CPU limit of 4 for the executors, as mentioned in the README, but the issue was not resolved.

Resource Quota of my namespace allocated by my cluster manager

    Name:       core-resource-limits-hermes
    Namespace:  nm-np

    Type       Resource  Min   Max  Default Request  Default Limit  Max Limit/Request Ratio
    Container  cpu       25m   4    150m             400m           -
    Container  memory    25Mi  4Gi  256Mi            512Mi          -
    Pod        cpu       25m   4    -                -              -
    Pod        memory    25Mi  4Gi  -                -              -

Can anyone please help us find the reason why the executor pod is not starting, even though we request only 1 CPU from the allocated quota of 4 CPUs?
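One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: if the executor container is created with a CPU request of 1 (from spark.executor.cores=1) and no explicit CPU limit, the limit range's Default Limit of 400m for Container cpu would be applied, and a request of 1000m against a limit of 400m fails this kind of validation. The helper names below (to_millicores, request_is_valid) are hypothetical and only model the comparison; they are not part of Kubernetes or the operator.

```python
from fractions import Fraction


def to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "1", "0.5" or "400m" to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        return int(q[:-1])
    # Fraction avoids float rounding for values like "0.2".
    return int(Fraction(q) * 1000)


def request_is_valid(request: str, explicit_limit, default_limit: str) -> bool:
    """Model the rule "request must be less than or equal to cpu limit".

    Assumption: when the container has no explicit limit, the namespace
    default limit is used in its place.
    """
    limit = explicit_limit if explicit_limit is not None else default_limit
    return to_millicores(request) <= to_millicores(limit)


# "Default Limit" for Container cpu, taken from the table above.
DEFAULT_CPU_LIMIT = "400m"

print(request_is_valid("1", None, DEFAULT_CPU_LIMIT))     # False: 1000m > 400m
print(request_is_valid("400m", None, DEFAULT_CPU_LIMIT))  # True: 400m <= 400m
print(request_is_valid("1", "1000m", DEFAULT_CPU_LIMIT))  # True: explicit limit wins
```

Under this reading, the quota maximum of 4 CPUs is not the constraint that is being hit; the per-container default limit is.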

Thanks!

elmiko commented 2 years ago

this looks like kubernetes is complaining about your Pod spec for the application. do you have both requests and limits defined in the pod.spec.containers[0].resources ?

could you share the Pod or Deployment spec you are using?

sp-matrix commented 2 years ago

Thank you @elmiko for the reply.

I open the operator in OC and then submit the application in the UI. Below is the complete .yaml configuration I use when submitting.

    apiVersion: radanalytics.io/v1
    kind: SparkApplication
    metadata:
      name: my-spark-app
      namespace: xxx-np
    spec:
      mainApplicationFile: 'local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar'
      mainClass: org.apache.spark.examples.SparkPi
      driver:
        cores: 0.2
        coreLimit: 500m
      executor:
        coreLimit: 1000m
        coreRequest: 0.5
        cores: 1
        cpuLimit: 1000m
        instances: 1

I am not using the requests and limits options in the .yaml.

elmiko commented 2 years ago

it's possible we have a bug in that logic for the requests/limits

sp-matrix commented 2 years ago

@elmiko Is there any workaround ?

elmiko commented 2 years ago

i need to see more details about the Pod and Deployments that are being created, knowing the SparkApplication yaml is not giving enough detail as it looks like the kubernetes API does not like the Pod definition.

would it be possible to share those records? (and please keep the formatting)

sp-matrix commented 2 years ago

@elmiko: An update on the issue. I tried running a Spark application in the same namespace via spark-submit. When 'executor.cores' is <= 0.4, it runs; with any value above 0.4 CPU for the executors, it fails. With the operator, even if I set cpuRequest to 0.4 or any other value below 1, the executor cores setting falls back to the default value of 1. I can pass any value greater than 1, but since the available CPU is 0.4, it fails.
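For direct spark-submit runs, Spark 2.4 on Kubernetes documents `spark.kubernetes.executor.request.cores` (the executor pod's CPU request, which takes precedence over `spark.executor.cores` for that purpose) and `spark.kubernetes.executor.limit.cores` (the pod's CPU limit). The sketch below only assembles such `--conf` arguments; the helper name `executor_cpu_conf` and the 400m values are illustrative assumptions, and nothing in this thread establishes that the operator forwards these settings.

```python
def executor_cpu_conf(cores: int, request: str, limit: str) -> list:
    """Build spark-submit --conf arguments for executor CPU settings.

    `cores` stays an integer task-slot count, while `request` and `limit`
    carry the Kubernetes CPU quantities for the executor pod.
    """
    settings = {
        "spark.executor.cores": str(cores),
        "spark.kubernetes.executor.request.cores": request,
        "spark.kubernetes.executor.limit.cores": limit,
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values chosen to stay within a 400m per-container limit.
print(" ".join(executor_cpu_conf(1, "400m", "400m")))
```

Keeping the request and the limit as separate settings means the integer `spark.executor.cores` value no longer has to double as the pod's CPU request.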

Operator Deployment yaml:

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      annotations:
        deployment.kubernetes.io/revision: '1'
      resourceVersion: '1876495850'
      name: spark-operator
      uid: 503bc9cd-ebdd-45c1-b575-e02fc04b8163
      creationTimestamp: '2022-05-17T22:33:56Z'
      generation: 2
      managedFields:

        - manager: olm
          operation: Update
          apiVersion: apps/v1
          time: '2022-05-17T22:33:57Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:metadata':
              'f:labels':
                .: {}
                'f:olm.deployment-spec-hash': {}
                'f:olm.owner': {}
                'f:olm.owner.kind': {}
                'f:olm.owner.namespace': {}
                'f:operators.coreos.com/radanalytics-spark.agl-np': {}
              'f:ownerReferences':
                .: {}
                'k:{"uid":"0dbf1c06-ac44-4337-836a-68edb418f7d4"}':
                  .: {}
                  'f:apiVersion': {}
                  'f:blockOwnerDeletion': {}
                  'f:controller': {}
                  'f:kind': {}
                  'f:name': {}
                  'f:uid': {}
            'f:spec':
              'f:progressDeadlineSeconds': {}
              'f:replicas': {}
              'f:revisionHistoryLimit': {}
              'f:selector': {}
              'f:strategy':
                'f:rollingUpdate':
                  .: {}
                  'f:maxSurge': {}
                  'f:maxUnavailable': {}
                'f:type': {}
              'f:template':
                'f:metadata':
                  'f:annotations':
                    'f:olm.operatorNamespace': {}
                    'f:olm.properties': {}
                    'f:createdAt': {}
                    'f:alm-examples': {}
                    'f:description': {}
                    'f:olm.operatorGroup': {}
                    'f:capabilities': {}
                    .: {}
                    'f:containerImage': {}
                    'f:categories': {}
                    'f:certified': {}
                    'f:operatorframework.io/properties': {}
                    'f:support': {}
                    'f:olm.targetNamespaces': {}
                    'f:repository': {}
                  'f:labels':
                    .: {}
                    'f:app.kubernetes.io/name': {}
                'f:spec':
                  'f:containers':
                    'k:{"name":"spark-operator"}':
                      .: {}
                      'f:env':
                        .: {}
                        'k:{"name":"HTTPS_PROXY"}':
                          .: {}
                          'f:name': {}
                          'f:value': {}
                        'k:{"name":"HTTP_PROXY"}':
                          .: {}
                          'f:name': {}
                          'f:value': {}
                        'k:{"name":"NO_PROXY"}':
                          .: {}
                          'f:name': {}
                          'f:value': {}
                        'k:{"name":"OPERATOR_CONDITION_NAME"}':
                          .: {}
                          'f:name': {}
                          'f:value': {}
                        'k:{"name":"WATCH_NAMESPACE"}':
                          .: {}
                          'f:name': {}
                          'f:valueFrom':
                            .: {}
                            'f:fieldRef':
                              .: {}
                              'f:apiVersion': {}
                              'f:fieldPath': {}
                      'f:image': {}
                      'f:imagePullPolicy': {}
                      'f:name': {}
                      'f:resources': {}
                      'f:terminationMessagePath': {}
                      'f:terminationMessagePolicy': {}
                  'f:dnsPolicy': {}
                  'f:restartPolicy': {}
                  'f:schedulerName': {}
                  'f:securityContext': {}
                  'f:serviceAccount': {}
                  'f:serviceAccountName': {}
                  'f:terminationGracePeriodSeconds': {}
        - manager: kube-controller-manager
          operation: Update
          apiVersion: apps/v1
          time: '2022-05-17T22:34:26Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:metadata':
              'f:annotations':
                .: {}
                'f:deployment.kubernetes.io/revision': {}
            'f:status':
              'f:availableReplicas': {}
              'f:conditions':
                .: {}
                'k:{"type":"Available"}':
                  .: {}
                  'f:lastTransitionTime': {}
                  'f:lastUpdateTime': {}
                  'f:message': {}
                  'f:reason': {}
                  'f:status': {}
                  'f:type': {}
                'k:{"type":"Progressing"}':
                  .: {}
                  'f:lastTransitionTime': {}
                  'f:lastUpdateTime': {}
                  'f:message': {}
                  'f:reason': {}
                  'f:status': {}
                  'f:type': {}
              'f:observedGeneration': {}
              'f:readyReplicas': {}
              'f:replicas': {}
              'f:updatedReplicas': {}
      namespace: agl-np
      ownerReferences:
        - apiVersion: operators.coreos.com/v1alpha1
          kind: ClusterServiceVersion
          name: sparkoperator.v1.1.0
          uid: 0dbf1c06-ac44-4337-836a-68edb418f7d4
          controller: false
          blockOwnerDeletion: false
      labels:
        olm.deployment-spec-hash: d48ccd4b6
        olm.owner: sparkoperator.v1.1.0
        olm.owner.kind: ClusterServiceVersion
        olm.owner.namespace: xxx-np
        operators.coreos.com/radanalytics-spark.agl-np: ''
    spec:
      replicas: 1
      selector:
        matchLabels:
          app.kubernetes.io/name: radanalytics-spark-operator
      template:
        metadata:
          creationTimestamp: null
          labels:
            app.kubernetes.io/name: radanalytics-spark-operator
          annotations:
            certified: 'false'
            olm.targetNamespaces: xxx-np
            operatorframework.io/properties: >-
              {"properties":[{"type":"olm.gvk","value":{"group":"radanalytics.io","kind":"SparkApplication","version":"v1"}},{"type":"olm.gvk","value":{"group":"radanalytics.io","kind":"SparkCluster","version":"v1"}},{"type":"olm.gvk","value":{"group":"radanalytics.io","kind":"SparkHistoryServer","version":"v1"}},{"type":"olm.maxOpenShiftVersion","value":"4.8"},{"type":"olm.package","value":{"packageName":"radanalytics-spark","version":"1.1.0"}}]}
            repository: 'https://github.com/radanalyticsio/spark-operator'
            support: jkremser@redhat.com
            olm.properties: '[{"type": "olm.maxOpenShiftVersion", "value": "4.8"}]'
            alm-examples: |-
              [
                {
                  "apiVersion": "radanalytics.io/v1",
                  "kind": "SparkCluster",
                  "metadata": { "name": "my-spark-cluster" },
                  "spec": {
                    "worker": { "instances": "2" },
                    "master": { "instances": "1" }
                  }
                },
                {
                  "apiVersion": "radanalytics.io/v1",
                  "kind": "SparkApplication",
                  "metadata": { "name": "my-spark-app" },
                  "spec": {
                    "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar",
                    "mainClass": "org.apache.spark.examples.SparkPi",
                    "driver": { "cores": 0.2, "coreLimit": "500m" },
                    "executor": { "instances": 2, "cores": 1, "coreLimit": "1000m" }
                  }
                },
                {
                  "apiVersion": "radanalytics.io/v1",
                  "kind": "SparkHistoryServer",
                  "metadata": { "name": "my-history-server" },
                  "spec": {
                    "type": "remoteStorage",
                    "expose": true,
                    "logDirectory": "s3a://my-history-server/",
                    "updateInterval": 10,
                    "retainedApplications": 50,
                    "customImage": "quay.io/jkremser/openshift-spark:2.4.0-aws",
                    "sparkConfiguration": [
                      { "name": "spark.hadoop.fs.s3a.impl", "value": "org.apache.hadoop.fs.s3a.S3AFileSystem" },
                      { "name": "spark.hadoop.fs.s3a.access.key", "value": "foo" },
                      { "name": "spark.hadoop.fs.s3a.secret.key", "value": "bar" },
                      { "name": "spark.hadoop.fs.s3a.endpoint", "value": "http://ceph-nano-0:8000" }
                    ]
                  }
                }
              ]
            capabilities: Deep Insights
            olm.operatorNamespace: agl-np
            containerImage: 'quay.io/radanalyticsio/spark-operator:1.1.0'
            createdAt: '2019-01-17 12:00:00'
            categories: Big Data
            description: >-
              An operator for managing the Apache Spark clusters and intelligent
              applications that spawn those clusters.
            olm.operatorGroup: radanalytics-spark
        spec:
          containers:
            - name: spark-operator
              image: 'quay.io/radanalyticsio/spark-operator:1.1.0'
              env:
                - name: WATCH_NAMESPACE
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: 'metadata.annotations[''olm.targetNamespaces'']'
                - name: OPERATOR_CONDITION_NAME
                  value: sparkoperator.v1.1.0
              resources: {}
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              imagePullPolicy: Always
          restartPolicy: Always
          terminationGracePeriodSeconds: 5
          dnsPolicy: ClusterFirst
          serviceAccountName: spark-crd-operator
          serviceAccount: spark-crd-operator
          securityContext: {}
          schedulerName: default-scheduler
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 25%
          maxSurge: 25%
      revisionHistoryLimit: 1
      progressDeadlineSeconds: 600
    status:
      observedGeneration: 2
      replicas: 1
      updatedReplicas: 1
      readyReplicas: 1
      availableReplicas: 1
      conditions:
        - type: Available
          status: 'True'
          lastUpdateTime: '2022-05-17T22:34:26Z'
          lastTransitionTime: '2022-05-17T22:34:26Z'
          reason: MinimumReplicasAvailable
          message: Deployment has minimum availability.
        - type: Progressing
          status: 'True'
          lastUpdateTime: '2022-05-17T22:34:26Z'
          lastTransitionTime: '2022-05-17T22:33:56Z'
          reason: NewReplicaSetAvailable
          message: ReplicaSet "spark-operator-97bdcc5fc" has successfully progressed.

Submitter POD's Log after submitting the application

    cmd: $SPARK_HOME/bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT \
      --conf spark.kubernetes.namespace=xxx-np \
      --deploy-mode cluster \
      --conf spark.app.name=my-spark-app \
      --conf spark.kubernetes.container.image=quay.io/radanalyticsio/openshift-spark:2.4-latest \
      --conf spark.kubernetes.submission.waitAppCompletion=false \
      --conf spark.driver.cores=0.5 \
      --conf spark.kubernetes.driver.limit.cores=500m \
      --conf spark.driver.memory=512m \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-operator \
      --conf spark.kubernetes.driver.label.version=2.3.0 \
      --conf spark.kubernetes.driver.label.radanalytics.io/SparkApplication=my-spark-app \
      --conf spark.kubernetes.executor.label.radanalytics.io/SparkApplication=my-spark-app \
      --conf spark.kubernetes.driverEnv.APPLICATION_NAME=my-spark-app \
      --conf spark.executorEnv.APPLICATION_NAME=my-spark-app \
      --conf spark.executor.instances=1 \
      --conf spark.executor.cores=1 \
      --conf spark.executor.memory=512m \
      --conf spark.jars.ivy=/tmp/.ivy2 \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar && echo -e
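The logged command can be inspected for the executor CPU settings it actually carries. The sketch below parses `--conf key=value` pairs from an abbreviated copy of that command and reports three keys of interest; the helper name `conf_from_command` is hypothetical. In the command as logged, `spark.executor.cores=1` is present, while no executor request or limit setting appears, even though the SparkApplication yaml sets coreLimit: 1000m for the executor.

```python
import shlex


def conf_from_command(cmd: str) -> dict:
    """Collect the key=value pairs that follow each --conf flag."""
    tokens = shlex.split(cmd)
    conf = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "--conf" and "=" in value:
            key, _, val = value.partition("=")
            conf[key] = val
    return conf


# Abbreviated copy of the logged command; only CPU-related flags are kept.
cmd = (
    "spark-submit --class org.apache.spark.examples.SparkPi "
    "--conf spark.driver.cores=0.5 "
    "--conf spark.kubernetes.driver.limit.cores=500m "
    "--conf spark.executor.instances=1 "
    "--conf spark.executor.cores=1 "
    "--conf spark.executor.memory=512m"
)

conf = conf_from_command(cmd)
for key in (
    "spark.executor.cores",
    "spark.kubernetes.executor.request.cores",
    "spark.kubernetes.executor.limit.cores",
):
    print(key, "=", conf.get(key, "<not set>"))
```

Comparing this list with the SparkApplication spec is a quick way to see which fields reach spark-submit and which do not.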

Thanks !

elmiko commented 2 years ago

thanks for the report back @sp-matrix. it sounds like we might have a bug in the operator around converting the cpu cores value properly. i haven't looked at this code in a while so i can't predict any sort of fix, but it certainly sounds like we are not converting the value properly.

radanalyticsio / spark-operator

Executor pods not starting while submitting spark application from operator #352