kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[BUG] Pod creation fails on submission with invalid resource quantities #2199

Open Cian911 opened 2 days ago

Cian911 commented 2 days ago

Description

I've been scratching my head on this one for the past few days, without any resolution.

I am in the process of testing a migration of the spark operator from spark-operator-chart-1.4.6 to v2.0.1 and have come across the following issue. It seems that submission fails at the point where it tries to create the driver pod, with the following error about resource quantities:

Failure executing: POST at: https://127.0.01:443/api/v1/namespaces/spark-operator/pods.
      Message: Pod in version \"v1\" cannot be handled as a Pod: quantities must match
      the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'.

Below is the full error log.

status:
  applicationState:
    errorMessage: "failed to run spark-submit: failed to run spark-submit: 24/09/27
      14:55:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your
      platform... using builtin-java classes where applicable\n24/09/27 14:55:49 INFO
      SparkKubernetesClientFactory: Auto-configuring K8S client using current context
      from users K8S config file\n24/09/27 14:55:50 INFO KerberosConfDriverFeatureStep:
      You have not specified a krb5.conf file locally or via a ConfigMap. Make sure
      that you have the krb5.conf locally on the driver image.\n24/09/27 14:55:50
      ERROR Client: Please check \"kubectl auth can-i create pod\" first. It should
      be yes.\nException in thread \"main\" io.fabric8.kubernetes.client.KubernetesClientException:
      Failure executing: POST at: https://127.0.01:443/api/v1/namespaces/spark-operator/pods.
      Message: Pod in version \"v1\" cannot be handled as a Pod: quantities must match
      the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'. Received
      status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=Pod
      in version \"v1\" cannot be handled as a Pod: quantities must match the regular
      expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', metadata=ListMeta(_continue=null,
      remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}),
      reason=BadRequest, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:518)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:535)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:703)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:92)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1108)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:92)\n\tat
      org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:153)\n\tat
      org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$6(KubernetesClientApplication.scala:256)\n\tat
      org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$6$adapted(KubernetesClientApplication.scala:250)\n\tat
      org.apache.spark.util.SparkErrorUtils.tryWithResource(SparkErrorUtils.scala:48)\n\tat
      org.apache.spark.util.SparkErrorUtils.tryWithResource$(SparkErrorUtils.scala:46)\n\tat
      org.apache.spark.util.Utils$.tryWithResource(Utils.scala:94)\n\tat org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:250)\n\tat
      org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:223)\n\tat
      org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)\n\tat
      org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)\n\tat
      org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)\n\tat org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)\n\tat
      org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)\n\tat
      org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)\n\tat org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)\nCaused
      by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
      POST at: https://127.0.0.1:443/api/v1/namespaces/spark-operator/pods.
      Message: Pod in version \"v1\" cannot be handled as a Pod: quantities must match
      the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'. Received
      status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=Pod
      in version \"v1\" cannot be handled as a Pod: quantities must match the regular
      expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', metadata=ListMeta(_continue=null,
      remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}),
      reason=BadRequest, status=Failure, additionalProperties={}).\n\tat io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:671)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:651)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:600)\n\tat
      io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:560)\n\tat
      java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)\n\tat
      java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)\n\tat
      java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)\n\tat
      io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140)\n\tat
      java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)\n\tat
      java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
      Source)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown
      Source)\n\tat java.base/java.util.concurrent.CompletableFuture.complete(Unknown
      Source)\n\tat io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)\n\tat
      java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)\n\tat
      java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
      Source)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown
      Source)\n\tat java.base/java.util.concurrent.CompletableFuture.complete(Unknown
      Source)\n\tat io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:137)\n\tat
      java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat
      java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat
      java.base/java.lang.Thread.run(Unknown Source)\n24/09/27 14:55:50 INFO ShutdownHookManager:
      Shutdown hook called\n24/09/27 14:55:50 INFO ShutdownHookManager: Deleting directory
      /tmp/spark-2fe1d114-2f30-44b5-9a62-89db1478492f\n"
    state: FAILED

First thing to note on the log line ERROR Client: Please check \"kubectl auth can-i create pod\" first. It should be yes.: the CR uses a serviceAccount that does have the appropriate permissions to perform full CRUD operations on the pods resource, just to rule that out before anyone asks.

I made no changes to the resource values between spark-operator-chart-1.4.6 and v2.0.1. My driver & executor resource requests essentially look like this:

driver:
    cores: 2
    coreLimit: 8124m
    memory: 6123m
executor:
    cores: 2
    coreLimit: 8124m
    memory: 4123m
    instances: 2
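For what it's worth, each of those quantity strings can be checked directly against the regex quoted in the error; a quick sketch (the regex is copied verbatim from the API server message):

```python
import re

# Resource-quantity regex quoted in the API server error message.
QUANTITY_RE = re.compile(r'^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$')

for value in ["2", "8124m", "6123m", "4123m", ""]:
    ok = QUANTITY_RE.fullmatch(value) is not None
    print(f"{value!r}: {'valid' if ok else 'invalid'}")
```

All four values from the manifest pass, so these strings as written are not what the API server is rejecting; something appears to be altering them between the operator and the API server (an empty or whitespace-containing quantity, for example, fails).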

After enabling debug logs on the operator-controller, I can see that these values are correctly passed in and submitted as --conf arguments, but it fails directly after that.

This smells to me like an issue with spark:3.5.1, but I am not entirely sure. I will post the full SparkApplication below for reference.

Reproduction Code [Required]

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: cian-test
  namespace: spark-operator
spec:
  driver:
    annotations:
      ad.datadoghq.com/spark-kubernetes-driver.check_names: '["prometheus"]'
      ad.datadoghq.com/spark-kubernetes-driver.init_configs: '[{}]'
      ad.datadoghq.com/spark-kubernetes-driver.instances: "\n[\n  {\n    \"prometheus_url\":
        \"http://%%host%%:8090/metrics\",\n    \"namespace\": \"spark-operator\",\n
        \   \"metrics\": [\"*\"],\n    \"tags\": []\n  }\n]\n        "
    cores: 2
    coreLimit: 8124m
    memory: 6123m
    javaOptions: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -Dlog4j.configuration=file:/opt/log4j.properties
    nodeSelector:
      node-type: node-ssd
    podSecurityContext:
      fsGroup: 185
    serviceAccount: spark-operator
    tolerations:
    - effect: NoSchedule
      key: compute/nodegroup
      operator: Equal
      value: node-ssd
    volumeMounts:
    - mountPath: /data/spark/temp
      name: spark-data
    - mountPath: /var/lib/containerd/spark
      name: spark-local-dir-nvme
  executor:
    annotations:
      ad.datadoghq.com/spark-kubernetes-executor.check_names: '["prometheus"]'
      ad.datadoghq.com/spark-kubernetes-executor.init_configs: '[{}]'
      ad.datadoghq.com/spark-kubernetes-executor.instances: "\n[\n  {\n    \"prometheus_url\":
        \"http://%%host%%:8090/metrics\",\n    \"namespace\": \"spark-operator\",\n
        \   \"metrics\": [\"*\"],\n    \"tags\": []\n  }\n]\n        "
    cores: 2
    coreLimit: 8124m
    memory: 4123m
    instances: 2
    javaOptions: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -Dlog4j.configuration=file:/opt/log4j.properties
    nodeSelector:
      node-type: node-ssd
    podSecurityContext:
      fsGroup: 185
    serviceAccount: spark-operator
    tolerations:
    - effect: NoSchedule
      key: compute/nodegroup
      operator: Equal
      value: node-ssd
    volumeMounts:
    - mountPath: /data/spark/temp
      name: spark-data
    - mountPath: /var/lib/containerd/spark
      name: spark-local-dir-nvme
  image: my-custom-image:v1
  mainApplicationFile: s3a://my-bucket/my-jar.jar
  mainClass: com.myClass.Cian.Application
  mode: cluster
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: /opt/spark/jars/jmx_prometheus_javaagent-0.11.0.jar
      port: 8090
  restartPolicy:
    type: Never
  sparkConf:
    spark.decommission.enabled: "true"
    spark.dynamicAllocation.shuffleTracking.enabled: "true"
    spark.eventLog.dir: s3a://my-s3-bucket/logs
    spark.eventLog.enabled: "true"
    # spark.kubernetes.memoryOverheadFactor: "0.1"
    spark.storage.decommission.enabled: "true"
    spark.storage.decommission.rddBlocks.enabled: "true"
    spark.storage.decommission.shuffleBlocks.enabled: "true"
  sparkUIOptions:
    servicePort: 4040
    servicePortName: spark-driver-ui-port
    serviceType: ""
  sparkVersion: 3.4.1
  timeToLiveSeconds: 3600
  type: Scala
  volumes:
  - name: api-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 31536000
          path: token
  - name: spark-data
    persistentVolumeClaim:
      claimName: spark-operator-efs-pvc
  - emptyDir: {}
    name: spark-local-dir-nvme

Expected behavior

Driver & Executor pods should spin up and job should start.

Actual behavior

Job submission fails.

Terminal Output Screenshot(s)

Environment & Versions

Additional context

cc: @ChenYi015 @jacobsalway

jacobsalway commented 1 day ago

Hey @Cian911, I'm not able to reproduce this locally so far with those values. From experience, you get this error if coreRequest or coreLimit doesn't conform to the Kubernetes resource quantity syntax. Do you have any mutating webhooks on the cluster that might mutate the request or limit fields on pod creation?
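One way to narrow that down could be to capture the pod spec the API server actually receives (e.g. from an audit log or a server-side dry-run) and test every resource quantity in it against the regex from the error. A minimal sketch; the pod spec below and its mangled `limits.cpu` value are hypothetical:

```python
import json
import re

# Resource-quantity regex quoted verbatim in the API server error.
QUANTITY_RE = re.compile(r'^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$')

def bad_quantities(pod):
    """Yield (container, field, value) for quantities the regex would reject."""
    for c in pod.get("spec", {}).get("containers", []):
        for kind, quantities in c.get("resources", {}).items():  # requests/limits
            for name, value in quantities.items():
                if not QUANTITY_RE.fullmatch(str(value)):
                    yield (c.get("name"), f"{kind}.{name}", value)

# Hypothetical pod spec in which a webhook has mangled the CPU limit:
pod = json.loads("""
{"spec": {"containers": [{"name": "spark-kubernetes-driver",
  "resources": {"requests": {"cpu": "2", "memory": "6123m"},
                "limits":   {"cpu": "8124 m", "memory": "6123m"}}}]}}
""")
print(list(bad_quantities(pod)))
# → [('spark-kubernetes-driver', 'limits.cpu', '8124 m')]
```

If a mutating webhook is rewriting the request or limit fields, the offending container and field should show up here even though the SparkApplication manifest itself looks fine.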