kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[BUG] Service account not working in a different namespace #2103

Open devscheffer opened 3 months ago

devscheffer commented 3 months ago

Description

I use the Spark operator Helm chart, deployed in the spark-operator namespace. In the HelmRelease I set sparkJobNamespaces: spark-jobs, which is the namespace where I want to run the jobs. However, I'm getting this error:

Name: "pyspark-pi", Namespace: "spark-jobs"
from server for: "STDIN": sparkapplications.sparkoperator.k8s.io "pyspark-pi" is forbidden: User "system:serviceaccount:spark-jobs:spark-sa" cannot get resource "sparkapplications" in API group "sparkoperator.k8s.io" in the namespace "spark-jobs"
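
The same denial can be reproduced with kubectl's built-in impersonation check, using the names from the error above:

# Ask the API server whether the driver service account may read
# SparkApplication objects; "no" confirms the missing RBAC grant.
kubectl auth can-i get sparkapplications.sparkoperator.k8s.io \
  --as=system:serviceaccount:spark-jobs:spark-sa \
  -n spark-jobs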

ChenYi015 commented 3 months ago

@devscheffer Could you provide detailed information about how you installed the Helm chart? Is the service account spark-sa created by Helm or by yourself?

devscheffer commented 2 months ago

It is created by the Helm chart.

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  labels:
    app: spark-operator
  name: spark-operator
  namespace: spark-operator
spec:
  chart:
    spec:
      chart: spark-operator
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: spark-operator
      version: 1.4.0
  interval: 5m0s
  releaseName: spark-operator
  values:
    image:
      repository: docker.io/kubeflow/spark-operator
      pullPolicy: IfNotPresent
      tag: ""
    rbac:
      create: false
      createRole: true
      createClusterRole: true
      annotations: {}
    serviceAccounts:
      spark:
        create: true
        name: "spark-sa"
        annotations: {}
      sparkoperator:
        create: true
        name: "spark-operator-sa"
        annotations: {}
    sparkJobNamespaces:
      - spark-operator
      - team-1
    webhook:
      enable: true
      port: 443
      portName: webhook
      namespaceSelector: ""
      timeout: 30
    metrics:
      enable: true
      port: 10254
      portName: metrics
      endpoint: /metrics
      prefix: ""  
    tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
        effect: "NoSchedule"

It works when I apply the manifest manually from the terminal. However, when I execute it from Airflow, I get this error:

from server for: "STDIN": sparkapplications.sparkoperator.k8s.io "pyspark-pi2" is forbidden: User "system:serviceaccount:team-1:spark-sa" cannot get resource "sparkapplications" in API group "sparkoperator.k8s.io" in the namespace "team-1"

Here is the task in Airflow:

# The operator lives in the cncf.kubernetes provider; recent releases
# expose it under this module path.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

spark_kpo = KubernetesPodOperator(
    task_id="kpo",
    name="spark-app-submission",
    namespace=namespace,
    image="bitnami/kubectl:1.28.11",
    cmds=["/bin/bash", "-c"],
    # Pipe the rendered SparkApplication manifest into kubectl apply.
    arguments=[f"echo '{spark_app_manifest_content}' | kubectl apply -f -"],
    in_cluster=True,
    get_logs=True,
    # The submission pod authenticates as this service account.
    service_account_name=service_account_name,
    on_finish_action="keep_pod",
)
ChenYi015 commented 2 months ago

@devscheffer The service account spark-sa does not have any permissions on SparkApplication resources; it is the account used by the Spark driver pods. If you want to submit SparkApplications from Airflow, you can set the service account name in KubernetesPodOperator to spark-operator-sa instead. Alternatively, you can create a ServiceAccount yourself and grant it full permissions on SparkApplication resources, as sketched below.
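
A minimal sketch of that second option, assuming the team-1 job namespace from the error above (the Role name is illustrative):

# Role granting full access to SparkApplication objects in team-1.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-application-submitter
  namespace: team-1
rules:
  - apiGroups: ["sparkoperator.k8s.io"]
    resources: ["sparkapplications", "sparkapplications/status"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Bind the Role to the service account that submits the manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-application-submitter
  namespace: team-1
subjects:
  - kind: ServiceAccount
    name: spark-sa
    namespace: team-1
roleRef:
  kind: Role
  name: spark-application-submitter
  apiGroup: rbac.authorization.k8s.io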

alexz0nder commented 6 days ago

Hello. I'd like to report that I get the same result. I deployed chart v2.0.2 like so:

helm install spark-operator ./spark-operator \
    --version 2.0.2 \
    --create-namespace \
    --namespace spark-operator \
    --set 'spark.jobNamespaces={,airflow}' \
    --values ./values.yaml

The values.yaml for it was:

nameOverride: ""
fullnameOverride: ""
commonLabels: {}

image:
  registry: docker.io
  repository: kubeflow/spark-operator
  tag: ""
  pullPolicy: IfNotPresent
  pullSecrets: []

controller:
  replicas: 1
  workers: 10
  logLevel: info
  uiService:
    enable: true
  uiIngress:
    enable: false
    urlFormat: ""
  batchScheduler:
    enable: true
    kubeSchedulerNames:
      - volcano
    default: ""
  serviceAccount:
    create: true
    name: ""
    annotations: {}
  rbac:
    create: true
    annotations: {}
  labels: {}
  annotations: {}
  volumes: []
  nodeSelector: {}
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - test-node
  tolerations:
    - key: "airflow"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  priorityClassName: ""
  podSecurityContext: {}
  topologySpreadConstraints: []
  env: []
  envFrom: []
  volumeMounts: []
  resources: {}
  securityContext: {}
  sidecars: []
  podDisruptionBudget:
    enable: false
    minAvailable: 1
  pprof:
    enable: false
    port: 6060
    portName: pprof
  workqueueRateLimiter:
    bucketQPS: 50
    bucketSize: 500
    maxDelay:
      enable: true
      duration: 6h

webhook:
  enable: true
  replicas: 1
  logLevel: info
  port: 9443
  portName: webhook
  failurePolicy: Fail
  timeoutSeconds: 10
  resourceQuotaEnforcement:
    enable: false
  serviceAccount:
    create: true
    name: ""
    annotations: {}
  rbac:
    create: true
    annotations: {}
  labels: {}
  annotations: {}
  sidecars: []
  volumes: []
  nodeSelector: {}
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - test-node
  tolerations:
    - key: "airflow"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  priorityClassName: ""
  podSecurityContext: {}
  topologySpreadConstraints: []
  env: []
  envFrom: []
  volumeMounts: []
  resources: {}
  securityContext: {}
  podDisruptionBudget:
    enable: false
    minAvailable: 1

spark:
  jobNamespaces:
  - "airflow"
  serviceAccount:
    create: true
    name: ""
    annotations: {}
  rbac:
    create: true
    annotations: {}

prometheus:
  metrics:
    enable: true
    port: 8080
    portName: metrics
    endpoint: /metrics
    prefix: ""
  podMonitor:
    create: true
    labels: {}
    jobLabel: spark-operator-podmonitor
    podMetricsEndpoint:
      scheme: http
      interval: 5s
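
For reference, the chart grants RBAC permissions only to the service accounts it creates itself; what it set up in the job namespace can be inspected with:

# List the service accounts and RBAC objects the chart created in the
# job namespace; none of them cover Airflow's own worker account.
kubectl get serviceaccount,role,rolebinding -n airflow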

Right after that, when I run a DAG from Airflow, the resulting spark-submit pod fails with the following error:

Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '35324a3b-9f01-4c3b-bf56-445ea8746423', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': '8bae74e0-9f4b-483f-8878-77b94fe77097', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b1662841-0cf0-4ed4-8ade-b34262bca683', 'Date': 'Fri, 18 Oct 2024 08:05:50 GMT', 'Content-Length': '483'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"sparkapplications.sparkoperator.k8s.io \"spark-submit-soyzhqvo\" is forbidden: User \"system:serviceaccount:transgran-spreads:airflow-worker\" cannot get resource \"sparkapplications/status\" in API group \"sparkoperator.k8s.io\" in the namespace \"airflow\"","reason":"Forbidden","details":{"name":"spark-submit-soyzhqvo","group":"sparkoperator.k8s.io","kind":"sparkapplications"},"code":403}

This can be fixed by adding the missing RBAC grant manually.
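
A minimal sketch of such a grant, following the same Role/RoleBinding pattern as above (names are taken from the error message and are illustrative):

# Assumes a Role like the one sketched earlier, created in the airflow
# namespace, where the SparkApplication objects live.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-spark-submitter
  namespace: airflow
subjects:
  - kind: ServiceAccount
    name: airflow-worker          # the account from the Forbidden error
    namespace: transgran-spreads  # subjects may live in another namespace
roleRef:
  kind: Role
  name: airflow-spark-submitter
  apiGroup: rbac.authorization.k8s.io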

With all of the above, I'd like to ask: why isn't this fix part of the Helm chart?