kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[QUESTION] how do you actually use `envFrom` with a secret? #2134

Closed jesumyip closed 1 month ago

jesumyip commented 1 month ago

I've tried

  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "768m"
    envFrom:
      - secretRef:
          name: mysecrets

and when I run `kubectl describe pod` on the driver, I don't see those env vars being picked up.

mysecrets is an Opaque-type Secret.
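
For reference, it was created roughly along these lines (the key name and value below are placeholders, not the real contents):

apiVersion: v1
kind: Secret
metadata:
  name: mysecrets
type: Opaque
stringData:
  MY_SECRET_VAR: "example-value"  # placeholder key/value for illustration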

To test whether the spark operator webhook is working, I tried switching the YAML config to:

  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "768m"
    env:
      - name: MY_VAR
        value: "some random value"

and that works just fine.

Am I doing this wrong? I am using version 1.4.6 of the Helm chart.

ChenYi015 commented 1 month ago

@jesumyip Could you provide details on how you installed the Helm chart?

ChenYi015 commented 1 month ago

You can try out the latest version if you'd like, as the new version has fixed many webhook-related problems.

jesumyip commented 1 month ago

Hi @ChenYi015

I have tried the latest version you provided.

spark:
  serviceAccount:
    create: true
    name: spark-sa


- Everything was created with no errors. Two pods are running: one for `spark-operator-controller` and one for `spark-operator-webhook`.

- I then created a `SparkApplication`

apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: test-hosts namespace: xxx spec: type: Python mode: cluster image: "" imagePullPolicy: Always imagePullSecrets:

And I waited about 1 minute but still no pod was created in the namespace xxx. I checked the logs for the operator and webhook pods and there was nothing new, only the logs written when the two pods started up.

>> kubectl get sparkapplication

NAME                STATUS   ATTEMPTS   START   FINISH   AGE
test-hosts                                        9m52s

jesumyip commented 1 month ago

Are there some permissions that are incorrectly set? But I don't see any errors logged in the two pods in the spark-operator namespace...

operator pod logs

++ id -u
+ uid=0
++ id -g
+ gid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator controller start --zap-log-level=debug --namespaces=default --controller-threads=10 --enable-ui-service=true --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-controller-lock --leader-election-lock-namespace=spark-operator
Spark Operator Version: v2.0.0-rc.0+unknown
Build Date: 2024-08-12T02:57:44+00:00
Git Commit ID: 
Git Tree State: clean
Go Version: go1.22.5
Compiler: gc
Platform: linux/amd64
2024-08-20T14:32:27.118Z        INFO    controller/start.go:251 Starting manager
2024-08-20T14:32:27.119Z        INFO    controller-runtime.metrics      server/server.go:205    Starting metrics server
2024-08-20T14:32:27.119Z        INFO    manager/server.go:50    starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-08-20T14:32:27.119Z        INFO    controller-runtime.metrics      server/server.go:244    Serving metrics server  {"bindAddress": ":8080", "secure": false}
I0820 14:32:27.119306      10 leaderelection.go:250] attempting to acquire leader lease spark-operator/spark-operator-controller-lock...
I0820 14:32:27.136595      10 leaderelection.go:260] successfully acquired lease spark-operator/spark-operator-controller-lock
2024-08-20T14:32:27.136Z        DEBUG   events  recorder/recorder.go:104        spark-operator-controller-5f7497d6f5-9lxl4_ea1b7250-f6fd-42ec-9bbc-debb1a803c58 became leader     {"type": "Normal", "object": {"kind":"Lease","namespace":"spark-operator","name":"spark-operator-controller-lock","uid":"ef251560-cdef-4b4f-9080-ec9a4eecab1f","apiVersion":"coordination.k8s.io/v1","resourceVersion":"5067755"}, "reason": "LeaderElection"}
2024-08-20T14:32:27.136Z        INFO    controller/controller.go:178    Starting EventSource    {"controller": "spark-application-controller", "source": "kind source: *v1.Pod"}
2024-08-20T14:32:27.136Z        INFO    controller/controller.go:178    Starting EventSource    {"controller": "scheduled-spark-application-controller", "source": "kind source: *v1beta2.ScheduledSparkApplication"}
2024-08-20T14:32:27.136Z        INFO    controller/controller.go:178    Starting EventSource    {"controller": "spark-application-controller", "source": "kind source: *v1beta2.SparkApplication"}
2024-08-20T14:32:27.136Z        INFO    controller/controller.go:186    Starting Controller     {"controller": "scheduled-spark-application-controller"}
2024-08-20T14:32:27.136Z        INFO    controller/controller.go:186    Starting Controller     {"controller": "spark-application-controller"}
2024-08-20T14:32:27.237Z        INFO    controller/controller.go:220    Starting workers        {"controller": "spark-application-controller", "worker count": 10}
2024-08-20T14:32:27.237Z        INFO    controller/controller.go:220    Starting workers        {"controller": "scheduled-spark-application-controller", "worker count": 10}

webhook pod

++ id -u
+ uid=0
++ id -g
+ gid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator webhook start --zap-log-level=debug --namespaces=default --webhook-secret-name=spark-operator-webhook-certs --webhook-secret-namespace=spark-operator --webhook-svc-name=spark-operator-webhook-svc --webhook-svc-namespace=spark-operator --webhook-port=9443 --mutating-webhook-name=spark-operator-webhook --validating-webhook-name=spark-operator-webhook --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-webhook-lock --leader-election-lock-namespace=spark-operator
Spark Operator Version: v2.0.0-rc.0+unknown
Build Date: 2024-08-12T02:57:44+00:00
Git Commit ID: 
Git Tree State: clean
Go Version: go1.22.5
Compiler: gc
Platform: linux/amd64
2024-08-20T14:32:27.297Z        INFO    webhook/start.go:243    Syncing webhook secret  {"name": "spark-operator-webhook-certs", "namespace": "spark-operator"}
2024-08-20T14:32:27.772Z        INFO    webhook/start.go:257    Writing certificates    {"path": "/etc/k8s-webhook-server/serving-certs", "certificate name": "tls.crt", "key name": "tls.key"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.builder      builder/webhook.go:158  Registering a mutating webhook  {"GVK": "sparkoperator.k8s.io/v1beta2, Kind=SparkApplication", "path": "/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.webhook      webhook/server.go:183   Registering webhook     {"path": "/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.builder      builder/webhook.go:189  Registering a validating webhook {"GVK": "sparkoperator.k8s.io/v1beta2, Kind=SparkApplication", "path": "/validate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.webhook      webhook/server.go:183   Registering webhook     {"path": "/validate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.builder      builder/webhook.go:158  Registering a mutating webhook  {"GVK": "sparkoperator.k8s.io/v1beta2, Kind=ScheduledSparkApplication", "path": "/mutate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.webhook      webhook/server.go:183   Registering webhook     {"path": "/mutate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.builder      builder/webhook.go:189  Registering a validating webhook {"GVK": "sparkoperator.k8s.io/v1beta2, Kind=ScheduledSparkApplication", "path": "/validate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.webhook      webhook/server.go:183   Registering webhook     {"path": "/validate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.builder      builder/webhook.go:158  Registering a mutating webhook  {"GVK": "/v1, Kind=Pod", "path": "/mutate--v1-pod"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.webhook      webhook/server.go:183   Registering webhook     {"path": "/mutate--v1-pod"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.builder      builder/webhook.go:204  skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called       {"GVK": "/v1, Kind=Pod"}
2024-08-20T14:32:27.773Z        INFO    webhook/start.go:319    Starting manager
2024-08-20T14:32:27.773Z        INFO    controller-runtime.metrics      server/server.go:205    Starting metrics server
2024-08-20T14:32:27.773Z        INFO    manager/server.go:50    starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-08-20T14:32:27.773Z        INFO    controller-runtime.webhook      webhook/server.go:191   Starting webhook server
2024-08-20T14:32:27.774Z        INFO    controller-runtime.metrics      server/server.go:244    Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-08-20T14:32:27.774Z        INFO    webhook/start.go:357    disabling http/2
2024-08-20T14:32:27.774Z        DEBUG   controller-runtime.healthz      healthz/healthz.go:60   healthz check failed    {"checker": "readyz", "error": "webhook server has not been started yet"}
2024-08-20T14:32:27.774Z        INFO    controller-runtime.healthz      healthz/healthz.go:128  healthz check failed    {"statuses": [{}]}
I0820 14:32:27.774433      10 leaderelection.go:250] attempting to acquire leader lease spark-operator/spark-operator-webhook-lock...
2024-08-20T14:32:27.774Z        INFO    controller-runtime.certwatcher  certwatcher/certwatcher.go:161  Updated current TLS certificate
2024-08-20T14:32:27.774Z        INFO    controller-runtime.webhook      webhook/server.go:242   Serving webhook server  {"host": "", "port": 9443}
2024-08-20T14:32:27.774Z        INFO    controller-runtime.certwatcher  certwatcher/certwatcher.go:115  Starting certificate watcher
I0820 14:32:27.791240      10 leaderelection.go:260] successfully acquired lease spark-operator/spark-operator-webhook-lock
2024-08-20T14:32:27.791Z        INFO    controller/controller.go:178    Starting EventSource    {"controller": "validating-webhook-configuration-controller", "source": "kind source: *v1.ValidatingWebhookConfiguration"}
2024-08-20T14:32:27.791Z        INFO    controller/controller.go:178    Starting EventSource    {"controller": "mutating-webhook-configuration-controller", "source": "kind source: *v1.MutatingWebhookConfiguration"}
2024-08-20T14:32:27.791Z        INFO    controller/controller.go:186    Starting Controller     {"controller": "validating-webhook-configuration-controller"}
2024-08-20T14:32:27.791Z        INFO    controller/controller.go:186    Starting Controller     {"controller": "mutating-webhook-configuration-controller"}
2024-08-20T14:32:27.791Z        DEBUG   events  recorder/recorder.go:104        spark-operator-webhook-75d88ff76d-549nw_aab28de5-4e4d-49ca-931c-c319031dbdba became leader        {"type": "Normal", "object": {"kind":"Lease","namespace":"spark-operator","name":"spark-operator-webhook-lock","uid":"29e67682-4868-46a9-a954-592b2ad0d6cb","apiVersion":"coordination.k8s.io/v1","resourceVersion":"5067773"}, "reason": "LeaderElection"}
2024-08-20T14:32:27.892Z        INFO    validatingwebhookconfiguration/event_handler.go:46      ValidatingWebhookConfiguration created    {"name": "spark-operator-webhook"}
2024-08-20T14:32:27.892Z        INFO    controller/controller.go:220    Starting workers        {"controller": "validating-webhook-configuration-controller", "worker count": 1}
2024-08-20T14:32:27.892Z        INFO    controller/controller.go:220    Starting workers        {"controller": "mutating-webhook-configuration-controller", "worker count": 1}
2024-08-20T14:32:27.892Z        INFO    mutatingwebhookconfiguration/event_handler.go:46        MutatingWebhookConfiguration created      {"name": "spark-operator-webhook"}
2024-08-20T14:32:27.897Z        INFO    mutatingwebhookconfiguration/controller.go:72   Updating CA bundle of MutatingWebhookConfiguration        {"name": "spark-operator-webhook"}
2024-08-20T14:32:27.897Z        INFO    validatingwebhookconfiguration/controller.go:73 Updating CA bundle of ValidatingWebhookConfiguration      {"name": "spark-operator-webhook"}
2024-08-20T14:32:27.907Z        INFO    mutatingwebhookconfiguration/event_handler.go:68        MutatingWebhookConfiguration updated      {"name": "spark-operator-webhook", "namespace": ""}
2024-08-20T14:32:27.912Z        INFO    validatingwebhookconfiguration/event_handler.go:68      ValidatingWebhookConfiguration updated    {"name": "spark-operator-webhook", "namespace": ""}
2024-08-20T14:32:27.917Z        INFO    mutatingwebhookconfiguration/controller.go:72   Updating CA bundle of MutatingWebhookConfiguration        {"name": "spark-operator-webhook"}
2024-08-20T14:32:27.917Z        INFO    validatingwebhookconfiguration/controller.go:73 Updating CA bundle of ValidatingWebhookConfiguration      {"name": "spark-operator-webhook"}

jesumyip commented 1 month ago

I also tried this values file where I modified the spark job namespaces

spark:
  jobNamespaces:
    - "xxx"

I notice in the webhook pod the startup parameter is still shown as

+ exec /usr/bin/tini -s -- /usr/bin/spark-operator webhook start --zap-log-level=debug --namespaces=default....

Is --namespaces=default the reason the SparkApplication never gets processed?

ChenYi015 commented 1 month ago

> I also tried this values file where I modified the spark job namespaces
>
> spark:
>   jobNamespaces:
>     - "xxx"
>
> I notice in the webhook pod the startup parameter is still shown as
>
> + exec /usr/bin/tini -s -- /usr/bin/spark-operator webhook start --zap-log-level=debug --namespaces=default....
>
> Is --namespaces=default the reason the SparkApplication never gets processed?

I have just tried setting spark.jobNamespaces to [test]:

helm install spark-operator spark-operator/spark-operator \
    --version 2.0.0-rc.0 \
    --create-namespace \
    --namespace spark-operator \
    --set 'spark.jobNamespaces={test}'

and the webhook pod's logs show that the namespaces were correctly set:

+ exec /usr/bin/tini -s -- /usr/bin/spark-operator webhook start --zap-log-level=info --namespaces=test --webhook-secret-name=spark-operator-webhook-certs --webhook-secret-namespace=spark-operator --webhook-svc-name=spark-operator-webhook-svc --webhook-svc-namespace=spark-operator --webhook-port=9443 --mutating-webhook-name=spark-operator-webhook --validating-webhook-name=spark-operator-webhook --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-webhook-lock --leader-election-lock-namespace=spark-operator

ChenYi015 commented 1 month ago

spark:
  jobNamespaces:
    - ""

controller:
  logLevel: "debug"

webhook:
  logLevel: "debug"

spark:
  serviceAccount:
    create: true
    name: spark-sa

@jesumyip There is an issue related to cache settings when setting spark.jobNamespaces to all namespaces (""), and this will be fixed in PRs #2123 and #2128. So you need to set the job namespaces to specific namespaces instead of [""].
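
For example, the values-file equivalent of the --set flag above, listing specific job namespaces (the names here are just the ones used in this thread), would be:

spark:
  jobNamespaces:
    - test
    - xxx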

jesumyip commented 1 month ago

Looks like the Helm chart isn't compatible with kustomize. I used kustomize to install it and the namespaces for the webhook aren't picked up correctly; it still gets shown as --namespaces=default.

kustomize build . --enable-helm > output.yaml

shows this:

[screenshot: rendered webhook args still show --namespaces=default]

Interestingly enough, when I modify the Helm chart at line 54 of webhook/deployment.yaml to become

        {{- with .Values.duh.fish }}
        - --namespaces={{ . | join "," }}
        {{- end }}

and I set my values file to:

duh:
  fish:
    - "xxx"
    - "test"

then the output is correct. I actually see

        - --namespaces=xxx,test

The value default seems to be picked up from the values.yaml file bundled with the Helm chart. I cannot seem to override it with my own values file.
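
For anyone reproducing this, a kustomization that installs the chart with an external values file looks roughly like the sketch below; the repo URL, chart version, and file names are assumptions for illustration, not the exact setup used here:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

helmCharts:
  - name: spark-operator
    repo: https://kubeflow.github.io/spark-operator  # assumed chart repo
    version: 2.0.0-rc.0
    releaseName: spark-operator
    namespace: spark-operator
    valuesFile: values.yaml  # the overrides that fail to take effect above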

jesumyip commented 1 month ago

@ChenYi015 Now when I try installing it with

helm install spark-operator spark-operator/spark-operator \
    --version 2.0.0-rc.0 \
    --create-namespace \
    --namespace spark-operator \
    --set 'spark.jobNamespaces={test,xxx}' \
    --set 'spark.serviceAccount.name=spark-sa' \
    --set 'spark.serviceAccount.create=true'

I can see the startup parameter for the webhook becomes --namespaces=test,xxx which is expected.

But when I apply the SparkApplication I can only see a Service (svc) being created in namespace test. There is no pod. There are also no additional logs in the controller and webhook pods. In the driver pod logs, I can see this:

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/bladerunner/pods/xxx-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "xxx" is forbidden: User "system:serviceaccount:test:spark-sa" cannot get resource "pods" in API group "" in the namespace "test": RBAC: role.rbac.authorization.k8s.io "spark-sa" not found.

Now if I then reinstall the helm chart with

helm install spark-operator spark-operator/spark-operator \
    --version 2.0.0-rc.0 \
    --create-namespace \
    --namespace spark-operator \
    --set 'spark.jobNamespaces={test,xxx}'

and change the service account in my SparkApplication YAML to <helmchart-releasename>-spark, then the driver pod is created properly. I can also see that the driver pod has the envFrom applied correctly.
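
For reference, the driver section that ends up working looks roughly like this; the service account name assumes the Helm release is called spark-operator, so adjust <release>-spark to your release name:

driver:
  cores: 1
  coreLimit: "1200m"
  memory: "768m"
  serviceAccount: spark-operator-spark  # i.e. <helmchart-releasename>-spark
  envFrom:
    - secretRef:
        name: mysecrets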

ChenYi015 commented 1 month ago

@jesumyip Thanks for reporting the issue. The spark RoleBinding template does not render properly when spark.serviceAccount.name is set; I will fix it in the next release.
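
Until the fix lands, a manually created namespace-scoped Role/RoleBinding for the service account can work around the Forbidden error above. A rough sketch (the resource list and verbs are illustrative, not the chart's exact output):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-sa  # the role name the error message reports as missing
  namespace: test
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "delete", "deletecollection"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-sa
  namespace: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-sa
subjects:
  - kind: ServiceAccount
    name: spark-sa
    namespace: test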

jesumyip commented 1 month ago

@ChenYi015 please also have a look at the strange spark.jobNamespaces behaviour. I cannot seem to override the value provided in the default values.yaml file.

jesumyip commented 1 month ago

@ChenYi015 Never mind, I found the problem with the spark.jobNamespaces behaviour. It was my mistake: my values file had two spark: sections.

ChenYi015 commented 1 month ago

/kind bug