kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

spark-operator v2.0.2 - listen tcp :443: bind: permission denied #2331

Open karanalang opened 19 hours ago

karanalang commented 19 hours ago

What happened?

Command -

helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --set image.tag=2.0.2 \
  --create-namespace \
  --set webhook.enable=true \
  --set webhook.port=443 \
  --set webhook.namespaceSelector="spark-webhook-enabled=true" \
  --set webhook.containerSecurityContext.privileged=true \
  --set webhook.containerSecurityContext.capabilities.add[0]=NET_BIND_SERVICE \
  --set logLevel=debug \
  --set enableResourceQuotaEnforcement=true \
  --set webhook.failOnError=true \
  --set controller.resources.limits.cpu=100m \
  --set controller.resources.limits.memory=200Mi \
  --set controller.resources.requests.cpu=50m \
  --set controller.resources.requests.memory=100Mi \
  --set webhook.resources.limits.cpu=100m \
  --set webhook.resources.limits.memory=200Mi \
  --set webhook.resources.requests.cpu=50m \
  --set webhook.resources.requests.memory=100Mi \
  --set "sparkJobNamespaces={spark-apps}" \
  --set webhook.containerSecurityContext.runAsUser=0

The spark-operator controller pod starts, but the webhook pod is failing -

NAME                                             READY   STATUS    RESTARTS      AGE
pod/spark-operator-controller-688c7c9955-tkdpf   1/1     Running   0             3m15s
pod/spark-operator-webhook-567bd94f66-tg567      0/1     Error     5 (94s ago)   3m15s

NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/spark-operator-webhook-svc   ClusterIP   10.108.242.219   <none>        443/TCP   3m15s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/spark-operator-controller   1/1     1            1           3m15s
deployment.apps/spark-operator-webhook      0/1     1            0           3m15s

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/spark-operator-controller-688c7c9955   1         1         1       3m15s
replicaset.apps/spark-operator-webhook-567bd94f66      1         1         0       3m15s

Logs from webhook pod -

(base) Karans-MacBook-Pro:~ karanalang$ kc logs -f pod/spark-operator-webhook-567bd94f66-tg567  -n so350
++ id -u
+ uid=185
++ id -g
+ gid=185
+ set +e
++ getent passwd 185
+ uidentry=spark:x:185:185::/home/spark:/bin/sh
+ set -e
+ [[ -z spark:x:185:185::/home/spark:/bin/sh ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator webhook start --zap-log-level=info --namespaces=default --webhook-secret-name=spark-operator-webhook-certs --webhook-secret-namespace=so350 --webhook-svc-name=spark-operator-webhook-svc --webhook-svc-namespace=so350 --webhook-port=443 --mutating-webhook-name=spark-operator-webhook --validating-webhook-name=spark-operator-webhook --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-webhook-lock --leader-election-lock-namespace=so350
Spark Operator Version: 2.0.2+HEAD+unknown
Build Date: 2024-10-11T01:46:23+00:00
Git Commit ID: 
Git Tree State: clean
Go Version: go1.23.1
Compiler: gc
Platform: linux/amd64
2024-11-21T20:56:37.838Z    INFO    webhook/start.go:244    Syncing webhook secret  {"name": "spark-operator-webhook-certs", "namespace": "so350"}
2024-11-21T20:56:37.936Z    INFO    webhook/start.go:258    Writing certificates    {"path": "/etc/k8s-webhook-server/serving-certs", "certificate name": "tls.crt", "key name": "tls.key"}
2024-11-21T20:56:38.036Z    INFO    controller-runtime.builder  builder/webhook.go:158  Registering a mutating webhook  {"GVK": "sparkoperator.k8s.io/v1beta2, Kind=SparkApplication", "path": "/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z    INFO    controller-runtime.webhook  webhook/server.go:183   Registering webhook {"path": "/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z    INFO    controller-runtime.builder  builder/webhook.go:189  Registering a validating webhook    {"GVK": "sparkoperator.k8s.io/v1beta2, Kind=SparkApplication", "path": "/validate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z    INFO    controller-runtime.webhook  webhook/server.go:183   Registering webhook {"path": "/validate-sparkoperator-k8s-io-v1beta2-sparkapplication"}
2024-11-21T20:56:38.036Z    INFO    controller-runtime.builder  builder/webhook.go:158  Registering a mutating webhook  {"GVK": "sparkoperator.k8s.io/v1beta2, Kind=ScheduledSparkApplication", "path": "/mutate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z    INFO    controller-runtime.webhook  webhook/server.go:183   Registering webhook {"path": "/mutate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z    INFO    controller-runtime.builder  builder/webhook.go:189  Registering a validating webhook    {"GVK": "sparkoperator.k8s.io/v1beta2, Kind=ScheduledSparkApplication", "path": "/validate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z    INFO    controller-runtime.webhook  webhook/server.go:183   Registering webhook {"path": "/validate-sparkoperator-k8s-io-v1beta2-scheduledsparkapplication"}
2024-11-21T20:56:38.037Z    INFO    controller-runtime.builder  builder/webhook.go:158  Registering a mutating webhook  {"GVK": "/v1, Kind=Pod", "path": "/mutate--v1-pod"}
2024-11-21T20:56:38.037Z    INFO    controller-runtime.webhook  webhook/server.go:183   Registering webhook {"path": "/mutate--v1-pod"}
2024-11-21T20:56:38.037Z    INFO    controller-runtime.builder  builder/webhook.go:204  skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called {"GVK": "/v1, Kind=Pod"}
2024-11-21T20:56:38.037Z    INFO    webhook/start.go:320    Starting manager
2024-11-21T20:56:38.038Z    INFO    controller-runtime.metrics  server/server.go:205    Starting metrics server
2024-11-21T20:56:38.038Z    INFO    controller-runtime.metrics  server/server.go:244    Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-11-21T20:56:38.039Z    INFO    manager/server.go:50    starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-11-21T20:56:38.039Z    INFO    controller-runtime.webhook  webhook/server.go:191   Starting webhook server
2024-11-21T20:56:38.039Z    INFO    webhook/start.go:358    disabling http/2
2024-11-21T20:56:38.039Z    INFO    controller-runtime.certwatcher  certwatcher/certwatcher.go:161  Updated current TLS certificate
2024-11-21T20:56:38.040Z    INFO    controller-runtime.certwatcher  certwatcher/certwatcher.go:115  Starting certificate watcher
2024-11-21T20:56:38.040Z    INFO    manager/internal.go:534 Stopping and waiting for non leader election runnables
2024-11-21T20:56:38.040Z    INFO    manager/internal.go:538 Stopping and waiting for leader election runnables
2024-11-21T20:56:38.040Z    INFO    manager/internal.go:546 Stopping and waiting for caches
2024-11-21T20:56:38.040Z    INFO    manager/internal.go:550 Stopping and waiting for webhooks
2024-11-21T20:56:38.040Z    INFO    manager/internal.go:553 Stopping and waiting for HTTP servers
I1121 20:56:38.040581      10 leaderelection.go:250] attempting to acquire leader lease so350/spark-operator-webhook-lock...
2024-11-21T20:56:38.041Z    INFO    manager/server.go:43    shutting down server    {"kind": "health probe", "addr": "[::]:8081"}
2024-11-21T20:56:38.041Z    INFO    controller-runtime.metrics  server/server.go:251    Shutting down metrics server with timeout of 1 minute
2024-11-21T20:56:38.041Z    INFO    manager/internal.go:557 Wait completed, proceeding to shutdown the manager
E1121 20:56:38.041688      10 leaderelection.go:332] error retrieving resource lock so350/spark-operator-webhook-lock: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/so350/leases/spark-operator-webhook-lock": context canceled
2024-11-21T20:56:38.041Z    ERROR   webhook/start.go:322    Failed to start manager {"error": "listen tcp :443: bind: permission denied"}
github.com/kubeflow/spark-operator/cmd/operator/webhook.start
    /workspace/cmd/operator/webhook/start.go:322
github.com/kubeflow/spark-operator/cmd/operator/webhook.NewStartCommand.func2
    /workspace/cmd/operator/webhook/start.go:128
github.com/spf13/cobra.(*Command).execute
    /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:989
github.com/spf13/cobra.(*Command).ExecuteC
    /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117
github.com/spf13/cobra.(*Command).Execute
    /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041
main.main
    /workspace/cmd/main.go:27
runtime.main
    /usr/local/go/src/runtime/proc.go:272

Please note: I had installed v2.0.0-rc.0 and it was working fine. However, I am running into issues with v2.0.2.

Please help with this.

thanks!

Reproduction Code

No response

Expected behavior

No response

Actual behavior

No response

Environment & Versions

Additional context

No response

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

ChenYi015 commented 14 hours ago

@karanalang Please use a non-privileged webhook port (the default is 9443) if possible. Otherwise you will need to run as root or modify the security context, because we have removed all capabilities to enhance container security.
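
For reference, a minimal sketch of that suggestion. It assumes the release name, namespace, and image tag from the original report and the 9443 default mentioned above. The remaining --set flags from the report are omitted for brevity, and this has not been run against a cluster. WEBHOOK_PORT is a variable name chosen here for illustration.

```shell
# Unprivileged port suggested in the comment above (stated there as the default).
WEBHOOK_PORT=9443

# Print the command for review; drop the leading "echo" to actually run it.
echo helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --create-namespace \
  --set image.tag=2.0.2 \
  --set webhook.enable=true \
  --set webhook.port="${WEBHOOK_PORT}"
```

On Linux, ports below 1024 are privileged by default, so binding 9443 should need neither root nor NET_BIND_SERVICE.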

jacobsalway commented 3 hours ago

Worth noting: I think you want webhook.securityContext rather than webhook.containerSecurityContext. Once I changed that, I was able to run successfully on Kind with your Helm values.

https://github.com/kubeflow/spark-operator/blob/master/charts/spark-operator-chart/templates/webhook/deployment.yaml#L113-L116
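
A minimal sketch of what that change could look like, assuming the three security settings from the original report are simply moved under webhook.securityContext. The file name webhook-values.yaml is arbitrary, the other flags from the report are omitted, and the thread does not establish which of the three settings are actually required.

```shell
# Sketch only: the settings from the report, moved under webhook.securityContext.
cat > webhook-values.yaml <<'EOF'
webhook:
  enable: true
  port: 443
  securityContext:
    # Carried over unchanged from the report; a narrower set may be enough.
    privileged: true
    runAsUser: 0
    capabilities:
      add:
        - NET_BIND_SERVICE
EOF

# Print the command for review; drop the leading "echo" to actually run it.
echo helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --create-namespace \
  --set image.tag=2.0.2 \
  -f webhook-values.yaml
```

The webhook log in the report shows uid=185 even though runAsUser=0 was passed, which is consistent with the webhook.containerSecurityContext values never reaching the pod.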