kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

kubeflow/spark-operator:2.0.0-rc.0 - installation showing config parameters of v2.0.2 (was working earlier) #2330


karanalang commented 4 days ago

What happened?

I have kubeflow/spark-operator:2.0.0-rc.0 installed on GKE, and it is working fine.

When I create a new Helm install with the same version, it gives me an error.

Here is the Helm install command:

helm install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --set image.tag=2.0.0-rc.0 \
  --create-namespace \
  --set webhook.enable=true \
  --set webhook.port=443 \
  --set webhook.namespaceSelector="spark-webhook-enabled=true" \
  --set logLevel=debug \
  --set enableResourceQuotaEnforcement=true \
  --set webhook.failOnError=true \
  --set controller.resources.limits.cpu=100m \
  --set controller.resources.limits.memory=200Mi \
  --set controller.resources.requests.cpu=50m \
  --set controller.resources.requests.memory=100Mi \
  --set webhook.resources.limits.cpu=100m \
  --set webhook.resources.limits.memory=200Mi \
  --set webhook.resources.requests.cpu=50m \
  --set webhook.resources.requests.memory=100Mi \
  --set "sparkJobNamespaces={spark-apps}" 

Error in the controller pod logs:

+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator controller start --zap-log-level=info --namespaces=default --controller-threads=10 --enable-ui-service=true --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-controller-lock --leader-election-lock-namespace=so350 --workqueue-ratelimiter-bucket-qps=50 --workqueue-ratelimiter-bucket-size=500 --workqueue-ratelimiter-max-delay=6h
Error: unknown flag: --workqueue-ratelimiter-bucket-qps
Usage:
  spark-operator controller start [flags]

Flags:
      --cache-sync-timeout duration                      Informer cache sync timeout. (default 30s)
      --controller-threads int                           Number of worker threads used by the SparkApplication controller. (default 10)
      --enable-batch-scheduler                           Enable batch schedulers.
      --enable-http2                                     If set, HTTP/2 will be enabled for the metrics and webhook servers
      --enable-metrics                                   Enable metrics.
      --enable-ui-service                                Enable Spark Web UI service. (default true)
      --health-probe-bind-address string                 The address the probe endpoint binds to. (default ":8081")
  -h, --help                                             help for start
      --ingress-class-name string                        Set ingressClassName for ingress resources created.
      --ingress-url-format string                        Ingress URL format.
      --kubeconfig string                                Paths to a kubeconfig. Only required if out-of-cluster.
      --leader-election                                  Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
      --leader-election-lease-duration duration          Leader election lease duration. (default 15s)
      --leader-election-lock-name string                 Name of the ConfigMap for leader election. (default "spark-operator-lock")
      --leader-election-lock-namespace string            Namespace in which to create the ConfigMap for leader election. (default "spark-operator")
      --leader-election-renew-deadline duration          Leader election renew deadline. (default 14s)
      --leader-election-retry-period duration            Leader election retry period. (default 4s)
      --metrics-bind-address string                      The address the metric endpoint binds to. Use the port :8080. If not set, it will be 0 in order to disable the metrics server (default "0")
      --metrics-endpoint string                          Metrics endpoint. (default "/metrics")
      --metrics-job-start-latency-buckets float64Slice   Buckets for the job start latency histogram. (default [30.000000,60.000000,90.000000,120.000000,150.000000,180.000000,210.000000,240.000000,270.000000,300.000000])
      --metrics-labels strings                           Labels to be added to the metrics.
      --metrics-prefix string                            Prefix for the metrics.
      --namespaces strings                               The Kubernetes namespace to manage. Will manage custom resource objects of the managed CRD types for the whole cluster if unset.
      --secure-metrics                                   If set the metrics endpoint is served securely
      --zap-devel                                        Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
      --zap-encoder encoder                              Zap log encoding (one of 'json' or 'console')
      --zap-log-level level                              Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity (default )
      --zap-stacktrace-level level                       Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
      --zap-time-encoding time-encoding                  Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.

unknown flag: --workqueue-ratelimiter-bucket-qps

Describing the Deployment:

(base) Karans-MacBook-Pro:~ karanalang$ kc describe deployment.apps/spark-operator-controller -n so350
Name:                   spark-operator-controller
Namespace:              so350
CreationTimestamp:      Thu, 21 Nov 2024 12:39:10 -0800
Labels:                 app.kubernetes.io/component=controller
                        app.kubernetes.io/instance=spark-operator
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=spark-operator
                        app.kubernetes.io/version=2.0.2
                        helm.sh/chart=spark-operator-2.0.2
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: spark-operator
                        meta.helm.sh/release-namespace: so350
Selector:               app.kubernetes.io/component=controller,app.kubernetes.io/instance=spark-operator,app.kubernetes.io/name=spark-operator
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/component=controller
                    app.kubernetes.io/instance=spark-operator
                    app.kubernetes.io/name=spark-operator
  Annotations:      prometheus.io/path: /metrics
                    prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  spark-operator-controller
  Containers:
   spark-operator-controller:
    Image:      docker.io/kubeflow/spark-operator:2.0.0-rc.0
    Port:       8080/TCP
    Host Port:  0/TCP
    Args:
      controller
      start
      --zap-log-level=info
      --namespaces=default
      --controller-threads=10
      --enable-ui-service=true
      --enable-metrics=true
      --metrics-bind-address=:8080
      --metrics-endpoint=/metrics
      --metrics-prefix=
      --metrics-labels=app_type
      --leader-election=true
      --leader-election-lock-name=spark-operator-controller-lock
      --leader-election-lock-namespace=so350
      --workqueue-ratelimiter-bucket-qps=50
      --workqueue-ratelimiter-bucket-size=500
      --workqueue-ratelimiter-max-delay=6h
    Limits:
      cpu:     100m
      memory:  200Mi
    Requests:
      cpu:         50m
      memory:      100Mi
    Liveness:      http-get http://:8081/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:     http-get http://:8081/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  <none>
  Tolerations:     <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  <none>
NewReplicaSet:   spark-operator-controller-56789bb775 (1/1 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  2m42s  deployment-controller  Scaled up replica set spark-operator-controller-56789bb775 to 1

It seems chart version 2.0.2 is being installed, and that version adds the parameters:

--workqueue-ratelimiter-bucket-qps=50
--workqueue-ratelimiter-bucket-size=500
--workqueue-ratelimiter-max-delay=6h

If I remove these flags manually from the Deployment, the controller pod starts up.
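The Deployment labels above point at the likely mismatch: `helm.sh/chart=spark-operator-2.0.2` alongside `Image: docker.io/kubeflow/spark-operator:2.0.0-rc.0`. `--set image.tag=2.0.0-rc.0` only overrides the container image, while the chart templates (and hence the controller args) come from the latest published chart, 2.0.2. A quick way to confirm which chart version Helm actually installed (release name and namespace taken from the commands above):

```shell
# List the chart versions available in the repo (run `helm repo update` first)
helm search repo spark-operator/spark-operator --versions | head

# Show the chart version actually installed for this release; the CHART
# column will read spark-operator-2.0.2 even though the image tag was
# pinned to 2.0.0-rc.0
helm list -n so350
```

Both commands need the `spark-operator` Helm repo added and (for `helm list`) access to the cluster where the release lives.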

Can you please take a look and let me know why this is happening?

Thanks!

Reproduction Code

No response

Expected behavior

No response

Actual behavior

No response

Environment & Versions

Additional context

No response


ChenYi015 commented 3 days ago

@karanalang Try this:

helm repo update

helm install spark-operator spark-operator/spark-operator \
  --namespace so350 \
  --create-namespace \
  --version=2.0.0-rc.0 \
  --set controller.logLevel=debug \
  --set controller.resources.limits.cpu=100m \
  --set controller.resources.limits.memory=200Mi \
  --set controller.resources.requests.cpu=50m \
  --set controller.resources.requests.memory=100Mi \
  --set webhook.enable=true \
  --set webhook.logLevel=debug \
  --set webhook.port=9443 \
  --set webhook.failurePolicy=Fail \
  --set webhook.resources.limits.cpu=100m \
  --set webhook.resources.limits.memory=200Mi \
  --set webhook.resources.requests.cpu=50m \
  --set webhook.resources.requests.memory=100Mi \
  --set webhook.resourceQuotaEnforcement.enable=true \
  --set "spark.jobNamespaces={spark-apps}"
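My reading of why this fixes it (not stated in the reply itself): `--version=2.0.0-rc.0` pins the chart itself, so Helm renders the 2.0.0-rc.0 Deployment templates and the controller args match the flags that the 2.0.0-rc.0 binary understands; the `--set` keys are also updated to the 2.x values layout (e.g. `spark.jobNamespaces` instead of `sparkJobNamespaces`). After installing, a quick check that chart version and args line up (assuming the same release name and namespace as above):

```shell
# Chart version installed for the release
helm list -n so350

# Args actually rendered into the controller Deployment; the
# --workqueue-ratelimiter-* flags should no longer appear
kubectl get deployment spark-operator-controller -n so350 \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```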