kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0
2.81k stars · 1.38k forks

[BUG] spark-operator v1beta2-1.4.2-3.5.0 install with helm timeout #2035

Closed · Jay-boo closed this issue 2 months ago

Jay-boo commented 6 months ago

Description


Reproduction Code [Required]

I encountered the problem while using it in GitHub CI/CD with this job:

  create-cluster:
    runs-on: ubuntu-latest
    steps:

      - name: Checkout current branch (full)
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Create kind cluster
        uses: helm/kind-action@v1
        with:
          config: ./kind/k8s_config/kind-config.yaml

      - name: Helm install
        run: |
          helm repo add spark-operator https://kubeflow.github.io/spark-operator
          helm search repo spark-operator
          helm repo update
          helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace --set webhook.enable=true --debug
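
Based on the workaround reported further down in this thread (assumption: the pre-install webhook-init hook in chart 1.3.0 is what hangs, and chart 1.2.7 predates it), a pinned-version variant of the install step might look like this. The `--version` pin and the raised `--timeout` are the only changes from the step above:

```yaml
      # Hypothetical variant of the step above: pin the chart to the
      # known-good 1.2.7 release and raise Helm's default 5m wait,
      # since the pre-install webhook-init Job is what times out.
      - name: Helm install (pinned)
        run: |
          helm repo add spark-operator https://kubeflow.github.io/spark-operator
          helm repo update
          helm install my-release spark-operator/spark-operator \
            --namespace spark-operator --create-namespace \
            --set webhook.enable=true \
            --version 1.2.7 \
            --timeout 10m --debug
```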

Steps to reproduce the behavior:

Expected behavior

Successful spark-operator install

Actual behavior

Installation times out after 5 minutes.

Terminal Output Screenshot(s)

  helm repo add spark-operator https://kubeflow.github.io/spark-operator
  helm search repo spark-operator
  helm repo update
  helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace --set webhook.enable=true --debug
  shell: /usr/bin/bash -e {0}
"spark-operator" has been added to your repositories
NAME                            CHART VERSION   APP VERSION         DESCRIPTION                                  
spark-operator/spark-operator   1.3.0           v1beta2-1.4.2-3.5.0  A Helm chart for Spark on Kubernetes operator
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "spark-operator" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
install.go:218: [debug] Original chart version: ""
install.go:235: [debug] CHART PATH: /home/runner/.cache/helm/repository/spark-operator-1.3.0.tgz
client.go:142: [debug] creating 1 resource(s)
client.go:142: [debug] creating 1 resource(s)
wait.go:48: [debug] beginning wait for 2 resources with timeout of 1m0s
install.go:205: [debug] Clearing REST mapper cache
client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "my-release-spark-operator" ServiceAccount
client.go:490: [debug] Ignoring delete failure for "my-release-spark-operator" /v1, Kind=ServiceAccount: serviceaccounts "my-release-spark-operator" not found
wait.go:66: [debug] beginning wait for 1 resources to be deleted with timeout of 5m0s
client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "my-release-spark-operator" ClusterRole
client.go:490: [debug] Ignoring delete failure for "my-release-spark-operator" rbac.authorization.k8s.io/v1, Kind=ClusterRole: clusterroles.rbac.authorization.k8s.io "my-release-spark-operator" not found
wait.go:66: [debug] beginning wait for 1 resources to be deleted with timeout of 5m0s
client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "my-release-spark-operator" ClusterRoleBinding
client.go:490: [debug] Ignoring delete failure for "my-release-spark-operator" rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io "my-release-spark-operator" not found
wait.go:66: [debug] beginning wait for 1 resources to be deleted with timeout of 5m0s
client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "my-release-spark-operator-webhook-init" Job
client.go:490: [debug] Ignoring delete failure for "my-release-spark-operator-webhook-init" batch/v1, Kind=Job: jobs.batch "my-release-spark-operator-webhook-init" not found
wait.go:66: [debug] beginning wait for 1 resources to be deleted with timeout of 5m0s
client.go:142: [debug] creating 1 resource(s)
client.go:712: [debug] Watching for changes to Job my-release-spark-operator-webhook-init with timeout of 5m0s
client.go:740: [debug] Add/Modify event for my-release-spark-operator-webhook-init: ADDED
client.go:779: [debug] my-release-spark-operator-webhook-init: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:740: [debug] Add/Modify event for my-release-spark-operator-webhook-init: MODIFIED
client.go:779: [debug] my-release-spark-operator-webhook-init: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: INSTALLATION FAILED: failed pre-install: 1 error occurred:
    * timed out waiting for the condition
helm.go:84: [debug] failed pre-install: 1 error occurred:
    * timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
    helm.sh/helm/v3/cmd/helm/install.go:158
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.8.0/command.go:1039
main.main
    helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
    runtime/proc.go:267
runtime.goexit
    runtime/asm_amd64.s:1650
Error: Process completed with exit code 1.

Environment & Versions

Additional context

Jay-boo commented 6 months ago

Forced to use Chart version 1.2.7 to make it work

Timoniche commented 6 months ago

Forced to use Chart version 1.2.7 to make it work

Can you please share the exact helm install command here?

I have a similar problem (timeout here too, on a Mac M2).

helm install spark-operator/spark-operator --namespace spark-operator --set sparkJobNamespace=default --set webhook.enable=true --generate-name --debug

UPD: this seems to work:

helm install eee spark-operator/spark-operator --namespace spark-operator --set sparkJobNamespace=default --set webhook.enable=true --debug --version 1.2.7

Timoniche commented 6 months ago

Forced to use Chart version 1.2.7 to make it work

Fun fact: this is the only working version (1.2.5 also times out).

Do you have any problems with 1.2.7? For example, I don't see driver pods being created while running the spark-pi example, though maybe that's because this is my first time touching k8s.

#
# Copyright 2017 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "spark:3.5.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar"
  sparkVersion: "3.5.0"
  sparkUIOptions:
    serviceLabels:
      test-label/v1: 'true'
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.5.0
    serviceAccount: spark-operator-spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.5.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
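
One thing worth checking with the manifest above: the chart derives the Spark job service account name from the Helm release name (an assumption based on the chart's defaults), so `serviceAccount: spark-operator-spark` only exists if the release was literally named `spark-operator`. With the release names used in this thread (`my-release`, `eee`), the generated name would differ, which by itself can prevent driver pods from being created. A sketch of the assumed naming rule:

```shell
# Assumed naming rule: the chart creates a Spark service account
# named "<release-name>-spark". With release "my-release", the
# manifest would need serviceAccount: my-release-spark, not
# spark-operator-spark.
release="my-release"
spark_sa="${release}-spark"
echo "${spark_sa}"   # prints "my-release-spark"
```

If the names don't match, the driver pod is rejected at admission and nothing appears in `kubectl get pods`, which would explain the missing driver pods without any bug in chart 1.2.7 itself.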

ChenYi015 commented 6 months ago

@Jay-boo Fixed in chart v1.3.2 with #2044.