kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[BUG] Spark Operator Lock identity is empty while HA #2063

Open tankim opened 3 weeks ago

tankim commented 3 weeks ago

Description

Terminal Output Screenshot(s)

+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ echo 0
+ echo 0
+ echo root:x:0:0:root:/root:/bin/bash
0
0
root:x:0:0:root:/root:/bin/bash
+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator -v=4 -logtostderr -namespace= -enable-ui-service=true -ingress-url-format= -controller-threads=600 -resync-interval=30 -enable-batch-scheduler=false -label-selector-filter= -enable-metrics=true -metrics-labels=app_type -metrics-port=10254 -metrics-endpoint=/metrics -metrics-prefix= -enable-webhook=true -webhook-svc-namespace=dataplatform-common-dev -webhook-port=8080 -webhook-timeout=30 -webhook-svc-name=spark-operator-webhook -webhook-config-name=spark-operator-webhook-config -webhook-namespace-selector=spark-webhook-enabled=true -enable-resource-quota-enforcement=false -leader-election=true -leader-election-lock-namespace=dataplatform-common-dev -leader-election-lock-name=spark-operator-lock
F0615 02:58:37.044201      10 main.go:146] Lock identity is empty

goroutine 1 [running]:
github.com/golang/glog.Fatal(...)
    /go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:664
main.main()
    /workspace/main.go:146 +0x1418

SIGABRT: abort
PC=0x40708e m=2 sigcode=18446744073709551610

goroutine 1 gp=0xc0000061c0 m=2 mp=0xc000092808 [running, locked to thread]:
runtime/internal/syscall.Syscall6()
    /usr/local/go/src/runtime/internal/syscall/asm_linux_amd64.s:36 +0xe fp=0xc0004cba88 sp=0xc0004cba80 pc=0x40708e
syscall.RawSyscall6(0xc00034e038?, 0xc0006a0120?, 0xc00060c060?, 0x2be5440?, 0x548220?, 0x2be54d8?, 0xc0004cbaf0?)
    /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 +0xd fp=0xc0004cbad0 sp=0xc0004cba88 pc=0x40706d
syscall.RawSyscall(0x2be54d8?, 0x0?, 0xc0004cbb70?, 0xc0004cbb50?)
    /usr/local/go/src/syscall/syscall_linux.go:62 +0x15 fp=0xc0004cbb18 sp=0xc0004cbad0 pc=0x48a8f5
syscall.Tgkill(0xba?, 0x0?, 0x0?)
    /usr/local/go/src/syscall/zsyscall_linux_amd64.go:894 +0x25 fp=0xc0004cbb48 sp=0xc0004cbb18 pc=0x488aa5
github.com/golang/glog.abortProcess()
    /go/pkg/mod/github.com/golang/glog@v1.2.1/glog_file_linux.go:35 +0x87 fp=0xc0004cbb90 sp=0xc0004cbb48 pc=0x548387
github.com/golang/glog.ctxfatalf({0x0?, 0x0?}, 0xc000280110?, {0x1b8f1eb?, 0x411d65?}, {0xc000280110?, 0x185ca80?, 0xc000328201?})
    /go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:647 +0x6a fp=0xc0004cbbf8 sp=0xc0004cbb90 pc=0x54606a
github.com/golang/glog.fatalf(...)
    /go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:657
github.com/golang/glog.FatalDepth(0x1, {0xc000280110, 0x1, 0x1})
    /go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:670 +0x57 fp=0xc0004cbc48 sp=0xc0004cbbf8 pc=0x5461f7
github.com/golang/glog.Fatal(...)
    /go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:664
main.main()
    /workspace/main.go:146 +0x1418 fp=0xc0004cbf50 sp=0xc0004cbc48 pc=0x172f418
runtime.main()
    /usr/local/go/src/runtime/proc.go:271 +0x29d fp=0xc0004cbfe0 sp=0xc0004cbf50 pc=0x4404fd
runtime.goexit({})
    /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0004cbfe8 sp=0xc0004cbfe0 pc=0x473721

yuchaoran2011 commented 3 weeks ago

Honestly, I don't see a need to run multiple replicas for HA purposes. The Kubernetes Deployment controller essentially provides HA out of the box.

tankim commented 3 weeks ago

I fixed this by upgrading to Helm chart version 1.4.0.

tankim commented 3 weeks ago

> Honestly I don't see a need to run multiple replicas for HA purpose. Kubernetes Deployment controller is essentially providing the HA feature out of the box

In our current workload, tens to hundreds of Spark applications are triggered simultaneously, and this number may grow to thousands in the future. If the operator pod becomes unstable during such a burst, we believe an HA setup is necessary to keep operations stable (we are aiming for zero downtime). The exact requirements may vary depending on the specific issues we are currently facing.