kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0
2.77k stars · 1.38k forks

Volume mounting not happening on executor -- AWS #1114

Open glitch-k8s opened 3 years ago

glitch-k8s commented 3 years ago

Hello,

I am running the Spark operator on AWS, and somehow the EFS volume is not getting mounted on the executors, while it mounts perfectly fine on the driver.

I am really stuck on this. Any help or pointers would be appreciated.

Regards, Nishant

shardulsrivastava commented 3 years ago

@nishantsh77 Could you share your SparkApplication manifest?

glitch-k8s commented 3 years ago

@shardulsrivastava Here is the Spark manifest file. I redacted some data; hope that's fine.

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: test-aws
  namespace: test
  labels:
    name: ...
spec:
  type: Java
  mode: cluster
  image: "..."
  imagePullSecrets:
```

Please let me know if any more data points are required.

glitch-k8s commented 3 years ago

I am facing this problem on AWS only; in my non-cloud infra it works fine.

glitch-k8s commented 3 years ago

spark operator version :

@yuchaoran2011 Please help, or shall I migrate to the latest version of the Spark operator?

yuchaoran2011 commented 3 years ago

@nishantsh77 I'm not working on a Spark-related project at the moment, so unfortunately I'm not able to help you look into the issue. Do upgrade to the latest operator and see if the problem persists.

glitch-k8s commented 3 years ago

@yuchaoran2011 Thanks. I even tried the latest version of the Spark operator, but it's still not working on AWS EKS.

glitch-k8s commented 3 years ago

@liyinan926 I was wondering if you can help on this.

glitch-k8s commented 3 years ago

Getting an exception in the Spark operator during the volume mount:

```
I1214 17:00:00.494155       9 webhook.go:246] Serving admission request
2020/12/14 17:00:00 http: panic serving 10.1.22.119:57744: runtime error: index out of range [1] with length 1
goroutine 149 [running]:
net/http.(*conn).serve.func1(0xc000256000)
	/usr/local/go/src/net/http/server.go:1772 +0x139
panic(0x137a660, 0xc0006234c0)
	/usr/local/go/src/runtime/panic.go:973 +0x396
github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook.addVolumeMount(0xc000a80a80, 0xc000503090, 0xf, 0x0, 0xc000503080, 0xb, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook/patch.go:177 +0x57f
github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook.addVolumes(0xc000a80a80, 0xc0006de000, 0x142ebf9, 0xa, 0xc0002e9ba8)
	/go/src/github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook/patch.go:144 +0x570
github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook.patchSparkPod(0xc000a80a80, 0xc0006de000, 0x17, 0xc0006de000, 0x0)
	/go/src/github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook/patch.go:52 +0xc5
github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook.mutatePods(0xc0004b13b0, 0x160b060, 0xc00032afe0, 0x7fff47d5ea51, 0xa, 0x160c4e0, 0xc0004b13b0, 0x160c4e0)
	/go/src/github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook/webhook.go:554 +0x591
github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook.(*WebHook).serve(0xc00047e000, 0x1625460, 0xc000a62700, 0xc0006c2300)
	/go/src/github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/webhook/webhook.go:278 +0xb1a
net/http.HandlerFunc.ServeHTTP(0xc00032aff0, 0x1625460, 0xc000a62700, 0xc0006c2300)
	/usr/local/go/src/net/http/server.go:2012 +0x44
net/http.(*ServeMux).ServeHTTP(0xc0000bce00, 0x1625460, 0xc000a62700, 0xc0006c2300)
	/usr/local/go/src/net/http/server.go:2387 +0x1a5
net/http.serverHandler.ServeHTTP(0xc0002f0380, 0x1625460, 0xc000a62700, 0xc0006c2300)
	/usr/local/go/src/net/http/server.go:2807 +0xa3
net/http.(*conn).serve(0xc000256000, 0x162b7e0, 0xc0001a3a80)
	/usr/local/go/src/net/http/server.go:1895 +0x86c
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2933 +0x35c
I1214 17:00:00.505935       9 spark_pod_eventhandler.go:47] Pod test-1607965184791-exec-1 added in namespace aws-test.
```

liyinan926 commented 3 years ago

So you are trying to mount the same volume into both the driver and executor pods?

glitch-k8s commented 3 years ago

@liyinan926 Yes, I need to access the same path from both the driver and the executor(s). If any more details are required, please let me know.

Observation:

  1. When I run the same configuration in my non-cloud DC (bare-metal CentOS 7, a 3-VM cluster with an NFS setup), everything works fine. But it fails on AWS (EKS) using EFS.
  2. On AWS: if I trigger a runtime exception in the driver (catch that exception and go into a long Sleep(...)), then the volume does mount in the executor.

Kindly guide me on how to resolve this. Is it an AWS-specific issue?
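For reference, mounting one volume into both the driver and the executors is expressed through `spec.volumes` plus per-role `volumeMounts` in the SparkApplication spec. This is only a sketch; the names (`efs-data`, `efs-claim`, `/mnt/efs`) are illustrative placeholders, not values from this thread:

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: test-aws
  namespace: test
spec:
  type: Java
  mode: cluster
  # One shared volume, referenced by name from both pod roles below.
  volumes:
    - name: efs-data
      persistentVolumeClaim:
        claimName: efs-claim   # hypothetical PVC backed by EFS
  driver:
    volumeMounts:
      - name: efs-data
        mountPath: /mnt/efs
  executor:
    volumeMounts:
      - name: efs-data
        mountPath: /mnt/efs
```

Note that the operator applies these mounts to the pods via its mutating admission webhook, so the webhook must be running and enabled for the mounts to appear on either pod.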

jkleckner commented 3 years ago

Although this isn't an answer to your question, be aware that EFS volumes have a limited IOPS budget depending on the size of the volume. If it is used for more than configuration data, the executors might become constrained by operations to that volume.
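For completeness, an EFS filesystem is typically exposed to the cluster through the AWS EFS CSI driver (`efs.csi.aws.com`) as a `ReadWriteMany` PV/PVC pair. A minimal sketch, assuming the CSI driver is installed; the filesystem ID `fs-0123456789abcdef0` and the names are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi            # required by the API, effectively ignored by EFS
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany          # EFS supports concurrent mounts across pods/nodes
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0   # placeholder EFS filesystem ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
  namespace: test
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
```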

glitch-k8s commented 3 years ago

@jkleckner Thanks for the suggestion. It finally started working a few hours ago; chart version 0.8.2 works now. My observation is that the Spark operator works seamlessly in some environments and creates a lot of problems in others.
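One thing worth checking whenever mounts appear on some pods but not others: the operator injects volumes via its mutating admission webhook, so if the webhook is disabled or broken in a given cluster, `volumeMounts` silently never reach the pods it fails to patch. A hedged Helm values fragment; the exact key name depends on chart version, so verify it against your chart's `values.yaml`:

```yaml
# Older spark-operator charts used a top-level flag:
enableWebhook: true
# Newer kubeflow/spark-operator charts nest it:
webhook:
  enable: true
```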

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.