kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

volumeMounts not working in Minikube #1179

Open jyyoo0530 opened 3 years ago

jyyoo0530 commented 3 years ago

Hi,

The volumes are mounted into the Minikube VM under hostPath, but the Pod does not mount the volume from that hostPath.

I guess the problem is related to a CRD issue, because if I create a sample mount with a plain Kubernetes object of kind "Pod" using an nginx image, it works fine. But when the object kind is a CRD, it does not seem to work (the same thing happened while applying gaffer-hdfs).

Please kindly advise if there is a way to solve this problem.
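
For reference, a minimal sketch of the kind of hostPath mount being attempted (field names follow the spark-operator user guide's volume-mounting example; the application name, volume name and paths here are illustrative):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-hostpath-example
spec:
  volumes:
    - name: test-volume
      hostPath:
        path: /tmp/spark-data
        type: Directory
  driver:
    volumeMounts:
      - name: test-volume
        mountPath: /tmp/spark-data
  executor:
    volumeMounts:
      - name: test-volume
        mountPath: /tmp/spark-data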

jgoeres commented 3 years ago

I can say that I have been successfully using volume mounts (type hostPath) on Minikube for our dev environments. However, for volume mounts to work at all with the Spark operator, you need to enable the webhook, see

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#mounting-volumes

"Note that the mutating admission webhook is needed to use this feature."

When deploying the operator with Helm, you can enable the webhook with "--set enableWebhook=true" on helm install. There is also a description of how to install the operator with the webhook here (though I never tried that): https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#about-the-mutating-admission-webhook
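
A quick way to confirm that the webhook was actually registered and its backing service exists (a hedged sketch; the exact resource names depend on the Helm release name, so the greps are only there to locate yours):

kubectl get mutatingwebhookconfigurations | grep -i spark
kubectl -n spark-operator get svc,pods | grep -i webhook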

jyyoo0530 commented 3 years ago

Thanks for your review.

Unfortunately, I have already done that.

Below is the Linux command I use to launch the Spark operator:

helm install spark \
  ./helm/sparkoperator \
  --namespace spark-operator \
  --set webhook.enable=true,sparkJobNamespace=spark-apps,logLevel=3

Below is the pod status for spark-operator:

NAMESPACE        NAME                                      READY   STATUS      RESTARTS   AGE
spark-operator   spark-spark-operator-5bb57c7c4b-zttbf     1/1     Running     0          36s
spark-operator   spark-spark-operator-webhook-init-w6ttl   0/1     Completed   0          36s

Below are the spark-operator logs:

++ id -u

Thanks..!

jgoeres commented 3 years ago

Just to clarify, I am merely a user of the Spark operator, so I can only be of limited help here. What I did notice, however, is this line in your Spark operator log:

2021/02/27 05:29:01 http: TLS handshake error from 172.17.0.1:20840: remote error: tls: bad certificate

This points to the exact same problem that a number of people are reporting, see e.g.,

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1004
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1168

I am also observing this, but on EKS (after EKS update from 1.15 to 1.18). I tried, but couldn't reproduce it on minikube (even when using the exact same K8s version 1.18.9 that is used with EKS 1.18).

Which versions of the Spark operator, Minikube and Kubernetes are you using? How are you deploying the operator, using the YAML manifests or the Helm chart? Are you using a GitOps tool like ArgoCD?

jgoeres commented 3 years ago

I have been investigating the whole topic. I do not have full knowledge of what the Spark operator does internally, so there is a bit of guesswork in places, but for me it starts to make sense.

What seems to have broken the Spark operator, or rather its webhook, for us was our use of the Helm chart for deploying the operator. That chart comes with two jobs. The init job runs a script that creates a CA and uses that CA to produce a server key and certificate. The CA cert, CA key, server key and server cert are then put into a Kubernetes secret, which is later mounted into the operator pod. The server key and cert are used by the operator's webhook endpoint for TLS (because the Kubernetes API server will only talk to webhooks over TLS). In addition, when the operator starts up, it registers itself as a webhook with Kubernetes and provides the CA cert as part of the webhook registration, so that the API server trusts the cert presented by the webhook endpoint (otherwise the API server would simply reject the certificate, because it is not signed by a CA it trusts). So far so good.

On top of that, the Helm chart installs a cleanup job, which is run on helm upgrade and helm delete. This is done by adding a Helm-specific hook annotation to it:

[...]
"helm.sh/hook": pre-delete, pre-upgrade
[...]

So whenever you do a helm upgrade or helm delete on the spark-operator Helm release, the cleanup job runs and deletes the secret with the certs and keys, as well as the old (completed) init job. In the case of helm upgrade, the init job is re-run, so a new set of certs and keys is generated and put into the secret. However, the operator pod is NOT restarted (from Helm's perspective, it did not change). And here is where the guesswork really starts: the already running operator seems to start serving the new server cert on its TLS endpoint immediately, but it does not refresh the registered webhook (and its CA cert) with Kubernetes. The API server therefore still holds the old CA cert and does not trust the new certificate presented to it during the TLS handshake => bad certificate error.
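
If you want to verify that this is what is happening, a hedged way to check is to compare fingerprints (the webhook configuration and secret names below are placeholders, so list them first with kubectl get mutatingwebhookconfigurations and kubectl get secrets in the operator's namespace; the ca-cert.pem key name is an assumption based on what the init job's gencerts.sh appears to generate):

kubectl get mutatingwebhookconfiguration <webhook-config-name> \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -fingerprint
kubectl -n spark-operator get secret <webhook-certs-secret> \
  -o jsonpath='{.data.ca-cert\.pem}' | base64 -d | openssl x509 -noout -fingerprint

If the two fingerprints differ, the API server still trusts the old CA while the operator serves a certificate signed by the new one, which matches the bad certificate errors above.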

Interestingly, I rarely had a reason to run "helm upgrade" with the Spark operator, so I never ran into this in my local dev environment, for which I use Minikube.

On our cloud systems, however, we use the ArgoCD GitOps tooling to install various Kubernetes applications, including the Spark operator, and there we use the same Helm chart. Whenever we run our GitOps pipeline, it triggers a sync of all ArgoCD applications, including the Spark operator application. An ArgoCD sync is semantically similar, but not identical, to a "helm upgrade". In any case, effectively the same chain of events is triggered, leading to a broken webhook. And since we tend to run the GitOps pipeline many times, but never noticed the cleanup job and the re-run of the init job, it took us a while to find the root cause. This also explains why we did not see the problem in our earlier installations: there we had not yet achieved 100% coverage with our GitOps pipeline, and the Spark operator application had been added to ArgoCD manually. So a pipeline re-run would not affect it: no ArgoCD sync, no re-run of the cleanup and init jobs.

Our solution (or rather workaround?) is to simply remove the cleanup job from the Helm chart, so the init job is never re-run and the certificates remain stable. The only drawback I see is that, since Helm did not create the secret and the webhook itself (the init job and the operator did, respectively), this leaves some residual resources behind on helm delete. However, the current set of YAML descriptors does not seem to contain the cleanup job anymore either, so we are probably fine.
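
If editing the chart is not an option, another way to get a similar effect (an untested assumption on my side) would be to skip Helm hooks on upgrade, since the cleanup and init jobs are wired in as hooks:

helm upgrade spark ./helm/sparkoperator --namespace spark-operator --no-hooks

That would leave the existing secret and webhook registration untouched, at the cost of also skipping any other hooks the chart defines.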

Further, when playing around with this to reproduce it on Minikube, I noticed that a "helm upgrade" does not always break the webhook. I am not sure why; I guess there is some kind of race condition involved, or the operator sometimes does not pick up the new certificate.

jyyoo0530 commented 3 years ago

Thanks for your advice. I am approaching this issue based on the related issues you mentioned. I also have only basic knowledge of Kubernetes, which is why I am doing all of this in a local environment (Minikube). In my case, just for your reference, the TLS error appears from the very first run and every time. I hope I can solve the matter once the TLS handshake issue is eliminated!

My minikube environment is, https://github.com/jyyoo0530/mirinae

And the versions of each app are:

sparkoperator: latest
minikube: 1.17.1
kubernetes: 1.20
spark base image: 3.0.0

brettemorris commented 3 years ago

The MutatingWebhook simply does not work. I have tried a clean deployment (e.g. creating an entirely new Kubernetes cluster) via Helm multiple times in a local Minikube environment, and I never see config maps or secrets set as environment variables on the driver pod. Have you even tested the Helm deployment? I am trying this with minikube v1.18.1 on Darwin 11.2.3, which is running Kubernetes v1.20.2 on Docker 20.10.3. I run the following command to deploy the Helm chart:

helm install spark-operator spark-operator/spark-operator --namespace spark --set webhook.enable=true,logLevel=3,sparkJobNamespace=spark,enableMetrics=true

And my operator yaml:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: test-pyspark
  namespace: spark
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: test-pyspark:latest
  imagePullPolicy: Never
  mainApplicationFile: local:///opt/pyspark-apps/main.py
  sparkVersion: "3.0.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  deps:
    jars:

The only way I have found to apply environment variables is to use the deprecated "envVars" option.
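
In case it helps anyone else, this is roughly what that deprecated form looks like in the application spec (a sketch; the variable name and value are made up):

spec:
  driver:
    envVars:
      MY_SETTING: "some-value"
  executor:
    envVars:
      MY_SETTING: "some-value"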

jgoeres commented 3 years ago

@brettemorris Did you have a look at the Spark operator logs? Do you see the "bad certificate" error? Do you see any indication that the webhook is being called? You should see log messages like this whenever a new pod is created; for NON-Spark pods:

I0316 09:41:16.957742      10 webhook.go:246] Serving admission request
I0316 09:41:16.958005      10 webhook.go:540] Pod <podname> in namespace <namespace> is not subject to mutation

For Spark pods it should look like this:

I0316 07:28:39.061448      10 webhook.go:246] Serving admission request
I0316 07:28:39.061755      10 webhook.go:556] Pod <sparkPod> in namespace <namespace> is subject to mutation
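
A quick way to check for these messages (a sketch; adjust the namespace and deployment name to your release):

kubectl -n <operator-namespace> logs deployment/<spark-operator-deployment> | grep -E "Serving admission request|subject to mutation"
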
brettemorris commented 3 years ago

@jgoeres I see the following in the operator logs:

I0317 13:31:22.606617      11 webhook.go:218] Starting the Spark admission webhook server
I0317 13:31:22.614446      11 webhook.go:412] Creating a MutatingWebhookConfiguration for the Spark pod admission webhook
I0317 13:31:22.619105      11 main.go:218] Starting application controller goroutines

And a little bit further down in the logs I see:

2021/03/17 13:32:46 http: TLS handshake error from 172.17.0.1:16203: remote error: tls: bad certificate

I do not, however, find "Serving admission request" anywhere in the operator logs. As I mentioned before, I've deleted my minikube cluster and started from scratch a few times and the result is always the same. Any ideas you have for getting around the problem are greatly appreciated. Thank you!

jgoeres commented 3 years ago

Above I describe what was causing the problems for us (init & cleanup jobs running repeatedly due to our GitOps pipeline and deficits in ArgoCD), see https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1179#issuecomment-788796683.

Which version of the Spark operator are you using? There was an issue with how the CA certificate was created (see https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/1058); that issue has been patched for 3.0.0 (IIRC), while we are still on 2.4.5. It originally did not cause any noticeable problems for us (IIRC, it only becomes a problem on more recent Kubernetes versions), but just yesterday we observed it on AKS 1.19. Our solution is to "port" the patch for 3.0.0 (which only affects the gencerts.sh script) to our version.

We just took the patched gencerts.sh and put it into our 2.4.5 operator image like this:

FROM <ourRegistry>/spark-operator/spark-operator:v1beta2-1.1.0-2.4.5
ADD gencerts.sh /usr/bin/gencerts.sh
RUN chmod +x /usr/bin/gencerts.sh
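
A usage sketch for completeness (registry and tag are placeholders and depend on your setup):

docker build -t <ourRegistry>/spark-operator/spark-operator:v1beta2-1.1.0-2.4.5-patched .
docker push <ourRegistry>/spark-operator/spark-operator:v1beta2-1.1.0-2.4.5-patched

and then point the Helm chart's image repository and tag values at the patched image.
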
brettemorris commented 3 years ago

@jgoeres I am testing with the latest version of the Spark operator and Minikube, and the webhook never works. I have completely recreated my Kubernetes cluster multiple times and retested. My colleague, however, set up the operator without using the Helm charts in our GKE test environment, and he said the webhook works there. Are you using Helm to deploy the operator? It appears I will have to write my own scripts to make it easy for the development team to deploy the operator locally :(.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.