volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Operation: [delete] for kind: [PodGroup] with name: [spark-f0991120cd9940f2ba6aa76783cc6b02-podgroup] in namespace: [default] failed #2835

Open LU1371046 opened 1 year ago

LU1371046 commented 1 year ago

Submitting from the k8s master node with the following command works without problems:

./bin/spark-submit \
  --master k8s://https://172.31.186.86:6443 \
  --deploy-mode cluster \
  --name spark-pi3 \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=artifacts.iflytek.com/docker-private/odeon-k8s/spark:v3.0.0 \
  --conf spark.kubernetes.authenticate.submission.oauthToken=eyJhbGciOiJSUzI1NiIsImtpZCI6InRZXzZvM2VWNnozQ0JRQUVMZktnWW1oT2I2NTd0endYR040UWtKNkp6Nk0ifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJkZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6InNwYXJrLXRva2VuLXFqcWw5Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6InNwYXJrIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYzNmMDAwZTktYjY2Yi00MzExLThhYzktMTYxMjdkZjJmNTMzIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OmRlZmF1bHQ6c3BhcmsifQ.iG7qZHJw2cAiculiCgRCgpMnPwSa9R7_nI9gj_u_Bb_T-nA5mhd2XhqTtjeLiEiz3JUmktDKMLyCYV_PYvpsqFprl7K50N2zO6Eg8hC3wFmDqI8-Kh-mBcmdx3clIfoFmhjvxmItYUc-WX2Iu5ESWtAKevWzxHKVfQ8bMsEJeSV8OrIuOx-5s38qJ4eautlz4WqSljeCnjN6CgnN4qSl4WLfkaYKeHRDk6BHKuCoTsCuf78wypm2rzJgsElARQd3aBKG1JJYv4nSkdlDY3Hp-wOS6SZqOHq9KHoN-EmXEoodlRTLQKD0CUKEjWI1XhewqzvO8YDnOCAj-yOyloUQ \
  --conf spark.kubernetes.scheduler.name=volcano \
  --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar

Submitting from any other node fails with the following error:

23/05/18 15:24:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/18 15:24:16 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
23/05/18 15:24:16 WARN Config: Error reading service account token from: [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
23/05/18 15:24:16 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
23/05/18 15:24:17 WARN Config: Error reading service account token from: [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
23/05/18 15:24:17 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/opt/spark-3.3.0-bin-2.7.4/conf) : spark-env.sh,ranger-spark-security.xml,ranger-spark-audit.xml,log4j2.properties
23/05/18 15:24:17 WARN VersionUsageUtils: The client is using resource type 'podgroups' with unstable version 'v1beta1'
23/05/18 15:24:24 ERROR Client: Please check "kubectl auth can-i create [resource]" first. It should be yes. And please also check your feature step implementation.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [delete] for kind: [PodGroup] with name: [spark-f0991120cd9940f2ba6aa76783cc6b02-podgroup] in namespace: [default] failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:130)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.deleteThis(BaseOperation.java:532)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.delete(BaseOperation.java:445)
    at io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.delete(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableImpl.java:116)
    at io.fabric8.kubernetes.client.dsl.internal.NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.delete(NamespaceVisitFromServerGetWatchDeleteRecreateWaitApplicableListImpl.java:187)
    at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:145)
    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)
    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2765)
    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)
    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
    at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:320)
    at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:284)
    at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:169)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientBuilderImpl$InteceptorAdapter.intercept(OkHttpClientBuilderImpl.java:62)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientBuilderImpl$InteceptorAdapter.intercept(OkHttpClientBuilderImpl.java:62)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientBuilderImpl$InteceptorAdapter.intercept(OkHttpClientBuilderImpl.java:62)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
    at okhttp3.RealCall.execute(RealCall.java:93)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl.send(OkHttpClientImpl.java:138)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.retryWithExponentialBackoff(OperationSupport.java:574)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:553)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleDelete(OperationSupport.java:285)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleDelete(OperationSupport.java:260)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.deleteThis(BaseOperation.java:527)
    ... 16 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
    at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
    at sun.security.validator.Validator.validate(Validator.java:260)
    at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
    at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
    at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
    ... 61 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
    at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
    at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
    at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
    ... 67 more
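
In a trace this long, the top-level message (the failed PodGroup delete) can hide the underlying failure, which sits in the last "Caused by:" entry. The following is a minimal sketch for pulling that innermost exception class out of captured spark-submit output; the function name root_cause is hypothetical, and it assumes a grep that supports -o (GNU and BSD grep both do). Fed the trace above, it should print sun.security.provider.certpath.SunCertPathBuilderException.

```shell
#!/bin/sh
# root_cause (hypothetical helper): read a Java stack trace on stdin and
# print the class name of the innermost "Caused by:" exception, i.e. the
# last one in the chain. Prints nothing if the trace has no "Caused by:".
root_cause() {
  grep -oE 'Caused by: [A-Za-z0-9_.$]+' | tail -n 1 | sed 's/^Caused by: //'
}

# Example usage (not executed here):
#   ./bin/spark-submit ... 2>&1 | root_cause
```

When the innermost cause is a PKIX error like this one, it suggests that the JVM on the submitting machine does not trust the API server's certificate, which is a TLS trust problem rather than an RBAC one. Spark documents the spark.kubernetes.authenticate.submission.caCertFile setting for supplying a CA certificate file at submission time; whether that resolves this particular report is not confirmed in the thread.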

Are the ClusterRole permissions insufficient? The role has no rules for the Volcano API group. Starting from the edit role, I added the following:

wangyang0616 commented 1 year ago

Which version of Spark are you using? Was it downloaded directly from the official website, or recompiled locally with the following command?

./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Pkubernetes -Pvolcano
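
One rough way to tell the two cases apart is to look for Volcano client jars in the distribution's jars directory. The sketch below assumes that a build made with -Pvolcano ships jars whose file names contain "volcano" (for example the fabric8 volcano-client and volcano-model artifacts); the function name has_volcano_jars and the use of SPARK_HOME are illustrative assumptions, not part of Spark itself.

```shell
#!/bin/sh
# has_volcano_jars DIR (hypothetical helper): succeed if DIR contains at
# least one .jar file whose name mentions "volcano". The naming pattern is
# an assumption about what a -Pvolcano build places in its jars directory.
has_volcano_jars() {
  ls "$1" 2>/dev/null | grep -qi 'volcano.*\.jar$'
}

# Example usage, assuming SPARK_HOME points at the unpacked distribution:
#   has_volcano_jars "$SPARK_HOME/jars" && echo "Volcano jars present"
```

A negative result only hints that the build lacks Volcano support; it does not prove it, since the feature step class itself lives in the Kubernetes resource-manager jar.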

LU1371046 commented 1 year ago

spark: 3.3.0, volcano: 1.6.0, k8s: 1.19.9. I compiled Spark from source myself.

stale[bot] commented 1 year ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).