kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Misleading Exception when Executor fails: io.fabric8.kubernetes.client.KubernetesClientException #900

Open YoavNordmann opened 4 years ago

YoavNordmann commented 4 years ago

I am running SparkOperator v1beta2-1.1.0-2.4.5 on K3s with the webhook turned on.

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4+k3s1", GitCommit:"3eee8ac3a1cf0a216c8a660571329d4bda3bdf77", GitTreeState:"clean", BuildDate:"2020-03-25T16:13:25Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

Running the SparkJob via kubectl apply behaves very differently from extracting the spark-submit command from the sparkoperator log file and running it from the sparkoperator pod. In my case, the executors that had to be started failed immediately with an application exception. The difference is the following: when using the sparkoperator, a strange exception was raised in the driver:

spark-kubernetes-driver 20/05/02 23:23:17 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
spark-kubernetes-driver io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/default/pods. Message: Pod "t13-cassandra-batch-1588461787785-exec-1" is invalid: [metadata.labels: Invalid value: "sparkoperator.k8s.\"io/submission-id\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/submission-id\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/app-name\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/app-name\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/launched-by-spark-operator\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 
'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/launched-by-spark-operator\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9')]. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=metadata.labels, message=Invalid value: "sparkoperator.k8s.\"io/submission-id\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), reason=FieldValueInvalid, additionalProperties={}), StatusCause(field=metadata.labels, message=Invalid value: "sparkoperator.k8s.\"io/submission-id\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9'), reason=FieldValueInvalid, additionalProperties={}), StatusCause(field=metadata.labels, message=Invalid value: "sparkoperator.k8s.\"io/app-name\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), reason=FieldValueInvalid, additionalProperties={}), StatusCause(field=metadata.labels, message=Invalid value: "sparkoperator.k8s.\"io/app-name\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 
'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9'), reason=FieldValueInvalid, additionalProperties={}), StatusCause(field=metadata.labels, message=Invalid value: "sparkoperator.k8s.\"io/launched-by-spark-operator\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), reason=FieldValueInvalid, additionalProperties={}), StatusCause(field=metadata.labels, message=Invalid value: "sparkoperator.k8s.\"io/launched-by-spark-operator\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9'), reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=t13-cassandra-batch-1588461787785-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "t13-cassandra-batch-1588461787785-exec-1" is invalid: [metadata.labels: Invalid value: "sparkoperator.k8s.\"io/submission-id\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/submission-id\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 
'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/app-name\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/app-name\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/launched-by-spark-operator\"": prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9([-a-z0-9*[a-z0-9)?(\.[a-z0-9([-a-z0-9*[a-z0-9)?)*'), metadata.labels: Invalid value: "sparkoperator.k8s.\"io/launched-by-spark-operator\"": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9[-A-Za-z0-9_.*)?[A-Za-z0-9')], metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).

Only about 20 seconds later was the real exception shown in the driver log file.

On the other hand, when I ran the very same spark-submit from the sparkoperator pod, I did not get this strange exception; instead, the real exception was shown in the driver log file right away.

As you can see, this happens only in the executor, not in the driver. After trying to run this manually, I retrieved the "spark-submit" command from the SparkOperator and ran it from the SparkOperator Pod, and the same thing happened as well.

Needless to say, this exception threw me off completely: I spent my time trying to understand why "sparkoperator.k8s.\"io/submission-id\"" was being distorted, believing that this was my problem rather than the actual application exception.
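To illustrate why the API server rejects these label keys, here is a minimal sketch that applies the two label-key patterns Kubernetes uses for validation (the prefix must be a DNS-1123 subdomain, the name part a plain label name). It is not Kubernetes' own code; `LabelKeyCheck` and `problems` are names made up for this example.

```scala
object LabelKeyCheck {
  // Label-key patterns used by Kubernetes validation:
  // the optional prefix is a DNS-1123 subdomain, the name is a label name.
  private val PrefixPattern =
    """[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"""
  private val NamePattern =
    """([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]"""

  private def checkPrefix(prefix: String): List[String] =
    if (prefix.length <= 253 && prefix.matches(PrefixPattern)) Nil
    else List(s"prefix part '$prefix' is not a DNS-1123 subdomain")

  private def checkName(name: String): List[String] =
    if (name.length <= 63 && name.matches(NamePattern)) Nil
    else List(s"name part '$name' is not a valid label name")

  // A label key is either "name" or "prefix/name"; return every rule it breaks.
  def problems(key: String): List[String] =
    key.split("/", -1).toList match {
      case name :: Nil           => checkName(name)
      case prefix :: name :: Nil => checkPrefix(prefix) ++ checkName(name)
      case _                     => List(s"'$key' contains more than one '/'")
    }
}
```

With this sketch, the key `sparkoperator.k8s.io/submission-id` yields no problems, while the quoted variant from the log is split at the "/" into a prefix and a name that each contain a stray quotation mark, so both parts fail. That matches the pair of "prefix part" and "name part" errors reported per label.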

shekarreddy568 commented 3 years ago

Hey,

We are also facing the same issue. Did you solve it by any chance?

YoavNordmann commented 3 years ago

Actually, we did... We are using TypeSafe's Config library, and our configuration is written in conf files. We take all properties which start with "spark" and create key-value pairs in a map to be "fed" into the Spark session on creation.

It turns out that Config handles each key as a path, so whenever there was a "/" in the key, Config would wrap that part of the key in quotation marks to make it "path" compliant. The problem is that this is not how a properties key is handled, and Spark has a fit over it. We added a small function to handle the keys:

def unwrapKey(key: String): String = String.join(".", ConfigUtil.splitPath(key))

Hope this helps.
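For anyone landing here, this is a minimal sketch of how the key handling described above might be wired together. It assumes the `com.typesafe:config` library is on the classpath; `SparkConfKeys` and `sparkSettings` are hypothetical names for illustration, and only `unwrapKey` comes from the comment above.

```scala
import com.typesafe.config.{Config, ConfigFactory, ConfigUtil}

object SparkConfKeys {
  // Config.entrySet() returns keys rendered as path expressions, so an element
  // such as io/app-name comes back wrapped in quotation marks. Splitting the
  // path and re-joining with "." restores the plain properties-style key.
  def unwrapKey(key: String): String =
    String.join(".", ConfigUtil.splitPath(key))

  // Collect every setting whose unwrapped key starts with "spark." as strings.
  def sparkSettings(config: Config): Map[String, String] = {
    val settings = Map.newBuilder[String, String]
    val entries = config.entrySet().iterator()
    while (entries.hasNext) {
      val entry = entries.next()
      val key = unwrapKey(entry.getKey)
      if (key.startsWith("spark."))
        settings += (key -> entry.getValue.unwrapped().toString)
    }
    settings.result()
  }
}
```

The resulting map could then be applied one entry at a time, for example with `SparkSession.builder().config(key, value)`. Unwrapping before the prefix filter is a deliberate choice in this sketch, so the filter always sees the same key that Spark will see.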

carasue commented 3 years ago

We have the same issue here. @YoavNordmann, your suggestion works. Thanks a lot; we spent a lot of time on this. @liyinan926, is this a bug or something? Hope it can be solved.

github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.