kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Sparkapplication stuck forever? #1875

Open AlejandroUPC opened 1 year ago

AlejandroUPC commented 1 year ago

I just had a Spark application that connects to a streaming service and consumes data, but the SparkApplication has been stuck without a state for a long time.

NAME            STATUS   ATTEMPTS   START   FINISH   AGE
**redacted**                                        5m16s

When checking the operator logs, all I see is:

I1108 08:51:43.924523      10 controller.go:184] SparkApplication **redacted**/**redacted** was added, enqueuing it for submission

No pod is being created other than the operator's, and I am completely blind here. How can I debug this?

Thanks

Edit: After a while it crashed, but the error message only shows warnings:

failed to run spark-submit for SparkApplication **redacted**/**redacted**: 
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
https://repo1.maven.org/ added as a remote repository with the name: repo-1
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.microsoft.azure#azure-eventhubs-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-a29b89a6-57af-47bc-a8c0-3e49d685c8f7;1.0
    confs: [default]
    found com.microsoft.azure#azure-eventhubs-spark_2.12;2.3.22 in central
    found com.microsoft.azure#azure-eventhubs;3.3.0 in central
    found org.apache.qpid#proton-j;0.33.8 in central
    found com.microsoft.azure#qpid-proton-j-extensions;1.2.4 in central
    found org.slf4j#slf4j-api;1.7.30 in central
    found com.microsoft.azure#azure-client-authentication;1.7.3 in central
    found com.microsoft.azure#azure-client-runtime;1.7.3 in central
    found com.microsoft.rest#client-runtime;1.7.3 in central
    found com.google.guava#guava;24.1.1-jre in central
    found com.google.code.findbugs#jsr305;1.3.9 in central
    found org.checkerframework#checker-compat-qual;2.0.0 in central
    found com.google.errorprone#error_prone_annotations;2.1.3 in central
    found com.google.j2objc#j2objc-annotations;1.1 in central
    found org.codehaus.mojo#animal-sniffer-annotations;1.14 in central
    found com.squareup.retrofit2#retrofit;2.7.2 in central
    found com.squareup.okhttp3#okhttp;3.12.6 in central
    found com.squareup.okio#okio;1.15.0 in central
    found com.squareup.okhttp3#logging-interceptor;3.12.2 in central
    found com.squareup.okhttp3#okhttp-urlconnection;3.12.2 in central
    found com.squareup.retrofit2#converter-jackson;2.7.2 in central
    found com.fasterxml.jackson.core#jackson-databind;2.10.1 in central
    found com.fasterxml.jackson.core#jackson-annotations;2.10.1 in central
    found com.fasterxml.jackson.core#jackson-core;2.10.1 in central
    found com.fasterxml.jackson.datatype#jackson-datatype-joda;2.10.1 in central
    found joda-time#joda-time;2.9.9 in central
    found org.apache.commons#commons-lang3;3.4 in central
    found io.reactivex#rxjava;1.3.8 in central
    found com.squareup.retrofit2#adapter-rxjava;2.7.2 in central
    found com.microsoft.azure#azure-annotations;1.10.0 in central
    found commons-codec#commons-codec;1.11 in central
    found com.microsoft.azure#adal4j;1.6.4 in central
    found com.nimbusds#oauth2-oidc-sdk;6.5 in central
    found com.sun.mail#javax.mail;1.6.1 in central
    found javax.activation#activation;1.1 in central
    found com.github.stephenc.jcip#jcip-annotations;1.0-1 in central
    found net.minidev#json-smart;2.3 in central
    [2.3] net.minidev#json-smart;[1.3.1,2.3]
    found net.minidev#accessors-smart;1.2 in central
    found org.ow2.asm#asm;5.0.4 in central
    found com.nimbusds#lang-tag;1.7 in central
    [1.7] com.nimbusds#lang-tag;[1.4.3,)
    found com.google.code.gson#gson;2.8.0 in central
    found com.nimbusds#nimbus-jose-jwt;9.8.1 in central
    found org.scala-lang.modules#scala-java8-compat_2.12;0.9.0 in central
:: resolution report :: resolve 44200ms :: artifacts dl 2200ms
    :: modules in use:
    com.fasterxml.jackson.core#jackson-annotations;2.10.1 from central in [default]
    com.fasterxml.jackson.core#jackson-core;2.10.1 from central in [default]
    com.fasterxml.jackson.core#jackson-databind;2.10.1 from central in [default]
    com.fasterxml.jackson.datatype#jackson-datatype-joda;2.10.1 from central in [default]
    com.github.stephenc.jcip#jcip-annotations;1.0-1 from central in [default]
    com.google.code.findbugs#jsr305;1.3.9 from central in [default]
    com.google.code.gson#gson;2.8.0 from central in [default]
    com.google.errorprone#error_prone_annotations;2.1.3 from central in [default]
    com.google.guava#guava;24.1.1-jre from central in [default]
    com.google.j2objc#j2objc-annotations;1.1 from central in [default]
    com.microsoft.azure#adal4j;1.6.4 from central in [default]
    com.microsoft.azure#azure-annotations;1.10.0 from central in [default]
    com.microsoft.azure#azure-client-authentication;1.7.3 from central in [default]
    com.microsoft.azure#azure-client-runtime;1.7.3 from central in [default]
    com.microsoft.azure#azure-eventhubs;3.3.0 from central in [default]
    com.microsoft.azure#azure-eventhubs-spark_2.12;2.3.22 from central in [default]
    com.microsoft.azure#qpid-proton-j-extensions;1.2.4 from central in [default]
    com.microsoft.rest#client-runtime;1.7.3 from central in [default]
    com.nimbusds#lang-tag;1.7 from central in [default]
    com.nimbusds#nimbus-jose-jwt;9.8.1 from central in [default]
    com.nimbusds#oauth2-oidc-sdk;6.5 from central in [default]
    com.squareup.okhttp3#logging-interceptor;3.12.2 from central in [default]
    com.squareup.okhttp3#okhttp;3.12.6 from central in [default]
    com.squareup.okhttp3#okhttp-urlconnection;3.12.2 from central in [default]
    com.squareup.okio#okio;1.15.0 from central in [default]
    com.squareup.retrofit2#adapter-rxjava;2.7.2 from central in [default]
    com.squareup.retrofit2#converter-jackson;2.7.2 from central in [default]
    com.squareup.retrofit2#retrofit;2.7.2 from central in [default]
    com.sun.mail#javax.mail;1.6.1 from central in [default]
    commons-codec#commons-codec;1.11 from central in [default]
    io.reactivex#rxjava;1.3.8 from central in [default]
    javax.activation#activation;1.1 from central in [default]
    joda-time#joda-time;2.9.9 from central in [default]
    net.minidev#accessors-smart;1.2 from central in [default]
    net.minidev#json-smart;2.3 from central in [default]
    org.apache.commons#commons-lang3;3.4 from central in [default]
    org.apache.qpid#proton-j;0.33.8 from central in [default]
    org.checkerframework#checker-compat-qual;2.0.0 from central in [default]
    org.codehaus.mojo#animal-sniffer-annotations;1.14 from central in [default]
    org.ow2.asm#asm;5.0.4 from central in [default]
    org.scala-lang.modules#scala-java8-compat_2.12;0.9.0 from central in [default]
    org.slf4j#slf4j-api;1.7.30 from central in [default]
    :: evicted modules:
    org.slf4j#slf4j-api;1.7.28 by [org.slf4j#slf4j-api;1.7.30] in [default]
    org.slf4j#slf4j-api;1.7.22 by [org.slf4j#slf4j-api;1.7.30] in [default]
    com.nimbusds#nimbus-jose-jwt;[6.0.1,) by [com.nimbusds#nimbus-jose-jwt;9.8.1] in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   45  |   2   |   0   |   3   ||   42  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-a29b89a6-57af-47bc-a8c0-3e49d685c8f7
    confs: [default]
    0 artifacts copied, 42 already retrieved (0kB/600ms)
23/11/08 07:55:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/11/08 07:55:10 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
23/11/08 07:55:19 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
23/11/08 07:55:21 WARN DriverCommandFeatureStep: spark.kubernetes.pyspark.pythonVersion was deprecated in Spark 3.1. Please set 'spark.pyspark.python' and 'spark.pyspark.driver.python' configurations or PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables instead.
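
The output above contains only WARN and INFO lines and no ERROR line, and the Ivy report shows dependency resolution alone taking 44200ms. Below is a minimal sketch, assuming the spark-submit output has been saved as text, that pulls out the Ivy timings and counts the log4j levels so that a missing ERROR is easy to confirm. The function name `summarize_submit_log` and the trimmed sample are illustrative only, and the JVM `WARNING:` lines are deliberately not counted because they do not follow the log4j layout.

```python
import re
from collections import Counter

# Ivy's summary line, e.g. ":: resolution report :: resolve 44200ms :: artifacts dl 2200ms"
IVY_REPORT = re.compile(r"resolution report :: resolve (\d+)ms :: artifacts dl (\d+)ms")
# log4j-style lines, e.g. "23/11/08 07:55:05 WARN NativeCodeLoader: ..."
LOG4J_LINE = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (TRACE|DEBUG|INFO|WARN|ERROR|FATAL) "
)


def summarize_submit_log(text):
    """Return Ivy timings (ms) and a count of log4j levels found in the text."""
    levels = Counter()
    resolve_ms = download_ms = None
    for line in text.splitlines():
        report = IVY_REPORT.search(line)
        if report:
            resolve_ms = int(report.group(1))
            download_ms = int(report.group(2))
            continue
        entry = LOG4J_LINE.match(line)
        if entry:
            levels[entry.group(1)] += 1
    return {"resolve_ms": resolve_ms, "download_ms": download_ms, "levels": dict(levels)}


# Trimmed excerpt of the output quoted in this issue.
sample = """\
WARNING: An illegal reflective access operation has occurred
:: resolution report :: resolve 44200ms :: artifacts dl 2200ms
23/11/08 07:55:05 WARN NativeCodeLoader: Unable to load native-hadoop library
23/11/08 07:55:10 INFO SparkKubernetesClientFactory: Auto-configuring K8S client
23/11/08 07:55:21 WARN DriverCommandFeatureStep: spark.kubernetes.pyspark.pythonVersion was deprecated
"""

summary = summarize_submit_log(sample)
print(summary)
```

If `levels` has no `ERROR` key, the quoted output does not say why the submission failed, and the cause has to be looked for elsewhere.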

JavadHosseini commented 8 months ago

I have the same issue on Kubernetes v28.

voducdan commented 4 months ago

I have faced the same problem. In my case, it seems likely that several SparkApplications with the same name were submitted at the same time. You should check the Spark operator pod's logs for more information.
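
One way to check the operator logs for this is to count how often each application was enqueued. This is a minimal sketch, assuming the operator logs have been saved as text and use the same "was added, enqueuing it for submission" message quoted in the original post. The function name `repeated_submissions` and the application names in the sample are hypothetical. A repeated line is only a hint, since it could also come from an operator restart rather than a duplicate submission.

```python
import re
from collections import Counter

# Matches the operator message quoted in the original post:
# "SparkApplication <namespace>/<name> was added, enqueuing it for submission"
ADDED = re.compile(r"SparkApplication (\S+)/(\S+) was added, enqueuing it for submission")


def repeated_submissions(operator_log):
    """Return {(namespace, name): count} for applications enqueued more than once."""
    counts = Counter()
    for line in operator_log.splitlines():
        match = ADDED.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical log excerpt: job-a is enqueued twice, job-b once.
sample_log = """\
I1108 08:51:43.924523      10 controller.go:184] SparkApplication spark/job-a was added, enqueuing it for submission
I1108 08:51:44.101200      10 controller.go:184] SparkApplication spark/job-b was added, enqueuing it for submission
I1108 08:51:45.330410      10 controller.go:184] SparkApplication spark/job-a was added, enqueuing it for submission
"""

duplicates = repeated_submissions(sample_log)
print(duplicates)
```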

dacort commented 2 months ago

Same here. I don't have detailed log messages (just the first WARN line, which is normal), and it happens only occasionally (e.g. 9 out of 10 times the submission goes through in 5 seconds).

karanalang commented 2 months ago

I'm seeing a similar issue on my setup: the Spark operator crashed and I re-installed spark-operator; after that, the SparkApplication is not showing any status.

In fact, there are no events for this Spark application:

(base) Karans-MacBook-Pro:~ karanalang$ kc get sparkapplication -n spark-operator
NAME                                  STATUS   ATTEMPTS   START   FINISH   AGE
structured-streaming-350-1727739477                                        13m

Btw, this is installed on GKE.
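
An empty STATUS column like this can also be detected programmatically. The sketch below assumes the list has been exported with `kubectl get sparkapplication -n <namespace> -o json` and that the state is reported under `status.applicationState.state`, which, as far as I know, is the field the STATUS column is read from. The function name `apps_without_state` and the sample data are hypothetical.

```python
import json


def apps_without_state(kubectl_json):
    """Return (namespace, name) for every SparkApplication with no reported state."""
    doc = json.loads(kubectl_json)
    stuck = []
    for item in doc.get("items", []):
        # Objects the operator has not processed yet may have no status at all.
        status = item.get("status") or {}
        state = (status.get("applicationState") or {}).get("state", "")
        if not state:
            meta = item["metadata"]
            stuck.append((meta.get("namespace", ""), meta["name"]))
    return stuck


# Hypothetical sample, trimmed to the fields this sketch reads.
sample = json.dumps({
    "items": [
        {"metadata": {"namespace": "spark-operator", "name": "app-without-status"}},
        {
            "metadata": {"namespace": "spark-operator", "name": "app-running"},
            "status": {"applicationState": {"state": "RUNNING"}},
        },
    ]
})

stuck = apps_without_state(sample)
print(stuck)
```

An application that stays in this list well after its creation time was probably never picked up by the operator, which would match the absence of events.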