malexer / pytest-spark

pytest plugin to run the tests with support of pyspark
MIT License

using spark_session fixture causes pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder' #14

Closed dockerhub-publics closed 5 years ago

dockerhub-publics commented 5 years ago

I have put the content of my test_sql_query_automation.py exactly as in malexer's test/test_spark_session_fixture.py here in the master branch.

To reproduce this easily, you may want to use the same Docker Hub image that I use: danimages/spark-pytests

And here is what I get:

$ pytest --spark_home=$SPARK_HOME -s -vv test_sql_query_automation.py
============================= test session starts ==============================
platform linux -- Python 3.5.3, pytest-5.0.1, py-1.8.0, pluggy-0.12.0 -- /usr/bin/python3
cachedir: .pytest_cache
spark version -- Spark 2.4.1 built for Hadoop 2.6.5 | Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pkafka-0-8 -Pflume -Phadoop-provided -DzincPort=3038
rootdir: /builds/ber/Aufbau_BI_Platform, inifile: pytest.ini
plugins: spark-0.5.2
collecting ... collected 2 items

test_sql_query_automation.py::test_spark_session_dataframe
2019-07-09 17:26:16,073 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
ERROR
test_sql_query_automation.py::test_spark_session_sql ERROR

==================================== ERRORS ====================================
____________ ERROR at setup of test_spark_session_dataframe ____________

a = ('xro49', <py4j.java_gateway.GatewayClient object at 0x7f4e88700ba8>, 'o47', 'sessionState')
kw = {}
s = "java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
stackTrace = 'org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107...nd.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:238)\n\t at java.lang.Thread.run(Thread.java:748)'

def deco(*a, **kw):
    try:
      return f(*a, **kw)

/usr/spark-2.4.1/python/pyspark/sql/utils.py:63:


answer = 'xro49'
gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f4e88700ba8>
target_id = 'o47'
name = 'sessionState'

def get_return_value(answer, gateway_client, target_id=None, name=None):
    """Converts an answer received from the Java gateway into a Python object.

    For example, string representation of integers are converted to Python
    integer, string representation of objects are converted to JavaObject
    instances, etc.

    :param answer: the string returned by the Java gateway
    :param gateway_client: the gateway client used to communicate with the Java
        Gateway. Only necessary if the answer is a reference (e.g., object,
        list, map)
    :param target_id: the name of the object from which the answer comes from
        (e.g., *object1* in `object1.hello()`). Optional.
    :param name: the name of the member from which the answer comes from
        (e.g., *hello* in `object1.hello()`). Optional.
    """
    if is_error(answer)[0]:
        if len(answer) > 1:
            type = answer[1]
            value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
            if answer[1] == REFERENCE_TYPE:
                raise Py4JJavaError(
                    "An error occurred while calling {0}{1}{2}.\n".
                    format(target_id, ".", name), value)

E py4j.protocol.Py4JJavaError: An error occurred while calling o47.sessionState.
E : java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
E   at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107)
E   at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:145)
E   at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:144)
E   at scala.Option.getOrElse(Option.scala:121)
E   at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
E   at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141)
E   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E   at java.lang.reflect.Method.invoke(Method.java:498)
E   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E   at py4j.Gateway.invoke(Gateway.java:282)
E   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E   at py4j.commands.CallCommand.execute(CallCommand.java:79)
E   at py4j.GatewayConnection.run(GatewayConnection.java:238)
E   at java.lang.Thread.run(Thread.java:748)
E Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.HiveSessionStateBuilder
E   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
E   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
E   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
E   at java.lang.Class.forName0(Native Method)
E   at java.lang.Class.forName(Class.java:348)
E   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
E   at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1102)
E ... 16 more

/usr/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py:328: Py4JJavaError

During handling of the above exception, another exception occurred:

@pytest.fixture(scope='session')
def _spark_session():
    """Internal fixture for SparkSession instance.

    Yields SparkSession instance if it is supported by the pyspark
    version, otherwise yields None.

    Required to correctly initialize `spark_context` fixture after
    `spark_session` fixture.

    ..note::
        It is not possible to create SparkSession from the existing
        SparkContext.
    """

    try:
        from pyspark.sql import SparkSession
    except ImportError:
        yield
    else:
        session = SparkSession.builder \
            .config(conf=SparkConfigBuilder().get()) \
            .enableHiveSupport() \
            .getOrCreate()

/usr/local/lib/python3.5/dist-packages/pytest_spark/fixtures.py:28:


/usr/spark-2.4.1/python/pyspark/sql/session.py:183: in getOrCreate
    session._jsparkSession.sessionState().conf().setConfString(key, value)
/usr/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:1257: in __call__
    answer, self.gateway_client, self.target_id, self.name)


a = ('xro49', <py4j.java_gateway.GatewayClient object at 0x7f4e88700ba8>, 'o47', 'sessionState') kw = {} s = "java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':" stackTrace = 'org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107...nd.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:238)\n\t at java.lang.Thread.run(Thread.java:748)'

def deco(*a, **kw):
    try:
        return f(*a, **kw)
    except py4j.protocol.Py4JJavaError as e:
        s = e.java_exception.toString()
        stackTrace = '\n\t at '.join(map(lambda x: x.toString(),
                                         e.java_exception.getStackTrace()))
        if s.startswith('org.apache.spark.sql.AnalysisException: '):
            raise AnalysisException(s.split(': ', 1)[1], stackTrace)
        if s.startswith('org.apache.spark.sql.catalyst.analysis'):
            raise AnalysisException(s.split(': ', 1)[1], stackTrace)
        if s.startswith('org.apache.spark.sql.catalyst.parser.ParseException: '):
            raise ParseException(s.split(': ', 1)[1], stackTrace)
        if s.startswith('org.apache.spark.sql.streaming.StreamingQueryException: '):
            raise StreamingQueryException(s.split(': ', 1)[1], stackTrace)
        if s.startswith('org.apache.spark.sql.execution.QueryExecutionException: '):
            raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
        if s.startswith('java.lang.IllegalArgumentException: '):
            raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)

E pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"

/usr/spark-2.4.1/python/pyspark/sql/utils.py:79: IllegalArgumentException
____________ ERROR at setup of test_spark_session_sql ____________

(The traceback for test_spark_session_sql is identical to the one above.)
/usr/spark-2.4.1/python/pyspark/sql/utils.py:79: IllegalArgumentException
=============================== warnings summary ===============================
/usr/spark-2.4.1/python/pyspark/cloudpickle.py:47
  /usr/spark-2.4.1/python/pyspark/cloudpickle.py:47: PendingDeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

-- Docs: https://docs.pytest.org/en/latest/warnings.html
===================== 1 warnings, 2 error in 4.82 seconds ======================

malexer commented 5 years ago

I believe you are missing the spark-hive jar. Try adding something like this to your image:

ADD https://repo1.maven.org/maven2/org/apache/spark/spark-hive_2.11/2.4.1/spark-hive_2.11-2.4.1.jar /usr/hadoop-3.0.0/share/hadoop/common/lib/
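As a quick sanity check before rerunning the tests, one could verify that a spark-hive jar is actually visible on disk. This is only an illustrative sketch, not part of pytest-spark; the `jars/` directory layout is an assumption that holds for standard Spark distributions (a "hadoop-provided" build like the one above may keep it elsewhere, e.g. under Hadoop's common/lib):

```python
import glob
import os


def hive_jar_present(spark_home):
    """Return True if any spark-hive jar is found under SPARK_HOME/jars."""
    pattern = os.path.join(spark_home, "jars", "spark-hive_*.jar")
    return bool(glob.glob(pattern))


if __name__ == "__main__":
    # SPARK_HOME fallback path here is just the one from the log above
    spark_home = os.environ.get("SPARK_HOME", "/usr/spark-2.4.1")
    print("spark-hive jar found:", hive_jar_present(spark_home))
```

If this prints `False` (and the jar is not on the classpath some other way), `enableHiveSupport()` will fail with exactly the `ClassNotFoundException: org.apache.spark.sql.hive.HiveSessionStateBuilder` seen in the traceback.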
malexer commented 5 years ago

@dockerhub-publics Closing, since this is not a pytest-spark related issue.