malexer / pytest-spark

pytest plugin to run tests with support for pyspark
MIT License

Add spark session fixture #1

Closed. clembou closed this pull request 7 years ago

clembou commented 7 years ago

Hi @malexer

This PR adds a new fixture called spark_session that provides a pyspark.sql.SparkSession with hive support enabled.

This appears to be the recommended entry point when using the DataFrame API these days.

I also added a few tests to make sure the new fixture works while I was at it.
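
A minimal sketch of such a fixture, assuming a session-scoped fixture and a local master; the names and settings here are illustrative and not necessarily the PR's exact code:

    # Hypothetical sketch of a spark_session fixture with Hive support enabled.
    # Scope, master and appName are assumptions chosen for illustration.
    import pytest
    from pyspark.sql import SparkSession


    @pytest.fixture(scope='session')
    def spark_session():
        session = (
            SparkSession.builder
            .master('local[2]')
            .appName('pytest-spark-tests')
            .enableHiveSupport()
            .getOrCreate()
        )
        yield session
        # Stop the session once the test session is over.
        session.stop()

A test can then request spark_session and use the DataFrame API directly, e.g. spark_session.createDataFrame([(1, 'a')], ['id', 'label']).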

malexer commented 7 years ago

Hi @clembou

Great feature! Just some minor issues:

  1. Could you apply a few style fixes so the code conforms to PEP 8? flake8 currently flags these files:
    $ flake8 -q
    ./test/test_spark_session_fixture.py
    ./test/test_spark_context_fixture.py
    ./pytest_spark/__init__.py
  2. By using SparkSession we limit the fixture to Spark 2.x only. Maybe we could check the Spark version and yield a HiveContext for versions below 2.x, roughly along the lines of the sketch below. What do you think?
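
A rough sketch of that version-based fallback, assuming the plugin's existing spark_context fixture is available to build on; this is illustrative only and not the code that was ultimately merged (the PR ends up raising an exception on Spark 1.x instead, as discussed below):

    # Illustrative sketch only: yield a HiveContext on Spark 1.x and a
    # SparkSession on Spark 2.x. Not the merged implementation.
    import pytest


    @pytest.fixture(scope='session')
    def spark_session(spark_context):
        # spark_context is assumed to be the plugin's existing SparkContext fixture.
        major = int(spark_context.version.split('.')[0])
        if major >= 2:
            from pyspark.sql import SparkSession
            yield SparkSession.builder.enableHiveSupport().getOrCreate()
        else:
            from pyspark.sql import HiveContext
            yield HiveContext(spark_context)
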
clembou commented 7 years ago

@malexer Sorry, I forgot to run the formatter on the code; it is fixed now!

Good point on Spark <2.x. I added a version check that raises an exception on Spark 1.x.

Unlike SparkSession, SQLContext and HiveContext are easy (and quick) to create from the spark_context object, so I am not sure it is worth providing fixtures for them; a short illustration follows below. If we do want them, I think it would be better to add explicit sql_context and hive_context fixtures rather than overload the spark_session one.
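
For example, a test that needs one of those contexts can build it from the plugin's existing spark_context fixture in a line or two (a quick illustration under that assumption, not code from this PR):

    # Quick illustration: creating SQLContext / HiveContext from the existing
    # spark_context fixture inside a test, instead of via a dedicated fixture.
    from pyspark.sql import HiveContext, SQLContext


    def test_sql_context_from_spark_context(spark_context):
        sql_context = SQLContext(spark_context)
        df = sql_context.createDataFrame([(1, 'a')], ['id', 'label'])
        assert df.count() == 1


    def test_hive_context_from_spark_context(spark_context):
        hive_context = HiveContext(spark_context)
        assert hive_context.tableNames() is not None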

malexer commented 7 years ago

I agree with you; let's leave spark_session as it is now. Thanks for your efforts!

clembou commented 7 years ago

Awesome! Thanks @malexer!