malexer / pytest-spark

pytest plugin to run the tests with support of pyspark
MIT License

Suggestions to speed up pytest-spark tests #9

Closed juhoautio closed 5 years ago

juhoautio commented 5 years ago

The Spark session created by pytest-spark is not optimized for small unit tests that only work with small dataframes.

pytest-spark seems to rely on Spark's default settings:

https://github.com/malexer/pytest-spark/blob/0152b555eb532710fd5bd212bd95134f9342e22f/pytest_spark/__init__.py#L101

Those defaults are aimed at larger datasets. IMHO it would make more sense to optimize for speed on small datasets.

We initially used pytest-spark (thanks for that!), but recently switched to creating the Spark session fixture with our own code (kudos to @artem-garmash!).

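    from pyspark.sql import SparkSession

    # Reduce parallelism and turn off compression so that
    # tests on small data run faster.
    spark = (
        SparkSession.builder
        .config('spark.sql.shuffle.partitions', 1)  # one partition per shuffle
        .config('spark.default.parallelism', 1)     # one partition per RDD by default
        .config('spark.rdd.compress', False)        # skip compressing serialized RDD partitions
        .config('spark.shuffle.compress', False)    # skip compressing shuffle output
        .enableHiveSupport()
        .getOrCreate()
    )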

In our case the total test duration went from 7m 38s down to 3m 03s thanks to this change.
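For anyone who wants to try the same thing, here is a minimal sketch of how such a session could be wrapped in a session-scoped pytest fixture in a `conftest.py`. The names `SMALL_DATA_CONF`, `build_test_session` and the fixture name `spark` are made up for illustration; this is not pytest-spark's code and not necessarily how our own fixture looks.

```python
import pytest

# Hypothetical name: the same four settings as in the snippet above,
# collected in one place so they are easy to tweak.
SMALL_DATA_CONF = {
    "spark.sql.shuffle.partitions": "1",
    "spark.default.parallelism": "1",
    "spark.rdd.compress": "false",
    "spark.shuffle.compress": "false",
}


def build_test_session(conf=None):
    """Build a SparkSession with the given settings (hypothetical helper)."""
    # pyspark is imported here so that merely importing this module
    # does not require pyspark to be loaded.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in (conf or SMALL_DATA_CONF).items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


@pytest.fixture(scope="session")
def spark():
    """Create one SparkSession for the whole test run, then stop it."""
    session = build_test_session()
    yield session
    session.stop()
```

With `scope="session"` the JVM and the session are started once per test run instead of once per test, which is usually where most of the startup cost goes. A test then just takes `spark` as an argument.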

juhoautio commented 5 years ago

Also relates to https://github.com/malexer/pytest-spark/issues/8. To me it seems that ideally pytest-spark would offer defaults that are optimized for small datasets and also allow customizing any Spark config value.

malexer commented 5 years ago

Added in version 0.5.1, already on PyPI.

gladykov commented 2 years ago

I came here from Google, trying to speed up our pytest tests in GitHub Actions, and it was a fruitful day. We used the Spark defaults described at https://pypi.org/project/pytest-spark/ and reduced the time from 6 minutes to 3 minutes.

Then we tweaked it further:

  1. Added the pytest-parallel plugin and passed --workers=2: 10 seconds faster.
  2. To utilise it fully, added .set("spark.executor.cores", 2) so that 2 cores would handle the 2 threads: another 10 seconds faster.