Also relates to https://github.com/malexer/pytest-spark/issues/8. Ideally pytest-spark could offer defaults that are optimized for small datasets and also allow customizing any Spark config value.
Added in version 0.5.1, already on PyPI.
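For anyone finding this later: if the customization referred to here is the `spark_options` setting in `pytest.ini` (which current pytest-spark releases document on PyPI), overriding individual Spark config values looks roughly like this sketch. The specific options and values below are illustrative, chosen for the small-dataset case discussed in this issue:

```ini
[pytest]
spark_options =
    spark.sql.shuffle.partitions: 1
    spark.default.parallelism: 1
    spark.ui.enabled: false
```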
I came here from Google, trying to speed up our pytest tests in GitHub Actions, and it was a fruitful day. We used the Spark defaults from https://pypi.org/project/pytest-spark/ and reduced the run time from 6 minutes to 3 minutes.
Then tweaked more (see the sketch below):

- `--workers=2` - 10 seconds faster
- `.set("spark.executor.cores", 2)` - so 2 cores would handle the 2 worker threads - another 10 seconds faster
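In code form, that second tweak is just a SparkConf override. A minimal sketch, assuming a plain `SparkSession` built in `conftest.py`; the `--workers` flag presumably comes from a pytest parallelization plugin and is not shown here:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Give the executor 2 cores so the 2 parallel test workers don't queue on one core.
conf = SparkConf().set("spark.executor.cores", "2")

spark = SparkSession.builder.master("local[2]").config(conf=conf).getOrCreate()
```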
The Spark session created by pytest-spark is not optimized for small unit tests that only work with small dataframes.
pytest-spark seems to rely on whatever are Spark's default settings:
https://github.com/malexer/pytest-spark/blob/0152b555eb532710fd5bd212bd95134f9342e22f/pytest_spark/__init__.py#L101
Those defaults are aimed at bigger data sets. IMHO it would make more sense to optimize for speed on small datasets.
We initially used pytest-spark (thanks for that!), but recently switched to creating the Spark session fixture with our own code (kudos to @artem-garmash!).
In our case the total test duration went from 7m:38s down to 3m:03s thanks to this change.
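For reference, a custom session-scoped fixture along those lines might look like the sketch below. The option values are assumptions based on the small-data tuning discussed in this issue, not @artem-garmash's actual code:

```python
# conftest.py (sketch)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session():
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        # Tiny dataframes: keep shuffle partitions and parallelism at 1.
        .config("spark.sql.shuffle.partitions", "1")
        .config("spark.default.parallelism", "1")
        # No dynamic allocation or extra executors needed for local tests.
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.executor.cores", "1")
        .config("spark.executor.instances", "1")
        # Skip compressing tiny shuffle/RDD blocks and skip the web UI.
        .config("spark.rdd.compress", "false")
        .config("spark.shuffle.compress", "false")
        .config("spark.ui.enabled", "false")
        .getOrCreate()
    )
    yield spark
    spark.stop()
```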