MrPowers opened 5 months ago
@stinodego any clue?
These are default values that will let you run scale factor 1 locally without any problems. If we use PySpark defaults, certain queries fail due to memory issues.
The benchmark blog has the actual values used during the benchmark:
For PySpark, driver memory and executor memory were set to 20g and 10g respectively.
I have tried a few different settings, and these seemed to work best for scale factor 10.
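For reference, here is a rough sketch of what those settings look like when building a SparkSession. The 20g/10g values are the ones reported in the blog; the app name and the `local[*]` master are just placeholders for a single-node run:

```python
from pyspark.sql import SparkSession

# Sketch only: the 20g driver / 10g executor values come from the benchmark
# blog; adjust them to whatever the machine running the queries can afford.
spark = (
    SparkSession.builder
    .appName("tpch-pyspark")   # placeholder app name
    .master("local[*]")        # single-node run
    .config("spark.driver.memory", "20g")
    .config("spark.executor.memory", "10g")
    .getOrCreate()
)
```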
If these benchmarks are being run on a single node, we should probably set the number of shuffle partitions to something like 1-4 instead of the default of 200.
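For example, something along these lines (4 is just an illustrative value for one machine; the right number depends on core count and data size):

```python
# Spark defaults spark.sql.shuffle.partitions to 200, which mostly makes
# sense on a cluster. On a single node, a handful of partitions avoids the
# overhead of scheduling lots of tiny shuffle tasks.
spark.conf.set("spark.sql.shuffle.partitions", "4")
```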
Could be - I am by no means a PySpark optimization expert. Perhaps they should implement better/dynamic defaults.
Here's the line: https://github.com/pola-rs/tpch/blob/6c5bbe93a04cfcd25678dd860bab5ad61ad66edb/queries/pyspark/utils.py#L24