pola-rs / tpch


Why is the Spark memory set to 2gb? #127

Open · MrPowers opened this issue 3 weeks ago

MrPowers commented 3 weeks ago

Here's the line: https://github.com/pola-rs/tpch/blob/6c5bbe93a04cfcd25678dd860bab5ad61ad66edb/queries/pyspark/utils.py#L24

If these benchmarks are being run on a single node, we should probably also set the shuffle partitions to something like 1-4 instead of the default of 200; see the sketch below.
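For illustration, `spark.sql.shuffle.partitions` is a runtime SQL conf, so it can be lowered on an existing session without restarting anything. A minimal sketch, where the value 4 is just an example and not a setting from this repo:

```python
from pyspark.sql import SparkSession

# Illustrative only: a small local session for single-node runs.
spark = SparkSession.builder.master("local[*]").getOrCreate()

# spark.sql.shuffle.partitions is a runtime conf, so it can be changed per
# session; 4 here is an example value, not a benchmarked recommendation.
spark.conf.set("spark.sql.shuffle.partitions", "4")
```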

ritchie46 commented 3 weeks ago

@stinodego any clue?

stinodego commented 3 weeks ago

These are default values that will let you run scale factor 1 locally without any problems. If we use PySpark defaults, certain queries fail due to memory issues.

The benchmark blog post lists the actual values used:

> For PySpark, driver memory and executor memory were set to 20g and 10g respectively.

I have tried a few different settings, and these seemed to work best for scale factor 10.
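For reference, passing those memory settings through the session builder could look roughly like this. This is a sketch, not the repo's actual setup: the app name is made up, and `spark.driver.memory` only takes effect this way when the session is created from a plain Python process before the JVM has started.

```python
from pyspark.sql import SparkSession

# Sketch of the memory settings quoted above (20g driver / 10g executor).
# Note: spark.driver.memory is only honored here if the driver JVM has not
# been launched yet, i.e. when the script is run with plain `python`.
spark = (
    SparkSession.builder
    .appName("tpch-pyspark")  # hypothetical app name
    .master("local[*]")
    .config("spark.driver.memory", "20g")
    .config("spark.executor.memory", "10g")
    .getOrCreate()
)
```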

> If these benchmarks are being run on a single node, we should probably set the shuffle partitions to be like 1-4 instead of 200 (which is the default).

Could be; I am by no means a PySpark optimization expert. Perhaps Spark itself should ship better/dynamic defaults.
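On the dynamic-defaults point: Spark 3.x has adaptive query execution, which can coalesce small shuffle partitions at runtime. A minimal sketch of enabling it explicitly (AQE is already on by default in recent Spark versions, so this is purely illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Adaptive Query Execution can merge small shuffle partitions after each
# stage, which softens the impact of the 200-partition default on one node.
# (Enabled by default since Spark 3.2; shown here only for illustration.)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```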