rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io

Add spark parameter for number of partitions. #48

Open luckyasser opened 7 years ago

luckyasser commented 7 years ago

Right now, the number of partitions in a Spark job is determined by 2 × (number of machines) × (number of cores). In many situations this is far from efficient, both memory- and CPU-wise. While it is indeed recommended to set the number of partitions to 2–4 per core, in jobs where dataset sizes are huge (100 GB+), a partitioned RDD sometimes cannot fit into main memory, which either causes Spark to spill to disk (killing performance) or exhausts the CPU with GC work. I therefore propose a user-defined property, "rheem.spark.numpartitions", to override the SparkExecutor:numDefaultPartitions variable.
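A minimal sketch of how such an override could be resolved, assuming a plain java.util.Properties-based configuration; the class and method names below are illustrative, not Rheem's actual configuration API. Only the property key "rheem.spark.numpartitions" and the 2 × machines × cores default come from this issue.

```java
import java.util.Properties;

public final class PartitionConfig {

    /** Property key proposed in this issue. */
    static final String NUM_PARTITIONS_KEY = "rheem.spark.numpartitions";

    /**
     * Prefers an explicit user override over the computed default of
     * 2 * (number of machines) * (cores per machine).
     */
    static int resolveNumPartitions(Properties props, int numMachines, int coresPerMachine) {
        String override = props.getProperty(NUM_PARTITIONS_KEY);
        if (override != null) {
            return Integer.parseInt(override); // user-defined override wins
        }
        return 2 * numMachines * coresPerMachine; // current default behavior
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(NUM_PARTITIONS_KEY, "400");
        // With the override set, the computed default (2 * 10 * 8 = 160) is ignored.
        System.out.println(resolveNumPartitions(props, 10, 8)); // prints 400
    }
}
```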

sekruse commented 7 years ago

This is a pragmatic solution to a more profound problem, namely platform configuration. The "Rheem" way here would be to hide such issues from the user (why would a user think about the number of Spark partitions when she does not even know which platform Rheem will pick for her app?).

So, if partitioning is an issue for large datasets, we might answer these questions first:

luckyasser commented 7 years ago

Totally agree. It is Rheem's business to select a partitioning strategy for the user. However, that is possibly a research topic in itself: there are a few guidelines that could be automated, but in practice most users fine-tune more or less by trial and error. That being said, having such a parameter might still be of value, following Rheem's practice of letting the user override all of its "clever" decisions (e.g., specifying the target execution platform per operator), as sketched below.
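To illustrate the analogy, here is a self-contained sketch of the "user override beats optimizer decision" pattern mentioned above; SketchOperator, withTargetPlatform, and resolvePlatform are hypothetical names for this example, not Rheem's actual operator API.

```java
import java.util.Optional;

// Illustrative only: an operator that records an optional user-chosen
// platform, mirroring how Rheem lets users pin an operator to a platform.
class SketchOperator {

    private Optional<String> targetPlatform = Optional.empty();

    /** User override: pin this operator to a platform, e.g. "spark". */
    SketchOperator withTargetPlatform(String platform) {
        this.targetPlatform = Optional.of(platform);
        return this;
    }

    /** The optimizer's choice applies only when the user has not decided. */
    String resolvePlatform(String optimizerChoice) {
        return this.targetPlatform.orElse(optimizerChoice);
    }
}
```

The proposed "rheem.spark.numpartitions" property would follow the same design: Rheem decides by default, and the user can overrule it when needed.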

sekruse commented 7 years ago

Right, let's implement said property first and file a new issue for intelligently picking partition sizes for the future.