nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Added --spark-executor-instances to specify how many executor instances per worker #181

Closed · douglaz closed this 7 years ago

douglaz commented 7 years ago

This PR makes SPARK_EXECUTOR_INSTANCES configurable.

I tested this PR by manually launching a cluster and running a job.

nchammas commented 7 years ago

What's the motivation for adding this?

I am wary of adding more options for Spark since there is potentially a lot of stuff we could end up duplicating into Flintrock. Typically, I'd just say "Use run-command to deploy whatever config changes you want to the cluster." It's clumsy but generic.

Regarding executor instances in particular, why not vary your instance size or count if you want to control how many executors you have?
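
For context, the generic run-command approach mentioned above might look something like the sketch below. The cluster name and the Spark install path are made up for illustration; only the `flintrock run-command <cluster> <command>` form is from Flintrock itself.

```bash
# Hypothetical illustration of the generic run-command approach: append a
# Spark setting to spark-env.sh on every node of a running cluster.
# "my-cluster" and the spark-env.sh path are assumptions for this sketch.
flintrock run-command my-cluster \
    'echo "export SPARK_EXECUTOR_INSTANCES=4" >> "$HOME/spark/conf/spark-env.sh"'
```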

douglaz commented 7 years ago

@nchammas I'm sorry, but I'm so used to this feature that I don't even know how people do without it. I mean, how can you run the same job on an x1.32xlarge and an r3.xlarge without using this option or, alternatively, tweaking other JVM/Spark parameters?

Our solution is to break any big machine (any machine with more than 60 GB of RAM) into smaller machines (JVMs) with ~30 GB each. So in this example, we would set the x1.32xlarge to 64 executor instances. This way we always have a homogeneous cluster from the perspective of the executor JVMs, letting us use the same GC defaults for any configuration. Also, even in cases where no GC tuning is needed, like when the machine is an r3.2xlarge, multiple smaller JVMs seem to run faster than a single big JVM (e.g., networking/data transfer is improved).

So this single, simple option lets us avoid many other configuration tweaks and the need to handle different machines in different ways.
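
To make the arithmetic concrete, here is a back-of-the-envelope sketch of the sizing described above. The x1.32xlarge specs (128 vCPUs, ~1952 GiB of RAM) are an addition here; the 64-instance figure and the ~30 GB target are from the comment.

```bash
# Back-of-the-envelope executor sizing for an x1.32xlarge (sketch only,
# not part of this PR). Assumed specs: 128 vCPUs, ~1952 GiB of RAM.
TOTAL_RAM_GB=1952
TOTAL_CORES=128
INSTANCES=64    # chosen so each executor JVM lands around the ~30 GB target

HEAP_PER_EXECUTOR_GB=$(( TOTAL_RAM_GB / INSTANCES ))   # 1952 / 64 = 30
CORES_PER_EXECUTOR=$(( TOTAL_CORES / INSTANCES ))      # 128 / 64 = 2

echo "Each executor JVM: ~${HEAP_PER_EXECUTOR_GB} GB heap, ${CORES_PER_EXECUTOR} cores"
```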

douglaz commented 7 years ago

@nchammas rewritten to conform with current master

nchammas commented 7 years ago

Thanks for sharing the additional information. Makes sense to me. I'm curious whether @BenFradet, @serialx, or any other users running Flintrock with large instance types have felt the need for this setting before.

I'll try to code review this PR next, hopefully by next weekend (though, as with most open source, I can't make any promises).

serialx commented 7 years ago

The code looks okay to me. export SPARK_EXECUTOR_CORES="$(($(nproc) / {spark_executor_instances}))" could be a reasonable default.

It is recommended to set SPARK_EXECUTOR_INSTANCES when the instance has many cores.

I also used to set SPARK_EXECUTOR_INSTANCES greater than 2 when we used the spark-ec2 script, but resorted to SPARK_EXECUTOR_INSTANCES=1 after we migrated to Flintrock. It would be good to support this feature in Flintrock. This setting also makes sense when doing benchmarks.
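
To show what the suggested default computes: {spark_executor_instances} in the snippet above is presumably a placeholder filled in when Flintrock renders its spark-env.sh template. As a standalone shell sketch with the value hard-coded:

```bash
# Standalone sketch of the suggested default. In Flintrock itself,
# {spark_executor_instances} would be substituted at template-render time;
# here it is hard-coded for illustration.
spark_executor_instances=4

# nproc reports the number of available cores on this worker; split them
# evenly across the executor JVMs.
export SPARK_EXECUTOR_CORES="$(( $(nproc) / spark_executor_instances ))"

echo "$SPARK_EXECUTOR_CORES"   # e.g. prints 4 on a 16-core instance
```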

nchammas commented 7 years ago

Sorry for dragging my feet on this. Thank you for contributing this feature and persisting through the review process @douglaz!

Thanks also to @serialx for providing his thoughts, which helped me decide what to do here.