Closed — julienrf closed this issue 2 months ago
Thanks @julienrf. @GeoffMontee, @tarzanek, @pdbossman, please have a look and add any input you might have.
@julienrf suggests having only one Spark worker instance per worker node and scaling by adding more worker nodes. We will proceed with this solution if there are no objections (cc @tarzanek, @pdbossman).
The Ansible-based way to set up a Spark cluster uses a slightly different approach than the general guidelines for scaling the migrator. Indeed, the scripts that invoke `spark-submit` supply some of its arguments based on the environment variables defined in the file `spark-env`, leading to some undocumented specificities in the way the Spark cluster is scaled. This differs from the general documentation, which recommends starting just one worker per node (each worker then uses all the cores of the node) and then using `--executor-cores` in the `spark-submit` invocation to control how many cores to use.

I believe we should converge on a single way to scale the migrator, with as few variables as possible.
Furthermore, it seems the `--num-executors` argument used by the Ansible-based scripts is only supported on YARN, not in Spark Standalone mode (see https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications).
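If helpful, a rough sketch of a Standalone-compatible equivalent (illustrative values only): instead of `--num-executors`, the application's total parallelism can be capped with `--total-executor-cores`, and Spark derives the number of executors from that together with `--executor-cores`:

```bash
# Standalone mode: cap the application at 16 cores total, 4 per executor,
# which yields at most 4 executors spread across the worker nodes.
spark-submit \
  --master spark://<spark-master-host>:7077 \
  --executor-cores 4 \
  --total-executor-cores 16 \
  --class com.scylladb.migrator.Migrator \
  --conf spark.scylla.config=config.yaml \
  scylla-migrator-assembly.jar
```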