scylladb / scylla-migrator

Migrate data to Scylla using Spark, typically from Cassandra or Parquet files. Alternatively, migrate from DynamoDB to Scylla Alternator.
https://migrator.docs.scylladb.com/stable/
Apache License 2.0

The Ansible-based approach does not document precisely how to configure the Spark cluster #192

Closed. julienrf closed this issue 2 months ago.

julienrf commented 3 months ago

The Ansible-based way to set up a Spark cluster follows a slightly different approach from the general guidelines for scaling the migrator.

Indeed, the scripts that invoke spark-submit derive some of its arguments from environment variables defined in the spark-env file, which leads to undocumented specificities regarding the way to scale the Spark cluster.
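For illustration only, here is a hypothetical sketch of what such a mapping could look like (the variable names, values, and arguments below are assumptions, not the actual contents of the Ansible playbooks):

```bash
# spark-env (hypothetical sketch; the actual variables and values used by the
# Ansible playbooks may differ)
export SPARK_WORKER_INSTANCES=4   # worker processes started on each node
export SPARK_WORKER_CORES=2       # cores given to each worker process
export SPARK_WORKER_MEMORY=8G     # memory given to each worker process

# Hypothetical spark-submit invocation derived from the variables above
spark-submit \
  --num-executors "$SPARK_WORKER_INSTANCES" \
  --executor-cores "$SPARK_WORKER_CORES" \
  --executor-memory "$SPARK_WORKER_MEMORY" \
  <other-migrator-arguments>
```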

This differs from the general documentation, which recommends starting just one worker per node (each worker then uses all the cores of the node) and controlling how many cores to use with --executor-cores in the spark-submit invocation.
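As a sketch of that documented approach (assuming a Spark Standalone cluster; the values below are illustrative):

```bash
# One worker per node (the Spark Standalone default), exposing all of the
# node's cores; the resources actually used by the migrator are then capped
# at submit time:
spark-submit \
  --executor-cores 8 \
  --executor-memory 16G \
  <other-migrator-arguments>
```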

I believe we should converge to a single way to scale the migrator, with as few variables as possible.

Furthermore, it seems the argument --num-executors used by the Ansible-based scripts is only supported in YARN, not in Spark Standalone (see https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications).
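For reference, in Spark Standalone mode the overall parallelism is usually capped with --total-executor-cores (or the spark.cores.max configuration) rather than --num-executors; a minimal sketch with illustrative values:

```bash
# Spark Standalone ignores --num-executors; cap the total number of cores the
# application may use across the cluster instead:
spark-submit \
  --executor-cores 4 \
  --total-executor-cores 16 \
  <other-migrator-arguments>
# With 4 cores per executor and 16 cores in total, up to 4 executors are
# launched, spread across the available workers.
```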

guy9 commented 3 months ago

Thanks @julienrf. @GeoffMontee, @tarzanek, @pdbossman, please have a look and add any input you might have.

guy9 commented 3 months ago

@julienrf suggests having only one Spark worker instance per worker node and scaling by adding more worker nodes. We will proceed with this solution if there are no objections (cc @tarzanek, @pdbossman).
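If that proposal is adopted, scaling out could look like the following sketch (the host name is hypothetical, and the exact procedure would be settled as part of the fix):

```bash
# Add the new node to the Spark master's list of workers and start it
# (conf/workers and sbin/start-workers.sh in recent Spark releases;
# conf/slaves and sbin/start-slaves.sh in older ones):
echo "spark-worker-3.internal" >> "$SPARK_HOME/conf/workers"
"$SPARK_HOME/sbin/start-workers.sh"
```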