scylladb / scylla-migrator

Migrate data to Scylla using Spark, typically from Cassandra or Parquet files. Alternatively, migrate from DynamoDB to Scylla Alternator.
https://migrator.docs.scylladb.com/stable/
Apache License 2.0

The Ansible-based approach does not document precisely how to configure the Spark cluster #192

Closed. julienrf closed this issue 2 months ago.

julienrf commented 3 months ago

The Ansible-based way to set up a Spark cluster follows a slightly different approach from the general guidelines for scaling the migrator.

Indeed, the scripts that invoke spark-submit derive some of its arguments from environment variables defined in the spark-env file, which leads to undocumented specificities regarding the way to scale the Spark cluster.
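For illustration only, here is a hypothetical sketch of what such a mapping could look like (the variable names, values, and arguments below are assumptions, not the actual contents of the Ansible playbooks):

```bash
# spark-env (hypothetical sketch; the actual variables and values used by the
# Ansible playbooks may differ)
export SPARK_WORKER_INSTANCES=4   # worker processes started on each node
export SPARK_WORKER_CORES=2       # cores given to each worker process
export SPARK_WORKER_MEMORY=8G     # memory given to each worker process

# Hypothetical spark-submit invocation derived from the variables above
spark-submit \
  --num-executors "$SPARK_WORKER_INSTANCES" \
  --executor-cores "$SPARK_WORKER_CORES" \
  --executor-memory "$SPARK_WORKER_MEMORY" \
  <other-migrator-arguments>
```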

This differs from the general documentation, which recommends starting just one worker per node (each worker then uses all the cores of the node) and controlling how many cores to use with --executor-cores in the spark-submit invocation.
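As a sketch of that documented approach (assuming a Spark Standalone cluster; the values below are illustrative):

```bash
# One worker per node (the Spark Standalone default), exposing all of the
# node's cores; the resources actually used by the migrator are then capped
# at submit time:
spark-submit \
  --executor-cores 8 \
  --executor-memory 16G \
  <other-migrator-arguments>
```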

I believe we should converge to a single way to scale the migrator, with as few variables as possible.

Furthermore, it seems the argument --num-executors used by the Ansible-based scripts is only supported in YARN, not in Spark Standalone (see https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications).
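For reference, in Spark Standalone mode the overall parallelism is usually capped with --total-executor-cores (or the spark.cores.max configuration) rather than --num-executors; a minimal sketch with illustrative values:

```bash
# Spark Standalone ignores --num-executors; cap the total number of cores the
# application may use across the cluster instead:
spark-submit \
  --executor-cores 4 \
  --total-executor-cores 16 \
  <other-migrator-arguments>
# With 4 cores per executor and 16 cores in total, up to 4 executors are
# launched, spread across the available workers.
```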

guy9 commented 3 months ago

Thanks @julienrf. @GeoffMontee, @tarzanek, @pdbossman, please have a look and add any input you might have.

guy9 commented 3 months ago

@julienrf suggests having only one Spark worker instance per worker node and scaling by adding more worker nodes. We will proceed with this solution if there are no objections (cc @tarzanek, @pdbossman).
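If that proposal is adopted, scaling out could look like the following sketch (the host name is hypothetical, and the exact procedure would be settled as part of the fix):

```bash
# Add the new node to the Spark master's list of workers and start it
# (conf/workers and sbin/start-workers.sh in recent Spark releases;
# conf/slaves and sbin/start-slaves.sh in older ones):
echo "spark-worker-3.internal" >> "$SPARK_HOME/conf/workers"
"$SPARK_HOME/sbin/start-workers.sh"
```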