scylladb / scylla-migrator

Migrate data extracts using Spark to Scylla, normally from Cassandra
Apache License 2.0

$SPARK_HOME/work can fill up on worker nodes during DynamoDB migration #167

Open · GeoffMontee opened 6 days ago

During a recent DynamoDB migration, $SPARK_HOME/work on the worker nodes grew to 479 GB, which filled up a 485 GB disk.

To try to work around this, we set:

export SPARK_WORKER_OPTS='-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=600'

For a description of the properties, see: https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
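
For reference, a fuller sketch of the cleanup settings (the appDataTtl value below is an assumption for illustration, not what we actually ran):

# In spark-env.sh on each worker. Both intervals are in seconds.
# spark.worker.cleanup.appDataTtl (default 604800, i.e. 7 days) controls
# how long per-application work directories are retained, so lowering it
# is what actually frees space once an application has stopped.
export SPARK_WORKER_OPTS='-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=600 -Dspark.worker.cleanup.appDataTtl=3600'

One caveat from the Spark docs: the periodic cleanup only removes directories of stopped applications, so it may not help while a single long-running migration job is still filling the disk.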

Interestingly, when this was retested with a smaller table, the $SPARK_HOME/work directory was 44 GB, essentially the same size as the table:

$ du -sh /opt/spark/work/
44G /opt/spark/work/
$ aws dynamodb describe-table --table-name=MY_TABLE | grep "TableSizeBytes"
        "TableSizeBytes": 44174574774,

Is the entire table being written to disk?
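
In case it helps narrow this down, a hedged diagnostic sketch (the file-name patterns are standard Spark block-manager conventions, not something verified against this migration; the paths assume the layout from above):

# Largest subtrees under the work directory
du -h --max-depth=3 /opt/spark/work | sort -h | tail -n 20
# Sample the biggest files: shuffle_*.data / shuffle_*.index names would
# point at shuffle output, while rdd_* blocks would point at an RDD
# persisted with a disk-backed storage level.
find /opt/spark/work -type f -size +100M -printf '%s\t%p\n' | sort -nr | head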