scylladb / scylla-migrator

Migrate data extract using Spark to Scylla, normally from Cassandra

Improve jobs configuration #135

Closed: julienrf closed this 2 months ago

julienrf commented 2 months ago

These settings are used by the emr-dynamodb-connector to adjust the resource usage.

Relates to #130 and #133
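
For context, a minimal sketch of the DynamoDB source section in the migrator config file, assuming the `scanSegments`, `maxMapTasks`, and `throughputReadPercent` knobs exposed there map onto the connector's corresponding properties (the table name is hypothetical and exact field names may differ between versions):

```yaml
source:
  type: dynamodb
  table: my-source-table   # hypothetical table name
  # How many segments the table scan is split into; this drives
  # the number of parallel reads (and hence RDD partitions).
  scanSegments: 8
  # Upper bound on the number of map tasks the connector may schedule.
  maxMapTasks: 4
  # Fraction of the table's provisioned read throughput the job may
  # consume; the connector throttles reads to stay under this limit.
  throughputReadPercent: 0.8
```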

julienrf commented 2 months ago

This PR is ready for review. The goal is to give the Hadoop connector as much information as we can, so that it can do its best job of tuning the MapReduce job.

The memory capacity of the executors is still not correctly configured, so the defaults are used, which caps the job at 5 RDD partitions in every case. Before we work on that problem, I would like to see what we get with the changes brought by this PR.
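
Until the executor memory is reported correctly, one workaround may be to force the split count by hand. A minimal sketch, assuming an explicit `scanSegments` setting takes precedence over the connector's memory-based default (the field name and value are illustrative):

```yaml
source:
  type: dynamodb
  table: my-source-table   # hypothetical table name
  # Request more parallelism than the memory-based default computes:
  scanSegments: 32
```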

tarzanek commented 2 months ago

Merging. We should likely also uncomment and use throughputReadPercent; in my old PR this was one of the factors determining how fast the jobs actually go (so besides splitting the scan, it also throttles it).
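
For illustration, uncommenting the throttling knob might look like this in the config file (the table name and the 0.5 value are examples, not recommendations):

```yaml
source:
  type: dynamodb
  table: my-source-table   # hypothetical table name
  scanSegments: 8          # controls how the scan is split
  # Throttle: consume at most half of the table's provisioned
  # read capacity while the migration runs.
  throughputReadPercent: 0.5
```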

julienrf commented 2 months ago

FTR, the test suite now uses 5 partitions by default to migrate from DynamoDB to Alternator, even though there is very little data to migrate in the test. This is because the default behavior is based on the memory capacity of the Spark cluster and the read throughput of the source table. If we want to decrease the number of partitions, we can set scanSegments: 1 in the migrator config file, as sketched below.
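
A minimal sketch of that override, assuming the same config layout as above (the table name is a hypothetical test fixture):

```yaml
source:
  type: dynamodb
  table: test-table   # hypothetical test fixture name
  # A single scan segment is enough for a handful of items and
  # avoids the memory/throughput-based default of 5 partitions.
  scanSegments: 1
```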