scylladb / scylla-migrator

Migrate data to Scylla using Spark, typically from Cassandra
Apache License 2.0

Support load balancing for Alternator access when migrating DynamoDB #117

Open tarzanek opened 3 months ago

tarzanek commented 3 months ago

Right now, the connection to Alternator keeps only one shard busy, with effectively no load balancing.

There is a blog post, https://resources.scylladb.com/dynamodb-replacement/load-balancing-in-scylla-alternator,

and a code sample, https://github.com/scylladb/alternator-load-balancing/tree/master/java,

explaining the problem.

We should try to apply this automatically whenever Alternator is used as the source or the target (or both! :-)).
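For context, the approach described in the blog post and the Java sample boils down to spreading requests round-robin across the cluster's nodes instead of always hitting the single configured endpoint. A minimal sketch of such a selector is below; the class and node URIs are illustrative only (the real alternator-load-balancing library also discovers and refreshes the node list dynamically, which this sketch omits):

```java
import java.net.URI;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: cycle Alternator requests across all known nodes
// instead of keeping a single node (and its shards) busy.
public class RoundRobinEndpoints {
    private final List<URI> nodes;
    private final AtomicInteger next = new AtomicInteger(0);

    public RoundRobinEndpoints(List<URI> nodes) {
        this.nodes = nodes;
    }

    /** Returns the node to use for the next request, cycling through all nodes. */
    public URI nextNode() {
        // floorMod keeps the index valid even after the counter overflows
        int i = Math.floorMod(next.getAndIncrement(), nodes.size());
        return nodes.get(i);
    }

    public static void main(String[] args) {
        RoundRobinEndpoints lb = new RoundRobinEndpoints(List.of(
            URI.create("http://10.0.0.1:8000"),
            URI.create("http://10.0.0.2:8000"),
            URI.create("http://10.0.0.3:8000")));
        for (int k = 0; k < 4; k++) {
            System.out.println(lb.nextNode()); // cycles back to the first node
        }
    }
}
```

The interesting question for the migrator is not this rotation logic itself but where to hook it into the request path, which is what the discussion below is about.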

julienrf commented 3 months ago

The solution described in the blog article uses a DynamoDB request handler to perform the load balancing. However, in our case the migrator delegates communication with Alternator to the Hadoop connector, which does not seem to provide a way to inject request handlers (see https://github.com/awslabs/emr-dynamodb-connector/blob/master/emr-dynamodb-hadoop/src/main/java/org/apache/hadoop/dynamodb/DynamoDBClient.java).

I will investigate whether there is something in alternator-load-balancing that we could reuse and plug into the Hadoop connector.

julienrf commented 3 months ago

As described above, I see no way to inject a request handler into the Hadoop connector. To use the load balancer, we need to override some parts of the Hadoop connector to inject the desired request handler.

Currently, the Hadoop configuration uses the class DynamoDBOutputFormat to write to the Alternator.

We could change the configuration to use a custom class that we would provide and that would inject the desired request handler to the underlying DynamoDB driver.
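On the migrator side, the swap itself could be as small as pointing the Hadoop job configuration at our class. A sketch of the relevant property (the property name follows Hadoop MRv2 conventions; the replacement class name is entirely hypothetical):

```properties
# Hypothetical: use our own output format instead of the stock DynamoDBOutputFormat
mapreduce.job.outputformat.class=com.scylladb.migrator.alternator.AlternatorOutputFormat
```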

The role of the class DynamoDBOutputFormat is to provide a Hadoop RecordWriter that writes to DynamoDB (or Alternator, in our case). Currently, it returns a DefaultDynamoDBRecordWriter instance, which extends the class AbstractDynamoDBRecordWriter. That class contains some writing logic and is also responsible for instantiating the class DynamoDBClient, which contains even more logic to communicate with DynamoDB, and which uses the AWS DynamoDbClient under the hood.

So, ideally, we would like to override the DynamoDBClient class to configure its underlying AWS DynamoDbClient to use the request handler that performs the load balancing. However, this class was not designed to be extensible: everything is private. So we would have to copy it into our repository and apply our changes directly there. It is 500 lines of code.
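To make the idea concrete, here is a minimal sketch of the hook we would want to inject. The types below are hypothetical stand-ins, not the real AWS SDK or connector classes: they only illustrate the shape of a handler that rewrites the target endpoint of each outgoing request before it is sent.

```java
import java.net.URI;
import java.util.function.Supplier;

// Hypothetical stand-ins (not the real AWS SDK types) showing the shape of
// the hook we would need DynamoDBClient to expose.
public class HandlerSketch {

    /** Stand-in for an outgoing DynamoDB request with a mutable endpoint. */
    static class Request {
        URI endpoint;
        Request(URI endpoint) { this.endpoint = endpoint; }
    }

    /** Stand-in for the load-balancing request handler. */
    static class LoadBalancingHandler {
        private final Supplier<URI> nextNode; // e.g. a round-robin over live nodes
        LoadBalancingHandler(Supplier<URI> nextNode) { this.nextNode = nextNode; }

        /** Invoked before each request is sent: redirect it to another node. */
        void beforeRequest(Request request) {
            request.endpoint = nextNode.get();
        }
    }

    public static void main(String[] args) {
        LoadBalancingHandler handler =
            new LoadBalancingHandler(() -> URI.create("http://10.0.0.2:8000"));
        Request request = new Request(URI.create("http://10.0.0.1:8000"));
        handler.beforeRequest(request);
        System.out.println(request.endpoint); // no longer the configured endpoint
    }
}
```

The difficulty described in this thread is precisely that DynamoDBClient offers no seam where such a handler could be plugged in, hence the copy-paste plan below.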

The story is similar for the class AbstractDynamoDBRecordWriter. Its only extension point is the abstract method convertValueToDynamoDBItem, and there is no way to override its DynamoDBClient member, which is private. So we would also have to copy this class definition into our repository to use our own DynamoDBClient. This class is less than 200 lines of code.

Lastly, we would also have to define our own equivalent of DynamoDBOutputFormat, which would be referenced by the Hadoop configuration and would be responsible for providing our own record writer. This one is just a few lines of code.

Here is a diagram that summarizes the architecture.

```mermaid
classDiagram
  DynamoDBOutputFormat --> DefaultDynamoDBRecordWriter
  DefaultDynamoDBRecordWriter --|> AbstractDynamoDBRecordWriter
  AbstractDynamoDBRecordWriter *-- "1" DynamoDBClient
  class DynamoDBClient {
    - dynamoDb: aws.DynamoDbClient
  }
```