scylladb / scylla-migrator

Migrate data extract using Spark to Scylla, normally from Cassandra
Apache License 2.0

Perform load balancing when communicating with a cluster of Alternators #132

Open julienrf opened 2 months ago

julienrf commented 2 months ago

Customize the emr-dynamodb-connector to configure the DynamoDB client to use the load balancing request handler from https://github.com/scylladb/alternator-load-balancing.

Unfortunately, emr-dynamodb-connector was not designed to support customization of its underlying DynamoDB client. I therefore had to copy code from the project and modify it to customize that client. The result is a parallel class hierarchy that sits alongside the original one, with every class name prefixed by LoadBalanced.

The PR should be reviewed commit by commit, ignoring the content of 424141ba727e5fbcdbbf72bcbe27125a25909793 and 2b7f0762d830d606d953522718f6eac347b29007, which copy content verbatim from emr-dynamodb-connector.

Blocked by scylladb/alternator-load-balancing#18

Fixes #117

julienrf commented 2 months ago

Unfortunately, the PR contains a lot of changes, but in my opinion this is still the best approach for now.

I’ve also considered using reflection to reach the private DynamoDB client of the Hadoop job and configure it with our custom request handler, but this is not simple to achieve because we don’t have access to the instance that holds that private client. We would therefore have to instrument the bytecode at the places that create and configure the DynamoDB client. This could be done with tools like AspectJ or Java Instrumentation, but in my opinion those tools would not make maintenance simpler, because they require specific expertise.
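To make the reflection limitation concrete, here is a minimal sketch using only the JDK reflection API. `FakeClient` and `FakeRecordReader` are hypothetical stand-ins, not classes from emr-dynamodb-connector. Swapping a private field is easy once you hold the owning object, and that reference is exactly what we cannot get from the Hadoop job.

```scala
// Hypothetical stand-ins; these are NOT the connector's real classes.
final class FakeClient(val endpoint: String)

final class FakeRecordReader {
  // Private field, analogous to the client hidden inside the Hadoop job.
  private var client: FakeClient = new FakeClient("http://default:8000")
  def currentEndpoint: String = client.endpoint
}

object ReflectionSketch {
  // Replaces the private client of `owner`. This only works because the
  // caller already holds a reference to the owning instance.
  def replaceClient(owner: FakeRecordReader, replacement: FakeClient): Unit = {
    val field = classOf[FakeRecordReader].getDeclaredField("client")
    field.setAccessible(true)
    field.set(owner, replacement)
  }

  def main(args: Array[String]): Unit = {
    val reader = new FakeRecordReader
    replaceClient(reader, new FakeClient("http://alternator:8000"))
    println(reader.currentEndpoint)
  }
}
```

In the real job the equivalent of `reader` is created deep inside the connector, so there is no `owner` to pass in, which is why bytecode instrumentation would be the only way to intercept it.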

On the other hand, with simple changes in emr-dynamodb-connector, we could easily achieve our goal. For instance, if awslabs/emr-dynamodb-connector#196 is merged, we will be able to set up our custom request handler like so:

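// In DynamoUtils.setDynamoDBJobConf
// (CUSTOM_CLIENT_BUILDER_TRANSFORMER would come from awslabs/emr-dynamodb-connector#196)
jobConf.set(DynamoDBConstants.CUSTOM_CLIENT_BUILDER_TRANSFORMER, "WithLoadBalancing")
jobConf.set("scylladb.alternator.endpoint", "https://127.0.0.1:8043")

// Where the class WithLoadBalancing would be defined as follows
// (DynamoDbClientBuilderTransformer and AlternatorEndpointProvider must also be imported)
import java.net.URI
import org.apache.hadoop.conf.{Configurable, Configuration}
import software.amazon.awssdk.services.dynamodb.DynamoDbClientBuilder

class WithLoadBalancing extends DynamoDbClientBuilderTransformer with Configurable {
  // Job configuration, populated through setConf
  private var conf: Configuration = null
  // Point the client at the Alternator endpoint read from the job configuration
  def apply(builder: DynamoDbClientBuilder): DynamoDbClientBuilder =
    builder.endpointProvider(
      new AlternatorEndpointProvider(URI.create(conf.get("scylladb.alternator.endpoint")))
    )
  def setConf(conf: Configuration): Unit = {
    this.conf = conf
  }
  def getConf: Configuration = this.conf
}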

(Note that we would also have to upgrade to v2 of the AWS SDK to use the latest version of the connector.)
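How the connector would consume the configured class name is not shown in this thread. One plausible mechanism, sketched below with hypothetical stand-ins only (`FakeConf`, `FakeConfigurable`, `FakeBuilder`, `FakeTransformer`, and the key `hypothetical.client.builder.transformer` are all invented for illustration), is to instantiate the class reflectively, hand it the job configuration if it is configurable, and then apply it to the client builder.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for Hadoop's Configuration/Configurable and for the
// transformer interface proposed upstream; none of these are real library types.
final class FakeConf {
  private val values = mutable.Map.empty[String, String]
  def set(key: String, value: String): Unit = values.update(key, value)
  def get(key: String): Option[String] = values.get(key)
}
trait FakeConfigurable { def setConf(conf: FakeConf): Unit }
final case class FakeBuilder(endpoint: Option[String] = None)
trait FakeTransformer { def apply(builder: FakeBuilder): FakeBuilder }

// Counterpart of WithLoadBalancing: reads the endpoint from the configuration.
class WithFakeLoadBalancing extends FakeTransformer with FakeConfigurable {
  private var conf: FakeConf = new FakeConf
  def setConf(conf: FakeConf): Unit = { this.conf = conf }
  def apply(builder: FakeBuilder): FakeBuilder =
    builder.copy(endpoint = conf.get("scylladb.alternator.endpoint"))
}

object TransformerLoader {
  // Hypothetical configuration key naming the transformer class.
  val TransformerKey = "hypothetical.client.builder.transformer"

  // Builds the default builder, then applies the configured transformer if any.
  def buildClient(conf: FakeConf): FakeBuilder = {
    val base = FakeBuilder()
    conf.get(TransformerKey) match {
      case None => base
      case Some(className) =>
        val instance = Class.forName(className).getDeclaredConstructor().newInstance()
        // Inject the configuration before the transformer is used.
        instance match {
          case c: FakeConfigurable => c.setConf(conf)
          case _ => ()
        }
        instance.asInstanceOf[FakeTransformer].apply(base)
    }
  }
}
```

With such a hook, the job only needs two configuration entries, and no connector class has to be copied:

```scala
val conf = new FakeConf
conf.set(TransformerLoader.TransformerKey, classOf[WithFakeLoadBalancing].getName)
conf.set("scylladb.alternator.endpoint", "https://127.0.0.1:8043")
val builder = TransformerLoader.buildClient(conf)
```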

tarzanek commented 2 months ago

We have no control over AWS code, so whether they accept it or not is a dice roll ;-) Worst case, we can temporarily go with this big fork and then clean up once AWS merges and releases your PR?