scylladb / kafka-connect-scylladb

Kafka Connect Scylladb Sink

how to write multiple topics to single scylla table #55

Open dhgokul opened 3 years ago

dhgokul commented 3 years ago

At present we have three topics: topic1, topic2, and topic3. We use a separate sink connector (on a separate server) for each topic. Instead of writing each topic's messages to a different Scylla table, we are looking to write all topics into a single common table.

At present we use the Confluent regex property (RegexRouter) to achieve this, but it is not working well: overwriting happens on the sink connectors that use the regex.

Is there a more efficient way to achieve this?

avelanarius commented 3 years ago

Could you be more specific about what the problem is? Are you observing poor performance of RegexRouter (the "regex property")? Or is another part of the system slow (the connector itself)? After looking at the RegexRouter source code and doing some micro-benchmarks, it seems that this transform should not add a significant amount of overhead.
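For context, RegexRouter only compiles the configured pattern once and rewrites the record's topic name; the key, value and offset are carried over unchanged. A minimal standalone sketch of that behaviour (the transform class and its regex/replacement settings are the real SMT; the record contents below are made up for illustration):

```java
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.transforms.RegexRouter;

public class RegexRouterDemo {
    public static void main(String[] args) {
        // Same transform and settings as the "dropPrefix" config in this thread.
        RegexRouter<SinkRecord> router = new RegexRouter<>();
        router.configure(Map.of("regex", "sub1-(.*)", "replacement", "$1"));

        // A dummy record that arrived on topic "sub1-topic1" (schemas omitted).
        SinkRecord in = new SinkRecord("sub1-topic1", 0, null, "42", null, "{\"id\":42}", 0L);

        // apply() only swaps the topic name; everything else on the record is untouched.
        SinkRecord out = router.apply(in);
        System.out.println(in.topic() + " -> " + out.topic()); // sub1-topic1 -> topic1

        router.close();
    }
}
```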

dhgokul commented 3 years ago

We are trying to use the sink connector to write multiple topics from Redpanda into a single Scylla table. In our case we ran a 100 million message test using 5 topics and 5 sink connectors; the topics are: topic1, sub1-topic1, sub2-topic1, sub3-topic1, sub4-topic1. Our Redpanda cluster has 3 nodes, and each topic has 10 partitions. We tried with and without replication in Redpanda.

**Connector [JSON] Config:**

bootstrap.servers=redpanda_cluster_1_ip:9092,redpanda_cluster_2_ip:9092,redpanda_cluster_3_ip:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=target/components/packages/

**Sink Connector-1 Config [Json]:**
name=scylladb-sink-connector
connector.class=io.connect.scylladb.ScyllaDbSinkConnector
tasks.max=56
topics=topic1
scylladb.contact.points=scylla_cluster_1_ip,scylla_cluster_2_ip,scylla_cluster_3_ip
scylladb.port=9042
scylladb.keyspace=streamprocess
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
transforms=createKey
transforms.createKey.fields=id
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey

**Sink Connector-2 Config [Json]:**

name=scylladb-sink-connector2
connector.class=io.connect.scylladb.ScyllaDbSinkConnector
tasks.max=56
topics=sub1-topic1
scylladb.contact.points=scylla_cluster_1_ip,scylla_cluster_2_ip,scylla_cluster_3_ip
scylladb.consistency.level=QUORUM
scylladb.keyspace.replication.factor=3
scylladb.port=9042
scylladb.keyspace=streamprocess
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
transforms=createKey,dropPrefix
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.dropPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.dropPrefix.regex=sub1-(.*)
transforms.dropPrefix.replacement=$1

We use the Sink Connector-1 config for topic1 and the Sink Connector-2 config for the rest of the topics, running on 5 separate machines. We are facing 2 issues:

1. Compared to the Sink Connector-1 config, the Sink Connector-2 configs are slower.
2. Messages are being overwritten in the Scylla cluster. Once the 100 million messages had been dumped into Scylla, we restarted only the sink connectors, but overwriting of messages happened: when checked with nodetool tablestats, instead of a local write count of 10 million it shows 100+ million writes.

Are there any changes needed in the config?
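For reference, here is a sketch of what we think a single consolidated connector could look like, using the Connect framework's standard topics.regex subscription instead of five separate connectors, with one RegexRouter rewriting every sub*-topic1 name back to topic1 so all records target the same table. The name, tasks.max and regex values below are illustrative, and we have not verified that this avoids the slowdown or the inflated write counts. Would something along these lines be the recommended approach?

**Consolidated Sink Connector Config [sketch]:**

name=scylladb-sink-connector-all
connector.class=io.connect.scylladb.ScyllaDbSinkConnector
tasks.max=10
topics.regex=(sub[1-4]-)?topic1
scylladb.contact.points=scylla_cluster_1_ip,scylla_cluster_2_ip,scylla_cluster_3_ip
scylladb.port=9042
scylladb.keyspace=streamprocess
scylladb.consistency.level=QUORUM
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
transforms=createKey,dropPrefix
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.dropPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.dropPrefix.regex=sub[1-4]-(.*)
transforms.dropPrefix.replacement=$1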