uber / uReplicator

Improvement of Apache Kafka Mirrormaker
Apache License 2.0
917 stars 199 forks

uber replicator configuration for the test environment. #223

Open Gk8143197049 opened 5 years ago

Gk8143197049 commented 5 years ago

Can I have example configurations so that we can test the lag? I am trying to evaluate uReplicator. I have cloned the master branch. Where can I configure the controller starter config, and how can I configure the source cluster and target cluster for all components? Once I understand, I can also build documentation for uReplicator.

Please let me know.

yangy0000 commented 5 years ago

Here is the sample starter config. We will compose a README later.

-mode customized \                         -- helix rebalance mode (always use customized)
-enableAutoWhitelist true \                -- enable topic auto whitelist or not
-port 9000 \                               -- controller REST endpoint port
-refreshTimeInSeconds 10 \                 -- auto whitelist background check interval
-srcKafkaZkPath localhost:2181/cluster1 \  -- source broker
-zookeeper localhost:2181 \                -- zookeeper for helix cluster
-destKafkaZkPath localhost:2181/cluster1 \ -- destination broker
-helixClusterName testMirrorMaker          -- helix cluster name
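
For example, a full invocation might look like this (a sketch; the hostnames and cluster names below are placeholders, and the script path is the one from the distribution package):

./uReplicator-Distribution/target/uReplicator-Distribution-pkg/bin/start-controller.sh \
  -mode customized \
  -enableAutoWhitelist true \
  -port 9000 \
  -refreshTimeInSeconds 10 \
  -srcKafkaZkPath srczk.example.com:2181/cluster1 \
  -destKafkaZkPath dstzk.example.com:2181/cluster2 \
  -zookeeper helixzk.example.com:2181 \
  -helixClusterName testMirrorMaker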

Gk8143197049 commented 5 years ago

My current Kafka version is 2.1.1. I am running the scripts below from the broker kafka01.xxxx (Kafka/ZooKeeper is running there, i.e. my source cluster); the source cluster is kafka01.xxxx and the target cluster is kafka02.xxxxx (Kafka/ZooKeeper already running there).

Controller script to bring the controller up and running:

./uReplicator-Distribution/target/uReplicator-Distribution-pkg/bin/start-controller.sh -port 9000 -helixClusterName uReplicatorDev -refreshTimeInSeconds 10 -enableAutoWhitelist true -srcKafkaZkPath kafka01.xxxxxxxxxxxxxxxxx:2181 -destKafkaZkPath kafka02.xxxxxxxxxxxxxxxx:2181

1. How do I check the source Kafka path (I just gave the Kafka ZooKeeper connection below)? Is it right?
2. What should be configured in the producer and consumer properties (source or target cluster? In my case I am running uReplicator on the source cluster) for my scenario? And what about the Helix configuration details?

Worker script:

./uReplicator-Distribution/target/uReplicator-Distribution-pkg/bin/start-worker.sh --consumer.config ./config/consumer.properties --producer.config ./config/producer.properties --helix.config ./config/helix.properties

For my worker configuration I gave source cluster details in the consumer and producer properties, and it fails.

Helix configuration:
zkServer=kafka01.xxxxxxxxxxxxx:2181
instanceId=testHelixMirrorMaker01
helixClusterName=uReplicatorDev
enableAutowhitelist=true
port=9000
srckafkaZkpath=kafka01.xxxxxxxxxxxxxxxxxxxxxxxxx:2181
destKafkaZkpath=kafka02.xxxxxxxxxxxxxxxxxxxxxxxx:2181

Consumer properties:
zookeeper.connect=kafka01.xxxxxxxxxxxxxxxx:2181
zookeeper.connection.timeout.ms=30000
zookeeper.session.timeout.ms=30000

# consumer group id
group.id=kloak-mirrormaker-test
consumer.id=kloakmms01-sjc1
partition.assignment.strategy=roundrobin
socket.receive.buffer.bytes=1048576
fetch.message.max.bytes=8388608
queued.max.message.chunks=5
auto.offset.reset=smallest

Producer properties:
bootstrap.servers=kafka01.xxxxxxxxxxxxxxxxxxxxxxxxxxx:9092,kafka01.xxxxxxxxxxxx:9093,kafka02.xxxxxxxxx:9092
client.id=kloak-mirrormaker-test
producer.type=async
compression.type=none
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

Let me know what mistakes I have made in the worker and controller configurations; with that I will be able to build the document. Where can I look for the controller and worker logs?

yangy0000 commented 5 years ago

-zookeeper xxxx:2181 is missing for the controller, so your controller might not be running. (-zookeeper xxxx:2181 needs to be the same as zkServer in helix.properties for the worker.)

Let's say you use kafka01.xxxxxxxxxxxxx:2181 as -zookeeper: you should be able to see the /uReplicatorDev path under kafka01.xxxxxxxxxxxxx:2181 if you set up the controller successfully.
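
For example, one way to check is with the ZooKeeper shell that ships with Kafka (the hostname below is a placeholder):

# List the Helix cluster path created by the controller
./bin/zookeeper-shell.sh kafka01.example.com:2181 ls /uReplicatorDev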

Gk8143197049 commented 5 years ago

Right now I have added -zookeeper in the controller startup script. Yes, I have passed the same ZooKeeper details for the controller and for zkServer in helix.properties. Do I need to create topics in the destination cluster (with the same names as in the source cluster) for the worker to enable whitelisting?

I will post if I get errors. The configuration should mostly work.

yangy0000 commented 5 years ago

Yes, you need to create the topics in the destination cluster manually.
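
For example, something like this on the destination cluster should work (hostname, topic name, and counts are placeholders; the partition count should match the source topic):

./bin/kafka-topics.sh --create \
  --zookeeper kafka02.example.com:2181 \
  --topic mytopic \
  --partitions 1 \
  --replication-factor 1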

Gk8143197049 commented 5 years ago

Got it! My destination server is a cloud server that is currently turned off, so the controller is failing. I will get back to you tomorrow morning. Once it is good to go, I can create a document in a proper format and drop it in as a README or configuration document.

Thanks.

yangy0000 commented 5 years ago

Also, when you set -enableAutoWhitelist true in the controller, uReplicator will automatically replicate messages when a topic exists in both the source and destination cluster (Example 2 in the README). When -enableAutoWhitelist is set to false, you need to add topics to uReplicator manually (Example 1).
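
For example, with -enableAutoWhitelist false you can whitelist a topic through the controller's REST endpoint (port 9000 is the -port value above; the topic name is a placeholder):

curl -X POST -d '{"topic":"mytopic", "numPartitions":"1"}' http://localhost:9000/topics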

Gk8143197049 commented 5 years ago

CONTROLLER ERRORS: 1. Where do I change the directory path for the controller, since the current user does not have permissions in /var/log? I can only see INFO-level output; where do I set the logs to DEBUG for now to see the errors in depth? How do I correct them?

[2019-03-13 08:02:19,686] ERROR Error writing backup to the file idealState-backup-env-uReplicatorDev (com.uber.stream.kafka.mirrormaker.controller.core.FileBackUpHandler:48) java.io.FileNotFoundException: /var/log/kafka-mirror-maker-controller/idealState-backup-env-uReplicatorDev (No such file or directory) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:162) at java.io.FileWriter.(FileWriter.java:90) at com.uber.stream.kafka.mirrormaker.controller.core.FileBackUpHandler.writeToFile(FileBackUpHandler.java:43) at com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager.dumpState(ClusterInfoBackupManager.java:188) at com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager$1.run(ClusterInfoBackupManager.java:69) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) [2019-03-13 08:02:19,704] ERROR Failed to take backup with exception (com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager:71) java.io.FileNotFoundException: /var/log/kafka-mirror-maker-controller/idealState-backup-env-uReplicatorDev (No such file or directory) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:162) at java.io.FileWriter.(FileWriter.java:90) at com.uber.stream.kafka.mirrormaker.controller.core.FileBackUpHandler.writeToFile(FileBackUpHandler.java:43) at com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager.dumpState(ClusterInfoBackupManager.java:188) at com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager$1.run(ClusterInfoBackupManager.java:69) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
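
One likely workaround, assuming the controller keeps using the default backup path shown in the stack trace, is to pre-create that directory and give the controller user write access (a sketch):

sudo mkdir -p /var/log/kafka-mirror-maker-controller
sudo chown -R "$(whoami)" /var/log/kafka-mirror-maker-controller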

SIMILAR ERRORS FOR ALL TOPICS ("doesn't match ExternalView"): how do I fix them?

[2019-03-13 08:16:58,666] ERROR For topic:topics1, number of partitions for IdealState: topics1, {IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} doesn't match ExternalView: topics1, {BUCKET_SIZE=0, IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:177) [2019-03-13 08:16:58,674] ERROR For topic:topics2, number of partitions for IdealState: topics2, {IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} doesn't match ExternalView: topics2, {BUCKET_SIZE=0, IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:177) [2019-03-13 08:16:58,695] ERROR For topic:topics3, number of partitions for IdealState: topics3, {IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} doesn't match ExternalView: topics3, {BUCKET_SIZE=0, IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:177) [2019-03-13 08:16:58,725] ERROR For topic:topics4, number of partitions for IdealState: topics4, {IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} doesn't match ExternalView: topics4, {BUCKET_SIZE=0, IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:177) [2019-03-13 08:16:58,731] ERROR For topic:topics5, number of partitions for IdealState: topics5, {IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} doesn't match ExternalView: topics5, {BUCKET_SIZE=0, IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:177)

Gk8143197049 commented 5 years ago

My worker shutdows saying [2019-03-13 09:43:24,032] WARN [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-0]: Error in fetch Name: FetchRequest; Version: 3; CorrelationId: 12; ClientId: kloak-mirrormaker-test; ReplicaId: -1; MaxWait: 100 ms; MinBytes: 1 bytes; MaxBytes:2147483647 bytes; RequestInfo: (connect-offsets-17,PartitionFetchInfo(5574,8388608)),(KafkaCruiseControlModelTrainingSamples-2,PartitionFetchInfo(693664,8388608)),(KafkaCruiseControlModelTrainingSamples-28,PartitionFetchInfo(693646,8388608)),(KafkaCruiseControlPartitionMetricSamples-0,PartitionFetchInfo(10484957,8388608)),(connect-statuses-1,PartitionFetchInfo(161248,8388608)),(KafkaCruiseControlPartitionMetricSamples-2,PartitionFetchInfo(10372823,8388608)),(connect-statuses-4,PartitionFetchInfo(171450,8388608)),(KafkaCruiseControlPartitionMetricSamples-4,PartitionFetchInfo(10260544,8388608)),(connect-offsets-2,PartitionFetchInfo(62969,8388608)),(KafkaCruiseControlModelTrainingSamples-30,PartitionFetchInfo(693646,8388608)),(KafkaCruiseControlPartitionMetricSamples-12,PartitionFetchInfo(10371857,8388608)),(connect-statuses-3,PartitionFetchInfo(83091,8388608)),(connect-offsets-15,PartitionFetchInfo(18954,8388608)),(connect-offsets-12,PartitionFetchInfo(7115,8388608)),(connect-offsets-9,PartitionFetchInfo(0,8388608)),(connect-offsets-4,PartitionFetchInfo(15289,8388608)),(KafkaCruiseControlPartitionMetricSamples-10,PartitionFetchInfo(10372834,8388608)),(connect-statuses-0,PartitionFetchInfo(130572,8388608)),(connect-offsets-22,PartitionFetchInfo(36186,8388608)),(KafkaCruiseControlPartitionMetricSamples-18,PartitionFetchInfo(10316469,8388608)),(connect-offsets-5,PartitionFetchInfo(523448,8388608)),(KafkaCruiseControlModelTrainingSamples-6,PartitionFetchInfo(693655,8388608)),(KafkaCruiseControlPartitionMetricSamples-16,PartitionFetchInfo(10372110,8388608)),(KafkaCruiseControlModelTrainingSamples-22,PartitionFetchInfo(693691,8388608)),(connect-offsets-0,PartitionFetchInfo(0,8388608)),(KafkaCruiseControlPartitionMetricSamples-8,PartitionFetchInfo(10260652,8388608)),(connect-offsets-10,PartitionFetchInfo(0,8388608)),(connect-offsets-7,PartitionFetchInfo(3208,8388608)),(connect-offsets-11,PartitionFetchInfo(3025,8388608)),(_confluent-ksql-default__command_topic-0,PartitionFetchInfo(0,8388608)),(KafkaCruiseControlModelTrainingSamples-16,PartitionFetchInfo(693709,8388608)),(KafkaCruiseControlModelTrainingSamples-24,PartitionFetchInfo(693704,8388608)),(KafkaCruiseControlPartitionMetricSamples-30,PartitionFetchInfo(10203994,8388608)),(KafkaCruiseControlModelTrainingSamples-14,PartitionFetchInfo(693678,8388608)),(connect-offsets-6,PartitionFetchInfo(7631,8388608)),(KafkaCruiseControlPartitionMetricSamples-14,PartitionFetchInfo(10316236,8388608)),(connect-offsets-24,PartitionFetchInfo(0,8388608)),(_schemas-0,PartitionFetchInfo(9568,8388608)),(KafkaCruiseControlModelTrainingSamples-12,PartitionFetchInfo(693701,8388608)),(KafkaCruiseControlModelTrainingSamples-26,PartitionFetchInfo(693695,8388608)),(connect-offsets-14,PartitionFetchInfo(16918,8388608)),(connect-offsets-23,PartitionFetchInfo(8343,8388608)),(KafkaCruiseControlPartitionMetricSamples-20,PartitionFetchInfo(10204351,8388608)),(CruiseControlMetrics-0,PartitionFetchInfo(26967492,8388608)),(connect-offsets-1,PartitionFetchInfo(0,8388608)),(connect-statuses-2,PartitionFetchInfo(133463,8388608)),(connect-offsets-18,PartitionFetchInfo(8167,8388608)),(connect-offsets-8,PartitionFetchInfo(0,8388608)),(connect-offsets-3,PartitionFetchInfo(153942,8388608)),(connect-offse
ts-21,PartitionFetchInfo(0,8388608)),(KafkaCruiseControlPartitionMetricSamples-24,PartitionFetchInfo(10204023,8388608)),(connect-offsets-16,PartitionFetchInfo(1649,8388608)),(KafkaCruiseControlModelTrainingSamples-0,PartitionFetchInfo(693676,8388608)),(KafkaCruiseControlModelTrainingSamples-20,PartitionFetchInfo(693696,8388608)),(connect-offsets-20,PartitionFetchInfo(3577,8388608)),(connect-configs-0,PartitionFetchInfo(494530,8388608)),(connect-offsets-19,PartitionFetchInfo(29607,8388608)),(connect-offsets-13,PartitionFetchInfo(0,8388608)),(KafkaCruiseControlModelTrainingSamples-18,PartitionFetchInfo(693736,8388608)),(KafkaCruiseControlPartitionMetricSamples-6,PartitionFetchInfo(10204440,8388608)),(KafkaCruiseControlModelTrainingSamples-8,PartitionFetchInfo(693698,8388608)),(KafkaCruiseControlModelTrainingSamples-10,PartitionFetchInfo(693724,8388608)),(KafkaCruiseControlModelTrainingSamples-4,PartitionFetchInfo(693735,8388608)),(KafkaCruiseControlPartitionMetricSamples-26,PartitionFetchInfo(10091771,8388608)),(KafkaCruiseControlPartitionMetricSamples-22,PartitionFetchInfo(10148366,8388608)),(__KafkaCruiseControlPartitionMetricSamples-28,PartitionFetchInfo(10113719,8388608)). Possible cause: java.lang.OutOfMemoryError: Java heap space (kafka.mirrormaker.CompactConsumerFetcherThread:70)

[2019-03-13 09:43:24,033] ERROR [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-0]: In FetcherThread error due to (kafka.mirrormaker.CompactConsumerFetcherThread:76) java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30) at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:140) at kafka.network.BlockingChannel.readCompletely(BlockingChannel.scala:131) at kafka.network.BlockingChannel.receive(BlockingChannel.scala:122) at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:102) at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:86) at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:135) at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:135) at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:135) at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:134) at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:134) at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:134) at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:133) at kafka.mirrormaker.CompactConsumerFetcherThread.processFetchRequest(CompactConsumerFetcherThread.scala:242) at kafka.mirrormaker.CompactConsumerFetcherThread.doWork(CompactConsumerFetcherThread.scala:216)at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-03-13 09:43:24,033] ERROR [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-0]: Got OOM or processing error, exit (kafka.mirrormaker.CompactConsumerFetcherThread:74) [2019-03-13 09:43:24,033] ERROR [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-0]: First OOM or processing error, call System.exit(-1); (kafka.mirrormaker.CompactConsumerFetcherThread:74)

[2019-03-13 09:43:27,289] INFO Shutting down pool: java.util.concurrent.ThreadPoolExecutor@59474f18[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] (org.apache.helix.messaging.handling.HelixTaskExecutor:474) [2019-03-13 09:43:27,289] INFO Reset exectuor for msgType: STATE_TRANSITION, pool: java.util.concurrent.ThreadPoolExecutor@1649b0e6[Running, pool size = 98, active threads = 0, queued tasks = 0, completed tasks = 98] (org.apache.helix.messaging.handling.HelixTaskExecutor:530) [2019-03-13 09:43:27,289] INFO Shutting down pool: java.util.concurrent.ThreadPoolExecutor@1649b0e6[Running, pool size = 98, active threads = 0, queued tasks = 0, completed tasks = 98] (org.apache.helix.messaging.handling.HelixTaskExecutor:474) [2019-03-13 09:43:27,363] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,364] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. 
Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. 
Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,365] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. 
Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,366] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. 
Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,367] WARN Default reset method invoked. 
Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] WARN Default reset method invoked. Either because the process longer own this resource or session timedout (org.apache.helix.participant.statemachine.StateModel:67) [2019-03-13 09:43:27,368] INFO Registering bean: ParticipantName=testHelixMirrorMaker01 (org.apache.helix.monitoring.ParticipantStatusMonitor:109)

yangy0000 commented 5 years ago

Looks like your worker ran out of memory.

Gk8143197049 commented 5 years ago

Is it at my source cluster? When I look at my source cluster kafka01, it says: "2019-03-13T11:50:01.334-0700: Total time for which application threads were stopped: 0.0070540 seconds, Stopping threads took: 0.0000253 seconds 2019-03-13T11:50:01.773-0700: [GC pause (G1 Evacuation Pause) (young) Desired survivor size 8388608 bytes, new threshold 15 (max 15)

yangy0000 commented 5 years ago

Can you go to start-worker.sh, increase the memory size, and try again? Btw, have any messages been replicated to the destination yet?
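
For example (a sketch; this assumes your start-worker.sh passes $JAVA_OPTS through to the JVM, as the exec line quoted later in this thread does, and the heap values are placeholders to adjust for your host):

export JAVA_OPTS="-Xms2g -Xmx4g"
./uReplicator-Distribution/target/uReplicator-Distribution-pkg/bin/start-worker.sh \
  --consumer.config ./config/consumer.properties \
  --producer.config ./config/producer.properties \
  --helix.config ./config/helix.properties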

Gk8143197049 commented 5 years ago

I have updated the heap region size to 512 MB (it was 8M earlier). Below is the updated line:
exec "$JAVACMD" $JAVA_OPTS -Dapp_name=uReplicator-Worker -cp uReplicator-Distribution/target/uReplicator-Distribution-0.1-SNAPSHOT-jar-with-dependencies.jar -XX:+UseG1GC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintReferenceGC -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=512M -XX:InitiatingHeapOccupancyPercent=85 -XX:+UnlockExperimentalVMOptions -XX:G1MixedGCLiveThresholdPercent=85 -XX:G1HeapWastePercent=5 \

Gk8143197049 commented 5 years ago

No replication to the destination yet.

Gk8143197049 commented 5 years ago

the controller log says "[2019-03-13 13:10:07,903] ERROR For topic:oracle-20-MERCHANT, number of partitions for IdealState: topic1, {IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{0={testHelixMirrorMaker01=ONLINE}}{} doesn't match ExternalView: oracle-20-MERCHANT, {BUCKET_SIZE=0, IDEAL_STATE_MODE=CUSTOMIZED, MAX_PARTITIONS_PER_INSTANCE=1, NUM_PARTITIONS=1, REBALANCE_MODE=CUSTOMIZED, REPLICAS=1, STATE_MODEL_DEF_REF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT}{}{} (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:177) [2019-03-13 13:10:07,903] WARN Got exception when removing metrics for externalView.topicPartitions.testHelixMirrorMaker01.totalNumber (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:319) java.lang.NullPointerException at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.updatePerWorkerEVMetrics(ValidationManager.java:317) at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.validateExternalView(ValidationManager.java:210) at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager$1.run(ValidationManager.java:84) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Gk8143197049 commented 5 years ago

The controller says: [2019-03-13 13:31:36,154] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) [2019-03-13 13:31:36,165] ERROR Error registering metrics! (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:304) java.lang.NullPointerException at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.updatePerWorkerEVMetrics(ValidationManager.java:301) at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.validateExternalView(ValidationManager.java:210) at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager$1.run(ValidationManager.java:84) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) 2019-03-13T13:31:37.157-0700: Total time for which application threads were stopped: 0.0002224 seconds, Stopping threads took: 0.0001259 seconds [2019-03-13 13:32:36,136] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) [2019-03-13 13:33:01,048] INFO AutoRebalanceLiveInstanceChangeListener.rebalanceCurrentCluster() wakes up! (com.uber.stream.kafka.mirrormaker.controller.core.AutoRebalanceLiveInstanceChangeListener:200) [2019-03-13 13:33:01,051] INFO No assignment got changed, do nothing! 
(com.uber.stream.kafka.mirrormaker.controller.core.AutoRebalanceLiveInstanceChangeListener:226) [2019-03-13 13:33:36,145] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:33:36.192-0700: Total time for which application threads were stopped: 0.0001266 seconds, Stopping threads took: 0.0000325 seconds [2019-03-13 13:34:36,137] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:34:36.199-0700: Total time for which application threads were stopped: 0.0001149 seconds, Stopping threads took: 0.0000222 seconds [2019-03-13 13:35:36,135] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) [2019-03-13 13:36:36,145] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:37:07.313-0700: Total time for which application threads were stopped: 0.0006683 seconds, Stopping threads took: 0.0005770 seconds [2019-03-13 13:37:36,137] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:37:36.326-0700: Total time for which application threads were stopped: 0.0001337 seconds, Stopping threads took: 0.0000358 seconds [2019-03-13 13:38:36,135] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:39:26.357-0700: Total time for which application threads were stopped: 0.0006990 seconds, Stopping threads took: 0.0005737 seconds [2019-03-13 13:39:36,153] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:39:36.365-0700: Total time for which application threads were stopped: 0.0001070 seconds, Stopping threads took: 0.0000266 seconds [2019-03-13 13:40:36,140] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:40:36.388-0700: Total time for which application threads were stopped: 0.0001364 seconds, Stopping threads took: 0.0000414 seconds [2019-03-13 13:41:36,149] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:41:36.417-0700: Total time for which application threads were stopped: 0.0001539 seconds, Stopping threads took: 0.0000460 seconds [2019-03-13 13:42:36,136] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:42:36.430-0700: Total time for which application threads were stopped: 0.0001555 seconds, Stopping threads took: 0.0000532 seconds [2019-03-13 13:43:36,135] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:43:36.460-0700: Total time for which application threads were stopped: 0.0001258 seconds, Stopping threads took: 0.0000370 seconds 2019-03-13T13:44:06.491-0700: Total time for which application threads were stopped: 0.0042761 seconds, Stopping threads took: 0.0000446 seconds [2019-03-13 13:44:36,136] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:44:36.494-0700: Total time for which application threads were stopped: 0.0001369 seconds, Stopping threads took: 0.0000266 seconds 
[2019-03-13 13:45:36,154] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) [2019-03-13 13:45:48,095] INFO Scan statusUpdates and errors for cluster: uReplicatorDev, by controller: org.apache.helix.manager.zk.ZKHelixManager@2f09fe81 (org.apache.helix.monitoring.ZKPathDataDumpTask:67) 2019-03-13T13:46:18.587-0700: Total time for which application threads were stopped: 0.0001589 seconds, Stopping threads took: 0.0000660 seconds [2019-03-13 13:46:36,136] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) [2019-03-13 13:47:36,138] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) 2019-03-13T13:47:36.606-0700: Total time for which application threads were stopped: 0.0001283 seconds, Stopping threads took: 0.0000366 seconds [2019-03-13 13:48:36,138] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83) [2019-03-13 13:49:36,138] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83)

The worker says: 1246.340: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 2902, predicted base time: 5.19 ms, remaining time: 194.81 ms, target pause time: 200.00 ms] 1246.340: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 9 regions, survivors: 2 regions, predicted young region time: 16.98 ms] 1246.340: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 9 regions, survivors: 2 regions, old: 0 regions, predicted pause time: 22.18 ms, target pause time: 200.00 ms] 2019-03-13T13:51:43.520-0700: [SoftReference, 0 refs, 0.0003968 secs]2019-03-13T13:51:43.520-0700: [WeakReference, 0 refs, 0.0001228 secs]2019-03-13T13:51:43.520-0700: [FinalReference, 0 refs, 0.0001053 secs]2019-03-13T13:51:43.520-0700: [PhantomReference, 0 refs, 0 refs, 0.0001648 secs]2019-03-13T13:51:43.520-0700: [JNI Weak Reference, 0.0000098 secs], 0.0159517 secs] [Parallel Time: 14.7 ms, GC Workers: 2] [GC Worker Start (ms): Min: 1246340.8, Avg: 1246341.3, Max: 1246341.8, Diff: 1.0] [Ext Root Scanning (ms): Min: 0.1, Avg: 0.5, Max: 1.0, Diff: 0.9, Sum: 1.1] [Update RS (ms): Min: 1.7, Avg: 1.7, Max: 1.7, Diff: 0.0, Sum: 3.3] [Processed Buffers: Min: 6, Avg: 8.0, Max: 10, Diff: 4, Sum: 16] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0] [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0] [Object Copy (ms): Min: 11.6, Avg: 11.7, Max: 11.7, Diff: 0.0, Sum: 23.3] [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0] [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2] [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0] [GC Worker Total (ms): Min: 13.4, Avg: 13.9, Max: 14.4, Diff: 1.0, Sum: 27.8] [GC Worker End (ms): Min: 1246355.2, Avg: 1246355.2, Max: 1246355.2, Diff: 0.0] [Code Root Fixup: 0.0 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.1 ms] [Other: 1.1 ms] [Choose CSet: 0.0 ms] [Ref Proc: 0.9 ms] [Ref Enq: 0.0 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.0 ms] [Eden: 288.0M(288.0M)->0.0B(256.0M) Survivors: 64.0M->64.0M Heap: 688.0M(960.0M)->448.8M(960.0M)] [Times: user=0.03 sys=0.00, real=0.01 secs] 2019-03-13T13:51:43.521-0700: Total time for which application threads were stopped: 0.0161888 seconds, Stopping threads took: 0.0000244 seconds 2019-03-13T13:51:44.442-0700: [GC pause (G1 Evacuation Pause) (young) Desired survivor size 33554432 bytes, new threshold 1 (max 15)

xhl1988 commented 5 years ago

You can add the following to your controller start command to remove the metrics NullPointerException.

-graphiteHost $METRICS_REPORTER_GRAPHITE_HOST | your graphite host
-graphitePort $METRICS_REPORTER_GRAPHITE_PORT | your graphite port
-metricsPrefix $METRICS_REPORTER_PREFIX | metric prefix
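
For example, the controller command from earlier with these options appended might look like this (the Graphite host, port, and prefix are placeholders; 2003 is the typical Graphite plaintext port):

./uReplicator-Distribution/target/uReplicator-Distribution-pkg/bin/start-controller.sh \
  -port 9000 -helixClusterName uReplicatorDev -refreshTimeInSeconds 10 -enableAutoWhitelist true \
  -srcKafkaZkPath kafka01.example.com:2181 -destKafkaZkPath kafka02.example.com:2181 \
  -zookeeper kafka01.example.com:2181 \
  -graphiteHost graphite.example.com -graphitePort 2003 -metricsPrefix ureplicator.dev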

Worker log seems all fine.

Even with the NullPointerException, does your uReplicator start to replicate data to the destination?

Gk8143197049 commented 5 years ago

No, I don't see any data replicating from the source cluster topics to the target cluster. I checked both my source and destination clusters.

How do we deal with file not fond exception. [2019-03-13 14:23:14,189] ERROR Error writing backup to the file idealState-backup-env-uReplicatorDev (com.uber.stream.kafka.mirrormaker.controller.core.FileBackUpHandler:48) java.io.FileNotFoundException: /var/log/kafka-mirror-maker-controller/idealState-backup-env-uReplicatorDev (No such file or directory) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:162) at java.io.FileWriter.(FileWriter.java:90) at com.uber.stream.kafka.mirrormaker.controller.core.FileBackUpHandler.writeToFile(FileBackUpHandler.java:43) at com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager.dumpState(ClusterInfoBackupManager.java:188) at com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager$1.run(ClusterInfoBackupManager.java:69) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) [2019-03-13 14:23:14,191] ERROR Failed to take backup with exception (com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager:71) java.io.FileNotFoundException: /var/log/kafka-mirror-maker-controller/idealState-backup-env-uReplicatorDev (No such file or directory) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:162) at java.io.FileWriter.(FileWriter.java:90) at com.uber.stream.kafka.mirrormaker.controller.core.FileBackUpHandler.writeToFile(FileBackUpHandler.java:43) at com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager.dumpState(ClusterInfoBackupManager.java:188) at com.uber.stream.kafka.mirrormaker.controller.core.ClusterInfoBackupManager$1.run(ClusterInfoBackupManager.java:69) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)

Gk8143197049 commented 5 years ago

I think I gave the wrong values in the consumer properties. Can you take a look at the chain above? My source cluster (kafka01) is where I have installed the uReplicator controller and worker; what parameters should I pass, precisely?

xhl1988 commented 5 years ago
  1. Can you do curl localhost:9000/topics and make sure there are topics whitelisted?
  2. Pick one topic and do curl localhost:9000/topics/<topic> | jq . and see if every partition is assigned a worker and the externalView/idealState match.
  3. Is there new data being produced to your topic? By default, uReplicator only replicates new data after whitelisting. If you want all existing data, put auto.offset.reset=smallest in consumer.properties.

Gk8143197049 commented 5 years ago

I am following your directions, but can you check whether my producer properties are right?

Thanks

xhl1988 commented 5 years ago

Your producer properties look fine. We might need more information from the logs to identify what's going wrong.

xhl1988 commented 5 years ago

I think the issue is within the worker. Can you paste more of the worker log (or the full log) after you have done the above steps?

Gk8143197049 commented 5 years ago

Sure. My destination cluster (a cloud server) is turned off right now. I will paste the logs tomorrow morning.

Gk8143197049 commented 5 years ago

I have attached the log; please take a look at it. I do not see any data transferring from the source cluster to the target cluster.

workerlog.docx

xhl1988 commented 5 years ago

Is that the GC log? Do you have the application log?

Btw, what's the result of the steps above? Can you also paste it here?

Gk8143197049 commented 5 years ago

I have attached the worker logs. I have data flowing into the topic on the source cluster.

  1. [kafka@kafka01 confluent-5.1.0]$ curl -X POST -d '{"topic":"oracle-24-xxxxxxxxxxxx", "numPartitions":"1"}' http://localhost:9000/topics
     Successfully add new topic: {topic: oracle-24-xxxxxxxxxx, partition: 1, pipeline: null}

  2. [kafka@kafka01 confluent-5.1.0]$ curl localhost:9000/topics
     Current serving topics: [oracle-20-xxxxxxxxxxx, oracle-23-xxxx, oracle-24-xxxxxxxx]

  3. [kafka@kafka01 confluent-5.1.0]$ curl localhost:9000/topics/oracle-24-xxxxxxxx | jq
     {
       "externalView": { "0": [ "testHelixMirrorMaker01" ] },
       "idealState": { "0": [ "testHelixMirrorMaker01" ] },
       "serverToNumPartitionsMapping": { "testHelixMirrorMaker01": 1 },
       "serverToPartitionMapping": { "testHelixMirrorMaker01": [ "0" ] },
       "topic": "oracle-24-xxxxxxxx"
     }

workerlog.txt

xhl1988 commented 5 years ago

The worker did receive the partition assignment and tried to consume from the source cluster, but failed:

[2019-03-15 07:44:11,089] WARN Unexpected error code: 76. (org.apache.kafka.common.protocol.Errors:719)

From the Kafka source code, error 76 means:

UNSUPPORTED_COMPRESSION_TYPE(76, "The requesting client does not support the compression type of given partition.", UnsupportedCompressionTypeException::new);

What compression type do you use in your source cluster?
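One quick way to check (a sketch; adjust the paths to wherever your broker and producer configs actually live, and note that individual topics can carry their own compression.type override):

# broker- and producer-level settings on the source cluster (paths are assumptions)
grep -i compression.type /path/to/server.properties
grep -i compression.type /path/to/producer.properties

# topic-level override, using Confluent 5.1 tooling and placeholder ZooKeeper/topic names
kafka-configs --zookeeper <source-zk>:2181 --describe --entity-type topics --entity-name <topic>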

Gk8143197049 commented 5 years ago

The source cluster has compression.type=none at the producer. The destination cluster also has compression.type=none at the producer.

Thanks

Gk8143197049 commented 5 years ago

Do you want me to change the properties? Do let me know; I can do it.

Thanks, I really appreciate your time!

xhl1988 commented 5 years ago

What do you mean by producer cluster? There should be a compression.type=<> setting in your broker config; what is its value on your brokers?

Gk8143197049 commented 5 years ago

@xhl1988 -- We can configure the compression type either in server.properties or in producer.properties. I have two different properties files: server.properties and producer.properties.

server.properties

listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://your.host.name:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=xx/kafka-logs
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.flush.interval.messages=100
log.flush.interval.ms=10
log.retention.hours=1
log.segment.bytes=1073741824
log.retention.check.interval.ms=30
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
confluent.support.metrics.enable=false
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
group.initial.rebalance.delay.ms=0

producer.properties

bootstrap.servers=localhost:9092,localhost:9093
# specify the compression codec for all data generated: none, gzip, snappy, lz4, zstd
compression.type=none

Gk8143197049 commented 5 years ago

Do let me know which one uReplicator supports; I can change it accordingly.

xhl1988 commented 5 years ago

We are able to reproduce the same error locally if there is any message with the zstd compression type.

  1. Can you set compression.type=uncompressed in the broker's server.properties?
  2. Can you delete all existing messages for those topics and try again? (See the sketch below.)
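A minimal sketch of those two steps, assuming Confluent 5.1 command names and hypothetical host/topic names; the retention trick is just one common way to drop existing messages and is not something uReplicator prescribes:

# step 1: in the source broker's server.properties, then restart the broker
compression.type=uncompressed

# step 2: temporarily shrink retention so existing (possibly zstd) messages are removed
kafka-configs --zookeeper kafka01.example.com:2181 --alter --entity-type topics --entity-name oracle-24-example --add-config retention.ms=1000
# ...wait for the retention check to purge the log, then restore the default:
kafka-configs --zookeeper kafka01.example.com:2181 --alter --entity-type topics --entity-name oracle-24-example --delete-config retention.ms
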
Gk8143197049 commented 5 years ago

I have changed compression.type on the source brokers, and I no longer see any 76 (unsupported compression type) errors. However, I do not see any data flowing from the source to the destination cluster.

I egrepped for 76 and was not able to find any errors. What do you think the problem could be?

workerlog1.docx

xhl1988 commented 5 years ago

The new logs seem all good. Are you producing new messages to the topic?

Edit: it looks like the worker copied 442848 messages for oracle-23-xxx; can you check that?
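One way to check how many messages have actually landed on the destination topic (a sketch, assuming Confluent 5.1 tooling and hypothetical host/topic names) is to query the latest offsets on the target cluster and compare them with the offsets in the worker's "Topic partitions dump" log lines:

# --time -1 asks for the latest offset of each partition on the destination cluster
kafka-run-class kafka.tools.GetOffsetShell --broker-list kafka02.example.com:9092 --topic oracle-23-example --time -1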

xhl1988 commented 5 years ago

You can add -Xloggc:<path to gc log> to remove the GC info from the application log.
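For example, the flag could look like this, added to the java command line inside start-worker.sh (exactly where to put it depends on the script, so treat this as a sketch):

# write GC output to its own file instead of mixing it into the application log
-Xloggc:/tmp/urep-gc.log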

Gk8143197049 commented 5 years ago

Yes, I am producing new messages. I am surprised I don't see any data in the target cluster.

Can I add -Xloggc to the worker script?

Thanks.

xhl1988 commented 5 years ago
  1. Can you use the same method to see if you can consume messages of topic oracle-23-MIF-0 from the source cluster?
  2. Yes, in the start script; maybe just output to /tmp/urep-gc.log.

Your log seems fine:

[2019-03-15 11:35:02,467] INFO [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-1]: Topic partitions dump in fetcher thread: ArrayBuffer([oracle-24-MIF_CHANGES-0, Offset 281329869] , [oracle-23-MIF-0, Offset 0] ) 

[2019-03-15 11:40:02,517] INFO [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-1]: Topic partitions dump in fetcher thread: ArrayBuffer([oracle-24-MIF_CHANGES-0, Offset 281329869] , [oracle-23-MIF-0, Offset 442848] ) (kafka.mirrormaker.CompactConsumerFetcherThread:66)

It means you have copied 442848 messages for oracle-23-MIF-0. Can you check the committed offset in ZooKeeper at <srckafkaZkpath>:/consumer/kloak-mirrormaker-test/offsets/oracle-23-MIF-0/0 and see if it is at least 442848?

Also, after you remove the GC log, you can collect the logs for a longer time and see if the offsets in the log increase.

xhl1988 commented 5 years ago

You can periodically check the committed offset; if it increases, then the replication should be working.
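A small sketch of such a periodic check, driving zookeeper-shell against the source ZooKeeper (hypothetical hostname; the offset node path follows the old-consumer convention and may differ in your setup):

# print the committed offset once a minute and watch whether it grows
while true; do
  echo "get /consumers/kloak-mirrormaker-test/offsets/oracle-23-MIF-0/0" | bin/zookeeper-shell kafka01.example.com:2181
  sleep 60
done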

Gk8143197049 commented 5 years ago

I tried removing one of the producer hostnames. What ports are used by the workers, controller, manager, and Helix? Can you let me know? I think this may be a network issue.

Gk8143197049 commented 5 years ago

@xhl1988 Does this mean the topic replicas are being dumped to kafka01, to another broker on port 9093 in the same cluster? If yes, how do I change it to a broker sitting on another cluster? The source broker is kafka01:9092.

Logs: cat nohup.out | egrep ArrayBuffer

[2019-03-19 10:17:59,986] INFO [CompactConsumerFetcherManager-1553015862247] Fetcher Thread for topic partitions: ArrayBuffer([oracle-23-MIF-0, InitialOffset -1] ) is CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-1 (kafka.mirrormaker.CompactConsumerFetcherManager:66)
[2019-03-19 10:17:59,987] INFO [CompactConsumerFetcherManager-1553015862247] Added fetcher for partitions ArrayBuffer([oracle-23-xxx-0, initOffset -1 to broker BrokerEndPoint(1,kafka01.xxxxxxx,9093)] ) (kafka.mirrormaker.CompactConsumerFetcherManager:66)
[2019-03-19 10:17:59,988] INFO [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-1]: Topic partitions dump in fetcher thread: ArrayBuffer([oracle-23-xxx-0, Offset 0] ) (kafka.mirrormaker.CompactConsumerFetcherThread:66)
[2019-03-19 10:23:00,108] INFO [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-1]: Topic partitions dump in fetcher thread: ArrayBuffer([oracle-23-xxx-0, Offset 416354] ) (kafka.mirrormaker.CompactConsumerFetcherThread:66)
[2019-03-19 10:28:00,189] INFO [CompactConsumerFetcherThread-kloak-mirrormaker-test_kloakmms01-sjc1-0-1]: Topic partitions dump in fetcher thread: ArrayBuffer([oracle-23-xxx-0, Offset 1057382] ) (kafka.mirrormaker.CompactConsumerFetcherThread:66)

yangy0000 commented 5 years ago

For the source cluster, uReplicator gets the broker list by using zookeeper.connect in consumer.properties. I saw you set zookeeper.connect=kafka01.xxxxxxxxxxxxxxxx:2181. How many Kafka clusters are on this ZooKeeper?
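For context: the brokers the fetcher connects to are whatever brokers have registered under that zookeeper.connect string. If the source brokers register under a chroot path (an assumption about your layout), consumer.properties must point at exactly that path; a sketch with hypothetical hostname and chroot:

# consumer.properties
zookeeper.connect=kafka01.example.com:2181/kafka-source
zookeeper.connection.timeout.ms=30000
zookeeper.session.timeout.ms=30000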

Gk8143197049 commented 5 years ago

The kafka01.xxxxxxxx:2181 ZooKeeper has 2 brokers in the same cluster on different ports (9092, 9093), and the target cluster (kafka01.) has 2 brokers (9092, 9093) on a different ZooKeeper (kafka01.:2181).

Gk8143197049 commented 5 years ago

Could this be a network or security issue contacting the other server? I think we should put the logs in DEBUG mode to see the error more clearly.
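If you do want more verbose worker logging, a minimal log4j.properties sketch (assuming the worker reads a standard log4j 1.x configuration from its classpath; the exact file and location depend on how you package and start it) could look like:

log4j.rootLogger=DEBUG, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n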

Gk8143197049 commented 5 years ago

Note: the target cluster is a cloud AWS server.

xhl1988 commented 5 years ago

From your log, the worker is definitely working. Have you checked the committed offset in your source ZooKeeper?

Gk8143197049 commented 5 years ago

I tried to look from the ZooKeeper shell, and the committed offset 442848 does not show up.

xhl1988 commented 5 years ago

Can you show me the command you ran and the result?

Gk8143197049 commented 5 years ago

I ran the zookeeper-shell command:

bin/zookeeper-shell xxxxxxxxxxxxxxxxxxx.com:2181
Connecting to xxxxxxxxxxxxx:2181
Welcome to ZooKeeper!
JLine support is enabled

[zk: xxxxxxxxxxxxxxxxxxxxxx:2181(CONNECTING) 0]
WATCHER::

WatchedEvent state:SyncConnected type:None path:null

[zk: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:2181(CONNECTED) 0] GET consumer/kloak-mirrormaker-test/offsets/oracle-23-MIF-0/0

It does not give any value.
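One likely reason this returns nothing (my reading, not confirmed in the thread): zookeeper-shell commands are lowercase, paths need a leading slash, and old-consumer offsets are conventionally stored under /consumers (plural). A corrected attempt in the same shell would look like:

ls /consumers
get /consumers/kloak-mirrormaker-test/offsets/oracle-23-MIF-0/0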