uber / uReplicator

Improvement of Apache Kafka MirrorMaker
Apache License 2.0

Can 1 Controller connect to more than 1 Worker in Kubernetes? #280

Open dungnt081191 opened 5 years ago

dungnt081191 commented 5 years ago

Hi everyone. I'm running uReplicator in non-federated mode on Kubernetes with 1 Controller and 3 Workers (replicas: 3). From the logs below, 2 of the Workers are not working; only 1 Worker connects and replicates. (That one Worker works fine with these resources:

        resources:
          requests:
            cpu: 100m
            memory: 1500Mi

Log Worker 1 - FAIL TO CONNECT:

2019-10-23T08:34:12.872+0000: Total time for which application threads were stopped: 0.0266896 seconds, Stopping threads took: 0.0000626 seconds
[2019-10-23 08:34:12,903] INFO instance: testHelixMirrorMaker01 auto-joining uReplicatorDev is true (org.apache.helix.manager.zk.ParticipantManager:131)
[2019-10-23 08:34:13,014] WARN found another instance with same instanceName: testHelixMirrorMaker01 in cluster uReplicatorDev (org.apache.helix.manager.zk.ParticipantManager:184)
2019-10-23T08:34:13.754+0000: Total time for which application threads were stopped: 0.0003128 seconds, Stopping threads took: 0.0000342 seconds
2019-10-23T08:34:20.755+0000: Total time for which application threads were stopped: 0.0001660 seconds, Stopping threads took: 0.0000352 seconds
2019-10-23T08:34:21.755+0000: Total time for which application threads were stopped: 0.0002317 seconds, Stopping threads took: 0.0000356 seconds
2019-10-23T08:34:22.756+0000: Total time for which application threads were stopped: 0.0001658 seconds, Stopping threads took: 0.0000311 seconds
2019-10-23T08:34:25.756+0000: Total time for which application threads were stopped: 0.0002344 seconds, Stopping threads took: 0.0000351 seconds
2019-10-23T08:34:26.757+0000: Total time for which application threads were stopped: 0.0001314 seconds, Stopping threads took: 0.0000402 seconds
[2019-10-23 08:34:48,049] ERROR instance: testHelixMirrorMaker01 already has a live-instance in cluster uReplicatorDev (org.apache.helix.manager.zk.ParticipantManager:244)
2019-10-23T08:34:48.052+0000: Total time for which application threads were stopped: 0.0004388 seconds, Stopping threads took: 0.0000706 seconds
[2019-10-23 08:34:48,050] ERROR fail to createClient. (org.apache.helix.manager.zk.ZKHelixManager:496)
org.apache.helix.HelixException: instance: testHelixMirrorMaker01 already has a live-instance in cluster uReplicatorDev
    at org.apache.helix.manager.zk.ParticipantManager.createLiveInstance(ParticipantManager.java:245)
    at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:112)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:900)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:866)
    at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:493)
    at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:531)
    at kafka.mirrormaker.WorkerInstance.addToHelixController(WorkerInstance.scala:340)
    at kafka.mirrormaker.WorkerInstance.start(WorkerInstance.scala:250)
    at kafka.mirrormaker.MirrorMakerWorker.main(MirrorMakerWorker.scala:109)
    at com.uber.stream.kafka.mirrormaker.starter.MirrorMakerStarter.main(MirrorMakerStarter.java:44)
[2019-10-23 08:34:48,053] ERROR fail to connect testHelixMirrorMaker01 (org.apache.helix.manager.zk.ZKHelixManager:534)
org.apache.helix.HelixException: instance: testHelixMirrorMaker01 already has a live-instance in cluster uReplicatorDev
    at org.apache.helix.manager.zk.ParticipantManager.createLiveInstance(ParticipantManager.java:245)
    at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:112)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:900)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:866)
    at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:493)
    at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:531)
    at kafka.mirrormaker.WorkerInstance.addToHelixController(WorkerInstance.scala:340)
    at kafka.mirrormaker.WorkerInstance.start(WorkerInstance.scala:250)
    at kafka.mirrormaker.MirrorMakerWorker.main(MirrorMakerWorker.scala:109)
    at com.uber.stream.kafka.mirrormaker.starter.MirrorMakerStarter.main(MirrorMakerStarter.java:44)
[2019-10-23 08:34:48,054] INFO Is not shutting down; call cleanShutdown() (kafka.mirrormaker.WorkerInstance:66)
[2019-10-23 08:34:48,054] INFO Start clean shutdown. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-23 08:34:48,054] INFO Flushing last batches and commit offsets. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-23 08:34:48,055] INFO Flushing producer. (kafka.mirrormaker.WorkerInstance:66)
Exception in thread "main" java.lang.NullPointerException
    at kafka.mirrormaker.WorkerInstance.maybeFlushAndCommitOffsets(WorkerInstance.scala:346)
    at kafka.mirrormaker.WorkerInstance.cleanShutdown(WorkerInstance.scala:385)
    at kafka.mirrormaker.WorkerInstance$WorkerZKHelixManager.disconnect(WorkerInstance.scala:328)
    at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:535)
    at kafka.mirrormaker.WorkerInstance.addToHelixController(WorkerInstance.scala:340)
    at kafka.mirrormaker.WorkerInstance.start(WorkerInstance.scala:250)
    at kafka.mirrormaker.MirrorMakerWorker.main(MirrorMakerWorker.scala:109)
    at com.uber.stream.kafka.mirrormaker.starter.MirrorMakerStarter.main(MirrorMakerStarter.java:44)
2019-10-23T08:35:04.055+0000: Total time for which application threads were stopped: 0.0001446 seconds, Stopping threads took: 0.0000401 seconds
2019-10-23T08:35:08.056+0000: Total time for which application threads were stopped: 0.0001423 seconds, Stopping threads took: 0.0000332 seconds
2019-10-23T08:36:13.065+0000: Total time for which application threads were stopped: 0.0003642 seconds, Stopping threads took: 0.0001302 seconds
2019-10-23T08:38:09.082+0000: Total time for which application threads were stopped: 0.0003055 seconds, Stopping threads took: 0.0002081 seconds

Log Worker 2 - FAIL TO CONNECT:

   [Eden: 24576.0K(24576.0K)->0.0B(65536.0K) Survivors: 8192.0K->8192.0K Heap: 29812.9K(120.0M)->9444.5K(120.0M)]
 [Times: user=0.02 sys=0.01, real=0.05 secs] 
2019-10-23T08:34:12.220+0000: Total time for which application threads were stopped: 0.0548126 seconds, Stopping threads took: 0.0000321 seconds
[2019-10-23 08:34:12,269] INFO instance: testHelixMirrorMaker01 auto-joining uReplicatorDev is true (org.apache.helix.manager.zk.ParticipantManager:131)
2019-10-23T08:34:12.273+0000: Total time for which application threads were stopped: 0.0003265 seconds, Stopping threads took: 0.0000291 seconds
[2019-10-23 08:34:12,398] WARN found another instance with same instanceName: testHelixMirrorMaker01 in cluster uReplicatorDev (org.apache.helix.manager.zk.ParticipantManager:184)
2019-10-23T08:34:13.274+0000: Total time for which application threads were stopped: 0.0001932 seconds, Stopping threads took: 0.0000356 seconds
2019-10-23T08:34:17.275+0000: Total time for which application threads were stopped: 0.0001489 seconds, Stopping threads took: 0.0000422 seconds
2019-10-23T08:34:22.276+0000: Total time for which application threads were stopped: 0.0001529 seconds, Stopping threads took: 0.0000294 seconds
[2019-10-23 08:34:47,437] ERROR instance: testHelixMirrorMaker01 already has a live-instance in cluster uReplicatorDev (org.apache.helix.manager.zk.ParticipantManager:244)
Exception in thread "main" java.lang.NullPointerException
    at kafka.mirrormaker.WorkerInstance.maybeFlushAndCommitOffsets(WorkerInstance.scala:346)
    at kafka.mirrormaker.WorkerInstance.cleanShutdown(WorkerInstance.scala:385)
    at kafka.mirrormaker.WorkerInstance$WorkerZKHelixManager.disconnect(WorkerInstance.scala:328)
    at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:535)
    at kafka.mirrormaker.WorkerInstance.addToHelixController(WorkerInstance.scala:340)
    at kafka.mirrormaker.WorkerInstance.start(WorkerInstance.scala:250)
    at kafka.mirrormaker.MirrorMakerWorker.main(MirrorMakerWorker.scala:109)
    at com.uber.stream.kafka.mirrormaker.starter.MirrorMakerStarter.main(MirrorMakerStarter.java:44)
[2019-10-23 08:34:47,439] ERROR fail to createClient. (org.apache.helix.manager.zk.ZKHelixManager:496)
org.apache.helix.HelixException: instance: testHelixMirrorMaker01 already has a live-instance in cluster uReplicatorDev
    at org.apache.helix.manager.zk.ParticipantManager.createLiveInstance(ParticipantManager.java:245)
    at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:112)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:900)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:866)
    at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:493)
    at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:531)
    at kafka.mirrormaker.WorkerInstance.addToHelixController(WorkerInstance.scala:340)
    at kafka.mirrormaker.WorkerInstance.start(WorkerInstance.scala:250)
    at kafka.mirrormaker.MirrorMakerWorker.main(MirrorMakerWorker.scala:109)
    at com.uber.stream.kafka.mirrormaker.starter.MirrorMakerStarter.main(MirrorMakerStarter.java:44)
[2019-10-23 08:34:47,442] ERROR fail to connect testHelixMirrorMaker01 (org.apache.helix.manager.zk.ZKHelixManager:534)
org.apache.helix.HelixException: instance: testHelixMirrorMaker01 already has a live-instance in cluster uReplicatorDev
    at org.apache.helix.manager.zk.ParticipantManager.createLiveInstance(ParticipantManager.java:245)
    at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:112)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:900)
    at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:866)
    at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:493)
    at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:531)
    at kafka.mirrormaker.WorkerInstance.addToHelixController(WorkerInstance.scala:340)
    at kafka.mirrormaker.WorkerInstance.start(WorkerInstance.scala:250)
    at kafka.mirrormaker.MirrorMakerWorker.main(MirrorMakerWorker.scala:109)
    at com.uber.stream.kafka.mirrormaker.starter.MirrorMakerStarter.main(MirrorMakerStarter.java:44)
[2019-10-23 08:34:47,443] INFO Is not shutting down; call cleanShutdown() (kafka.mirrormaker.WorkerInstance:66)
[2019-10-23 08:34:47,443] INFO Start clean shutdown. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-23 08:34:47,443] INFO Flushing last batches and commit offsets. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-23 08:34:47,444] INFO Flushing producer. (kafka.mirrormaker.WorkerInstance:66)
2019-10-23T08:35:03.282+0000: Total time for which application threads were stopped: 0.0002168 seconds, Stopping threads took: 0.0001056 seconds
2019-10-23T08:35:52.290+0000: Total time for which application threads were stopped: 0.0001672 seconds, Stopping threads took: 0.0000682 seconds
2019-10-23T08:37:58.309+0000: Total time for which application threads were stopped: 0.0001976 seconds, Stopping threads took: 0.0000903 seconds

I have some questions: 1 - Does anyone know the recommended resources? 2 - How can I use 1 Controller with multiple Workers so that it works reliably every time and improves performance?

dungnt081191 commented 5 years ago

@yangy0000 @xhl1988 Can you clarify my issue?

yangy0000 commented 5 years ago

You need to use a different instanceId for each worker.
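
For reference, one way to get a unique instanceId per replica in Kubernetes is to run the workers as a StatefulSet and derive the id from the pod name via the downward API. A minimal sketch, assuming the worker reads its instance id from an instanceId property in a Helix config file (the image name, config path, property name, and start script below are placeholders, not the project's actual layout; adjust them to your deployment):

    # Sketch: give each worker replica a unique Helix instance id derived from its pod name.
    apiVersion: apps/v1
    kind: StatefulSet                 # stable, unique pod names: worker-0, worker-1, ...
    metadata:
      name: ureplicator-worker
    spec:
      serviceName: ureplicator-worker
      replicas: 3
      selector:
        matchLabels:
          app: ureplicator-worker
      template:
        metadata:
          labels:
            app: ureplicator-worker
        spec:
          containers:
            - name: worker
              image: ureplicator-worker:latest      # placeholder image
              env:
                - name: POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name      # e.g. ureplicator-worker-0
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # write a unique instanceId before starting the worker
                  # (config path and property name are assumptions)
                  sed -i "s/^instanceId=.*/instanceId=${POD_NAME}/" /ureplicator/config/helix.properties
                  exec /ureplicator/bin/start-worker.sh           # placeholder start script
              resources:
                requests:
                  cpu: 100m
                  memory: 1500Mi

A plain Deployment with any unique value would also avoid the "already has a live-instance" error, but StatefulSet pod names stay stable across restarts, which keeps the Helix live-instance names predictable.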

dungnt081191 commented 5 years ago

@yangy0000 Hi, that works, but now I have a critical issue with message loss: running uReplicator in Kubernetes with 1 Controller + 2 Workers works fine while all pods are up. But when I delete a Worker pod, the new pod comes up and the topic is not replicated anymore.

dungnt081191 commented 5 years ago

The solution is a graceful shutdown of the Worker: set a preStop hook in Kubernetes with a shell command that kills the Worker PID. With that, Worker restarts are okay and the lagged messages are replicated correctly to the destination topic.
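
For anyone hitting the same thing, here is a minimal sketch of such a preStop hook together with a longer termination grace period, so the worker JVM receives SIGTERM and has time to run its clean shutdown (flush and commit offsets) before Kubernetes force-kills the pod. The pkill/pgrep pattern, grace period, and image name are assumptions; match them to what actually runs in your container:

    # Sketch: pod spec fragment with a graceful-shutdown preStop hook for the worker.
    spec:
      terminationGracePeriodSeconds: 120     # time for the worker to flush and commit offsets
      containers:
        - name: worker
          image: ureplicator-worker:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  # send SIGTERM to the worker JVM so it runs its clean shutdown path,
                  # then wait until the process has exited (process pattern is an assumption)
                  - |
                    pkill -TERM -f MirrorMakerStarter
                    while pgrep -f MirrorMakerStarter > /dev/null; do sleep 1; done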