uber / uReplicator

Improvement of Apache Kafka Mirrormaker
Apache License 2.0

Urgent - Worker restart always results in a number of duplicate messages in the topic #282

Open dungnt081191 opened 5 years ago

dungnt081191 commented 5 years ago

Hi everyone, I'm implementing uReplicator from the latest master branch (commit 70ac3404bb4517756c6813d3962b77edc7700019). As with other commits on master, after a Worker restart there are always lots of duplicate messages in the destination topic.

Here is an overview of my steps:

Step 1: Build the latest code

Step 2: Build the Docker image with this Dockerfile:

FROM openjdk:8-jre

COPY confd-0.15.0-linux-amd64 /usr/local/bin/confd
COPY uReplicator/uReplicator-Distribution/target/uReplicator-Distribution-pkg /uReplicator
COPY uReplicator/config uReplicator/config
COPY confd-new /etc/confd
# I use confd to generate the configuration files in the container
COPY entrypoint-new.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh && \
    chmod +x /usr/local/bin/confd && \
    chmod +x /uReplicator/bin/*.sh

ENTRYPOINT [ "/entrypoint.sh" ]

Step 3: Prepare the topics and the configuration files:

clusters.properties

kafka.cluster.zkStr.cluster1=src-zk:2181
kafka.cluster.servers.cluster1=src-kafka:9092
kafka.cluster.zkStr.cluster2=dst-zk:2181
kafka.cluster.servers.cluster2=dst-kafka:9092

consumer.properties

zookeeper.connect=src-zk:2181
bootstrap.servers=src-kafka:9092
zookeeper.connection.timeout.ms=30000
zookeeper.session.timeout.ms=30000
group.id=kloak-mirrormaker-test
consumer.id=kloakmms01-sjc1
socket.receive.buffer.bytes=1048576
fetch.message.max.bytes=8388608
queued.max.message.chunks=5
key.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
auto.offset.reset=smallest

dstzk.properties

enable=true
zkServer=src-zk:2181

helix.properties: instanceId is commented out because I'm using more than one Worker

zkServer=src-zk:2181
#instanceId=testHelixMirrorMaker01
helixClusterName=testMirrorMaker
federated.deployment.name=uReplicator-Tests

producer.properties

bootstrap.servers=dst-kafka:9092
client.id=kloak-mirrormaker-test

Step 4: Check the number of messages on both clusters

Src & Dst topic: 600 messages each
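
One way to check the counts, as a sketch (the topic name below is a placeholder; the broker addresses are the ones from clusters.properties):

# Sum the latest offsets across all partitions to approximate the message count
# (assumes a non-compacted test topic; "test-topic" is a placeholder name).
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list src-kafka:9092 --topic test-topic --time -1
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list dst-kafka:9092 --topic test-topic --time -1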

Step 5: Delete one worker pod in Kubernetes with kubectl delete pod .... (no --force flag)

Result: 1 - the old pod is Terminating. Src & Dst keep in sync: 750 / 750 messages on both topics.

2 - a few seconds later: the new pod is up. Src topic: keeps rising to 900 -> 950 -> 1000. Dst topic: stays at 750 messages.

3 - a minute later: Src topic: 1000 messages. Dst topic: more than 1000 messages: sometimes 1200, sometimes 1100-1300, etc. (the result is different in every test).

Note: if no Worker goes down, the replication result is perfect. But in this scenario the result is different every time I test, and uReplicator is not stable. @yangy0000 @Pawka @DAddYE @hparra @chirayuk @dentafrice CC to anyone who has contributed to this application. Please let me know / correct me if something went wrong.

Technoboy- commented 5 years ago

Could you please paste some more of the worker log?

dungnt081191 commented 5 years ago

Hi @Technoboy-
Yes, okay, but there is a lot of log output during that time. As I said above, the duplicate messages happen while scaling down the number of worker pods (while one is terminating). Here is the log from scaling the pods down 2 -> 1.

Log before scale down:

[2019-10-29 02:35:13,758] INFO msg: 79473479-d226-4447-b8f9-f07a475f0a65 handling task completed, results:true, at: 1572316513758, took:306 (org.apache.helix.messaging.handling.HelixTask:170)
[2019-10-29 02:35:13,760] INFO Disconnecting from 10.111.176.23:9092 (kafka.producer.SyncProducer:66)
[2019-10-29 02:35:13,761] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Try find leader for topic: ###########5, partition:0 (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 02:35:13,761] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Try find leader for topic: ###########4, partition:2 (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 02:35:13,761] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Try find leader for topic: ###########3, partition:0 (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 02:35:13,762] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Try find leader for topic: ###########2, partition:0 (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 02:35:13,762] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Try find leader for topic: ###########, partition:0 (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 02:35:13,762] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Find leader for partitions finished, took: 169 ms (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
2019-10-29T02:35:13.763+0000: Total time for which application threads were stopped: 0.0002628 seconds, Stopping threads took: 0.0000841 seconds
2019-10-29T02:35:13.763+0000: Total time for which application threads were stopped: 0.0002773 seconds, Stopping threads took: 0.0000482 seconds
2019-10-29T02:35:13.766+0000: Total time for which application threads were stopped: 0.0001978 seconds, Stopping threads took: 0.0000240 seconds
2019-10-29T02:35:13.767+0000: Total time for which application threads were stopped: 0.0002218 seconds, Stopping threads took: 0.0000408 seconds
2019-10-29T02:35:13.767+0000: Total time for which application threads were stopped: 0.0002151 seconds, Stopping threads took: 0.0000431 seconds
[2019-10-29 02:35:13,787] INFO Subcribing to path:/uReplicatorDev/INSTANCES/HelixMirrorMaker-1572271233140/MESSAGES took:79 (org.apache.helix.manager.zk.CallbackHandler:356)
2019-10-29T02:35:13.809+0000: Total time for which application threads were stopped: 0.0009120 seconds, Stopping threads took: 0.0000556 seconds
[2019-10-29 02:35:13,827] INFO No Messages to process (org.apache.helix.messaging.handling.HelixTaskExecutor:619)
[2019-10-29 02:35:13,827] INFO 20 END:INVOKE /uReplicatorDev/INSTANCES/HelixMirrorMaker-1572271233140/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor Took: 119ms (org.apache.helix.manager.zk.CallbackHandler:266)
[2019-10-29 02:35:13,852] INFO [CompactConsumerFetcherManager-1572271233359] Fetcher Thread for topic partitions: ArrayBuffer([###########4-2, InitialOffset -1] , [###########3-0, InitialOffset 277] , [###########5-0, InitialOffset 91470] , [###########2-0, InitialOffset -1] , [###########-0, InitialOffset 6621] ) is CompactConsumerFetcherThread-kloak-mirrormaker-dc-test2_kloakmms01-dc-0-0 (kafka.mirrormaker.CompactConsumerFetcherManager:66)
[2019-10-29 02:35:13,852] INFO [CompactConsumerFetcherManager-1572271233359] Added fetcher for partitions ArrayBuffer([###########4-2, initOffset -1 to broker BrokerEndPoint(0,10.111.176.23,9092)] , [###########3-0, initOffset 277 to broker BrokerEndPoint(0,10.111.176.23,9092)] , [###########5-0, initOffset 91470 to broker BrokerEndPoint(0,10.111.176.23,9092)] , [###########2-0, initOffset -1 to broker BrokerEndPoint(0,10.111.176.23,9092)] , [###########-0, initOffset 6621 to broker BrokerEndPoint(0,10.111.176.23,9092)] ) (kafka.mirrormaker.CompactConsumerFetcherManager:66)
2019-10-29T02:35:13.894+0000: Total time for which application threads were stopped: 0.0004547 seconds, Stopping threads took: 0.0000734 seconds
[2019-10-29 02:35:13,914] INFO 20 START:INVOKE /uReplicatorDev/INSTANCES/HelixMirrorMaker-1572271233140/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor (org.apache.helix.manager.zk.CallbackHandler:194)
[2019-10-29 02:35:13,914] INFO HelixMirrorMaker-1572271233140 subscribes child-change. path: /uReplicatorDev/INSTANCES/HelixMirrorMaker-1572271233140/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@4738a206 (org.apache.helix.manager.zk.CallbackHandler:275)
[2019-10-29 02:35:13,993] INFO Subcribing to path:/uReplicatorDev/INSTANCES/HelixMirrorMaker-1572271233140/MESSAGES took:79 (org.apache.helix.manager.zk.CallbackHandler:356)
[2019-10-29 02:35:14,034] INFO No Messages to process (org.apache.helix.messaging.handling.HelixTaskExecutor:619)
[2019-10-29 02:35:14,034] INFO 20 END:INVOKE /uReplicatorDev/INSTANCES/HelixMirrorMaker-1572271233140/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor Took: 120ms (org.apache.helix.manager.zk.CallbackHandler:266)
2019-10-29T02:35:14.895+0000: Total time for which application threads were stopped: 0.0004136 seconds, Stopping threads took: 0.0001164 seconds
2019-10-29T02:35:15.895+0000: Total time for which application threads were stopped: 0.0003663 seconds, Stopping threads took: 0.0000618 seconds
2019-10-29T02:35:18.896+0000: Total time for which application threads were stopped: 0.0003273 seconds, Stopping threads took: 0.0000600 seconds
2019-10-29T02:35:19.462+0000: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 4194304 bytes, new threshold 15 (max 15)
- age   1:      53688 bytes,      53688 total
- age   2:      51408 bytes,     105096 total
- age   3:      48800 bytes,     153896 total
- age   4:      46680 bytes,     200576 total
- age   5:      47976 bytes,     248552 total
- age   6:      48464 bytes,     297016 total
- age   7:      48192 bytes,     345208 total
- age   8:     243256 bytes,     588464 total
- age   9:     271504 bytes,     859968 total
- age  10:      17040 bytes,     877008 total
- age  11:      17016 bytes,     894024 total
- age  12:      17688 bytes,     911712 total
- age  13:      22096 bytes,     933808 total
- age  14:      16944 bytes,     950752 total
- age  15:      17064 bytes,     967816 total
 45287.533: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 2120, predicted base time: 8.23 ms, remaining time: 191.77 ms, target pause time: 200.00 ms]
 45287.533: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1 regions, survivors: 1 regions, predicted young region time: 0.52 ms]
 45287.533: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1 regions, survivors: 1 regions, old: 0 regions, predicted pause time: 8.75 ms, target pause time: 200.00 ms]
2019-10-29T02:35:19.468+0000: [SoftReference, 0 refs, 0.0009323 secs]2019-10-29T02:35:19.469+0000: [WeakReference, 0 refs, 0.0002444 secs]2019-10-29T02:35:19.469+0000: [FinalReference, 0 refs, 0.0002698 secs]2019-10-29T02:35:19.469+0000: [PhantomReference, 0 refs, 0 refs, 0.0003146 secs]2019-10-29T02:35:19.469+0000: [JNI Weak Reference, 0.0000225 secs], 0.0075687 secs]
   [Parallel Time: 4.9 ms, GC Workers: 2]

Log during pod scale down:


- age   3:      17808 bytes,      83344 total
- age   4:      18048 bytes,     101392 total
- age   5:      18344 bytes,     119736 total
- age   6:      17296 bytes,     137032 total
- age   7:      17864 bytes,     154896 total
- age   8:      17952 bytes,     172848 total
- age   9:      17760 bytes,     190608 total
- age  10:      21280 bytes,     211888 total
- age  11:      17232 bytes,     229120 total
- age  12:      17088 bytes,     246208 total
- age  13:      17472 bytes,     263680 total
- age  14:      17592 bytes,     281272 total
- age  15:      17352 bytes,     298624 total
 46039.679: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 1857, predicted base time: 9.85 ms, remaining time: 190.15 ms, target pause time: 200.00 ms]
 46039.679: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1 regions, survivors: 1 regions, predicted young region time: 0.19 ms]
 46039.679: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1 regions, survivors: 1 regions, old: 0 regions, predicted pause time: 10.04 ms, target pause time: 200.00 ms]
2019-10-29T02:51:41.671+0000: [SoftReference, 0 refs, 0.0001568 secs]2019-10-29T02:51:41.671+0000: [WeakReference, 0 refs, 0.0001058 secs]2019-10-29T02:51:41.671+0000: [FinalReference, 0 refs, 0.0000782 secs]2019-10-29T02:51:41.671+0000: [PhantomReference, 0 refs, 0 refs, 0.0001795 secs]2019-10-29T02:51:41.671+0000: [JNI Weak Reference, 0.0000160 secs], 0.0046632 secs]
   [Parallel Time: 3.6 ms, GC Workers: 2]
      [GC Worker Start (ms): Min: 46039679.5, Avg: 46039679.6, Max: 46039679.6, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 1.0, Avg: 1.1, Max: 1.2, Diff: 0.1, Sum: 2.2]
      [Update RS (ms): Min: 1.4, Avg: 1.4, Max: 1.4, Diff: 0.0, Sum: 2.8]
         [Processed Buffers: Min: 3, Avg: 5.5, Max: 8, Diff: 5, Sum: 11]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.9, Avg: 0.9, Max: 0.9, Diff: 0.0, Sum: 1.9]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [GC Worker Total (ms): Min: 3.4, Avg: 3.4, Max: 3.5, Diff: 0.1, Sum: 6.9]
      [GC Worker End (ms): Min: 46039683.0, Avg: 46039683.0, Max: 46039683.0, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.0 ms]
   [Other: 1.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.7 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.0 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 8192.0K(8192.0K)->0.0B(8192.0K) Survivors: 8192.0K->8192.0K Heap: 126.4M(144.0M)->118.6M(144.0M)]
 [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-10-29T02:51:41.672+0000: Total time for which application threads were stopped: 0.0052173 seconds, Stopping threads took: 0.0000581 seconds
rpc error: code = Unknown desc = Error: No such container: 0b6056d43018c07310ccc8a62a644b6f2badbf5519c02b09b181f8510b23ee43
xhl1988 commented 5 years ago

Can you provide all the logs (at least from the beginning of shutdown) on the worker being terminated? Very likely it is not shutting down gracefully and committing the offsets.

dungnt081191 commented 5 years ago

Hi @xhl1988
The worker logs above are from before the scale-down and during the scale-down of the worker. Below is the controller log.

How can I check whether it shut down gracefully? I wrote a Kubernetes deployment file and scale up / down by updating the replicas field, from 1 -> 4, 4 -> 3, etc. Do I have to send an API shutdown request or kill -9 the process in the pod before scaling down? My k8s deployment file:

apiVersion: apps/v1beta2 # for versions before 1.8.0 use apps/v1beta1
kind: Deployment
metadata:
  name: dc-nonprod-ureplicator-controller
  namespace: ureplicator
  labels:
    app: dc-nonprod-ureplicator
    component: controller
spec:
  replicas: 1
  progressDeadlineSeconds: 120
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # how many pods we can add at a time
      maxUnavailable: 0  # how many pods can be unavailable during the rolling update
  selector:
    matchLabels:
      app: dc-nonprod-ureplicator
      component: controller
  template:
    metadata:
      labels:
        app: dc-nonprod-ureplicator
        component: controller
    spec:
      containers:
      - name: dc-nonprod-ureplicator-controller
        image: dungnt081191/ureplicator:0.32
        imagePullPolicy: Always
        env:
        - name: SERVICE_TYPE
          value: "controller"
        envFrom:
        - configMapRef:
            name: dc-nonprod-ureplicator-envs
        ports:
        - name: api-port
          containerPort: 9000
        resources:
          requests:
            cpu: 500m
            memory: 1500Mi
          # limits:
          #   cpu: 1000m
          #   memory: 3000Mi
        volumeMounts:
        - name: ureplicator-key
          mountPath: "/uReplicator/bin/kafka-truststore.jks"
          subPath: kafka-truststore.jks
          readOnly: true
        - name: ureplicator-key
          mountPath: "/uReplicator/bin/manager.jks"
          subPath: manager.jks
          readOnly: true
      restartPolicy: Always
      volumes:
      - name: ureplicator-key
        secret:
          secretName: ureplicator-key
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: dc-nonprod-ureplicator-worker
  namespace: ureplicator
  labels:
    app: dc-nonprod-ureplicator
    component: worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dc-nonprod-ureplicator
      component: worker
  template:
    metadata:
      labels:
        app: dc-nonprod-ureplicator
        component: worker
    spec:
      containers:
      - name: dc-nonprod-ureplicator-worker
        image: dungnt081191/ureplicator:0.32
        env:
        - name: SERVICE_TYPE
          value: "worker"
        envFrom:
        - configMapRef:
            name: dc-nonprod-ureplicator-envs
        resources:
          requests:
            cpu: 100m
            memory: 1000Mi
          # limits:
          #   cpu: 1000m
          #   memory: 3000Mi
        volumeMounts:
        - name: ureplicator-key
          mountPath: "/uReplicator/bin/kafka-truststore.jks"
          subPath: kafka-truststore.jks
          readOnly: true
        - name: ureplicator-key
          mountPath: "/uReplicator/bin/manager.jks"
          subPath: manager.jks
          readOnly: true
      volumes:
      - name: ureplicator-key
        secret:
          secretName: ureplicator-key

Log of Controller

[2019-10-29 03:24:34,700] INFO added 1 new topics to topic list in zk child change (com.uber.stream.kafka.mirrormaker.common.core.KafkaBrokerTopicObserver:141)
2019-10-29T03:24:41.725+0000: Total time for which application threads were stopped: 0.0008572 seconds, Stopping threads took: 0.0001218 seconds
[2019-10-29 03:24:41,724] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:24:43,937] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:24:43,937] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:24:51,674] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83)
[2019-10-29 03:24:51,713] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:24:53,894] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:24:53,894] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:24:55,881] WARN Got exception when removing metrics for externalView.topicPartitions.HelixMirrorMaker-1572270513534.totalNumber (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:319)
java.lang.IllegalStateException: Not initialized yet
    at com.google.common.base.Preconditions.checkState(Preconditions.java:459)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.checkState(KafkaUReplicatorMetricsReporter.java:207)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.get(KafkaUReplicatorMetricsReporter.java:178)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.updatePerWorkerEVMetrics(ValidationManager.java:317)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.validateExternalView(ValidationManager.java:210)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager$1.run(ValidationManager.java:84)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-10-29T03:24:55.882+0000: Total time for which application threads were stopped: 0.0006868 seconds, Stopping threads took: 0.0000676 seconds
[2019-10-29 03:24:55,881] WARN Got exception when removing metrics for externalView.topicPartitions.HelixMirrorMaker-1572271463233.totalNumber (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:319)
java.lang.IllegalStateException: Not initialized yet
    at com.google.common.base.Preconditions.checkState(Preconditions.java:459)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.checkState(KafkaUReplicatorMetricsReporter.java:207)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.get(KafkaUReplicatorMetricsReporter.java:178)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.updatePerWorkerEVMetrics(ValidationManager.java:317)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.validateExternalView(ValidationManager.java:210)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager$1.run(ValidationManager.java:84)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2019-10-29 03:24:55,882] WARN Got exception when removing metrics for externalView.topicPartitions.HelixMirrorMaker-1572270513653.totalNumber (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:319)
java.lang.IllegalStateException: Not initialized yet
    at com.google.common.base.Preconditions.checkState(Preconditions.java:459)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.checkState(KafkaUReplicatorMetricsReporter.java:207)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.get(KafkaUReplicatorMetricsReporter.java:178)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.updatePerWorkerEVMetrics(ValidationManager.java:317)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.validateExternalView(ValidationManager.java:210)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager$1.run(ValidationManager.java:84)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2019-10-29 03:25:01,714] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:25:03,892] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:25:03,893] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:25:11,713] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:25:13,901] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:25:13,901] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:25:21,713] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:25:23,910] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:25:23,910] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:25:31,713] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:25:33,894] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:25:33,895] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:25:41,714] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:25:43,897] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:25:43,898] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:25:51,673] INFO Trying to run the validation job (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:83)
[2019-10-29 03:25:51,712] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:25:53,888] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:25:53,888] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:25:55,873] WARN Got exception when removing metrics for externalView.topicPartitions.HelixMirrorMaker-1572270513534.totalNumber (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:319)
java.lang.IllegalStateException: Not initialized yet
    at com.google.common.base.Preconditions.checkState(Preconditions.java:459)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.checkState(KafkaUReplicatorMetricsReporter.java:207)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.get(KafkaUReplicatorMetricsReporter.java:178)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.updatePerWorkerEVMetrics(ValidationManager.java:317)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.validateExternalView(ValidationManager.java:210)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager$1.run(ValidationManager.java:84)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2019-10-29 03:25:55,873] WARN Got exception when removing metrics for externalView.topicPartitions.HelixMirrorMaker-1572271463233.totalNumber (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:319)
java.lang.IllegalStateException: Not initialized yet
    at com.google.common.base.Preconditions.checkState(Preconditions.java:459)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.checkState(KafkaUReplicatorMetricsReporter.java:207)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.get(KafkaUReplicatorMetricsReporter.java:178)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.updatePerWorkerEVMetrics(ValidationManager.java:317)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.validateExternalView(ValidationManager.java:210)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager$1.run(ValidationManager.java:84)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2019-10-29 03:25:55,873] WARN Got exception when removing metrics for externalView.topicPartitions.HelixMirrorMaker-1572270513653.totalNumber (com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager:319)
java.lang.IllegalStateException: Not initialized yet
    at com.google.common.base.Preconditions.checkState(Preconditions.java:459)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.checkState(KafkaUReplicatorMetricsReporter.java:207)
    at com.uber.stream.ureplicator.common.KafkaUReplicatorMetricsReporter.get(KafkaUReplicatorMetricsReporter.java:178)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.updatePerWorkerEVMetrics(ValidationManager.java:317)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager.validateExternalView(ValidationManager.java:210)
    at com.uber.stream.kafka.mirrormaker.controller.validation.ValidationManager$1.run(ValidationManager.java:84)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2019-10-29 03:26:01,713] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:26:03,895] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:26:03,896] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:26:11,713] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:26:13,899] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:26:13,899] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:26:21,714] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:26:23,894] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:26:23,894] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:26:31,713] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:26:33,904] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:26:33,904] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
[2019-10-29 03:26:41,713] INFO Trying to run topic whitelisting job (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:124)
[2019-10-29 03:26:43,898] INFO BlacklistedTopics=[] and ExcludingPattern=__consumer_offsets (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:171)
[2019-10-29 03:26:43,899] INFO Trying to whitelist topics: [] (com.uber.stream.kafka.mirrormaker.controller.core.AutoTopicWhitelistingManager:132)
Technoboy- commented 5 years ago

Search the worker log for key words like "Start clean shutdown" or "Flushing last batches and commit offsets", which indicate a graceful shutdown. Under a graceful shutdown the message offsets will be committed. Otherwise, you may be killing the pod in an incorrect manner.
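
For example, assuming the worker runs in the ureplicator namespace (the pod name is a placeholder):

kubectl logs <worker-pod-name> -n ureplicator | grep -E "Start clean shutdown|Flushing last batches and commit offsets"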

dungnt081191 commented 5 years ago

@Technoboy- Sorry, how do I stop the pod correctly so the offsets are committed before it shuts down? Do I have to include another command in the startController/Worker sh command?

Technoboy- commented 5 years ago

Reference: https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html. You had better kill the worker process from a shell before killing the pod.
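
A minimal sketch of wiring that into the worker Deployment with a preStop hook (the inline kill command and the grace period are assumptions, not something uReplicator ships; indentation matches the container spec above):

        # preStop runs before Kubernetes sends SIGTERM to the container's PID 1,
        # giving the worker time to flush batches and commit offsets.
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "kill $(pgrep -f 'Dapp_name=uReplicator-Worker') || true; sleep 30"]
      # pod-level field: allow enough time for the clean shutdown before SIGKILL
      terminationGracePeriodSeconds: 60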

dungnt081191 commented 5 years ago

@Technoboy-

Not working. I tried this one:

#!/bin/sh -e

PID=$(pgrep -f "Dapp_name=uReplicator-Worker" || true)
if [ -n "$PID" ]; then
 kill -9 ${PID}
fi

And the pod log when the preStop hook ran from Kubernetes:

[2019-10-29 06:59:50,088] INFO Fetching metadata from broker BrokerEndPoint(0,10.111.176.23,9092) with correlation id 5 for 3 topic(s) Set(##########, ###############################, #################) (kafka.client.ClientUtils$:66)
[2019-10-29 06:59:50,117] INFO 20 START:INVOKE /uReplicatorDev/INSTANCES/HelixMirrorMaker-1572332239423/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor (org.apache.helix.manager.zk.CallbackHandler:194)
[2019-10-29 06:59:50,118] INFO HelixMirrorMaker-1572332239423 subscribes child-change. path: /uReplicatorDev/INSTANCES/HelixMirrorMaker-1572332239423/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@4738a206 (org.apache.helix.manager.zk.CallbackHandler:275)
[2019-10-29 06:59:50,132] INFO Connected to 10.111.176.23:9092 for producing (kafka.producer.SyncProducer:66)
[2019-10-29 06:59:50,172] INFO Disconnecting from 10.111.176.23:9092 (kafka.producer.SyncProducer:66)
[2019-10-29 06:59:50,175] INFO [kloak-mirrormaker--dc-test2_kloakmms01--dc-leader-finder-thread]: Try find leader for topic: ##########, partition:0 (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 06:59:50,177] INFO [kloak-mirrormaker--dc-test2_kloakmms01--dc-leader-finder-thread]: Try find leader for topic: ###############################, partition:0 (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 06:59:50,177] INFO [kloak-mirrormaker--dc-test2_kloakmms01--dc-leader-finder-thread]: Try find leader for topic: #################, partition:0 (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 06:59:50,178] INFO [kloak-mirrormaker--dc-test2_kloakmms01--dc-leader-finder-thread]: Find leader for partitions finished, took: 176 ms (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 06:59:50,201] INFO Subcribing to path:/uReplicatorDev/INSTANCES/HelixMirrorMaker-1572332239423/MESSAGES took:83 (org.apache.helix.manager.zk.CallbackHandler:356)
[2019-10-29 06:59:50,241] INFO No Messages to process (org.apache.helix.messaging.handling.HelixTaskExecutor:619)
[2019-10-29 06:59:50,241] INFO 20 END:INVOKE /uReplicatorDev/INSTANCES/HelixMirrorMaker-1572332239423/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor Took: 124ms (org.apache.helix.manager.zk.CallbackHandler:266)
[2019-10-29 06:59:50,303] INFO [CompactConsumerFetcherManager-1572332239637] Fetcher Thread for topic partitions: ArrayBuffer([##########-0, InitialOffset 679] , [#################-0, InitialOffset -1] , [###############################-0, InitialOffset -1] ) is CompactConsumerFetcherThread-kloak-mirrormaker--dc-test2_kloakmms01--dc-0-0 (kafka.mirrormaker.CompactConsumerFetcherManager:66)
[2019-10-29 06:59:50,311] INFO [CompactConsumerFetcherManager-1572332239637] Added fetcher for partitions ArrayBuffer([##########-0, initOffset 679 to broker BrokerEndPoint(0,10.111.176.23,9092)] , [#################-0, initOffset -1 to broker BrokerEndPoint(0,10.111.176.23,9092)] , [###############################-0, initOffset -1 to broker BrokerEndPoint(0,10.111.176.23,9092)] ) (kafka.mirrormaker.CompactConsumerFetcherManager:66)
2019-10-29T06:59:50.849+0000: Total time for which application threads were stopped: 0.0003164 seconds, Stopping threads took: 0.0000538 seconds
2019-10-29T06:59:51.850+0000: Total time for which application threads were stopped: 0.0003843 seconds, Stopping threads took: 0.0000568 seconds
2019-10-29T06:59:52.851+0000: Total time for which application threads were stopped: 0.0007836 seconds, Stopping threads took: 0.0000786 seconds
2019-10-29T06:59:53.851+0000: Total time for which application threads were stopped: 0.0002985 seconds, Stopping threads took: 0.0000498 seconds
2019-10-29T06:59:54.851+0000: Total time for which application threads were stopped: 0.0002572 seconds, Stopping threads took: 0.0000433 seconds
2019-10-29T06:59:55.852+0000: Total time for which application threads were stopped: 0.0003113 seconds, Stopping threads took: 0.0000635 seconds
2019-10-29T06:59:56.852+0000: Total time for which application threads were stopped: 0.0003098 seconds, Stopping threads took: 0.0000564 seconds
2019-10-29T06:59:57.853+0000: Total time for which application threads were stopped: 0.0004281 seconds, Stopping threads took: 0.0000470 seconds
2019-10-29T06:59:58.853+0000: Total time for which application threads were stopped: 0.0003280 seconds, Stopping threads took: 0.0001100 seconds
2019-10-29T06:59:59.854+0000: Total time for which application threads were stopped: 0.0003383 seconds, Stopping threads took: 0.0000426 seconds
2019-10-29T07:00:00.854+0000: Total time for which application threads were stopped: 0.0002815 seconds, Stopping threads took: 0.0000482 seconds
2019-10-29T07:00:01.855+0000: Total time for which application threads were stopped: 0.0002605 seconds, Stopping threads took: 0.0000483 seconds
2019-10-29T07:00:02.855+0000: Total time for which application threads were stopped: 0.0003141 seconds, Stopping threads took: 0.0001119 seconds
2019-10-29T07:00:03.856+0000: Total time for which application threads were stopped: 0.0004248 seconds, Stopping threads took: 0.0001298 seconds
2019-10-29T07:00:05.857+0000: Total time for which application threads were stopped: 0.0003363 seconds, Stopping threads took: 0.0000596 seconds
2019-10-29T07:00:07.857+0000: Total time for which application threads were stopped: 0.0004193 seconds, Stopping threads took: 0.0000611 seconds
2019-10-29T07:00:08.858+0000: Total time for which application threads were stopped: 0.0003288 seconds, Stopping threads took: 0.0000555 seconds
2019-10-29T07:00:09.858+0000: Total time for which application threads were stopped: 0.0003776 seconds, Stopping threads took: 0.0001157 seconds
2019-10-29T07:00:10.859+0000: Total time for which application threads were stopped: 0.0003137 seconds, Stopping threads took: 0.0000682 seconds
2019-10-29T07:00:11.859+0000: Total time for which application threads were stopped: 0.0003409 seconds, Stopping threads took: 0.0001035 seconds
2019-10-29T07:00:12.860+0000: Total time for which application threads were stopped: 0.0003121 seconds, Stopping threads took: 0.0000479 seconds
2019-10-29T07:00:13.860+0000: Total time for which application threads were stopped: 0.0002412 seconds, Stopping threads took: 0.0000454 seconds
2019-10-29T07:00:14.861+0000: Total time for which application threads were stopped: 0.0003741 seconds, Stopping threads took: 0.0000601 seconds
2019-10-29T07:00:17.862+0000: Total time for which application threads were stopped: 0.0003564 seconds, Stopping threads took: 0.0001096 seconds
2019-10-29T07:00:18.862+0000: Total time for which application threads were stopped: 0.0004468 seconds, Stopping threads took: 0.0000610 seconds
2019-10-29T07:00:20.863+0000: Total time for which application threads were stopped: 0.0002459 seconds, Stopping threads took: 0.0000561 seconds
2019-10-29T07:00:21.863+0000: Total time for which application threads were stopped: 0.0003930 seconds, Stopping threads took: 0.0001232 seconds
/entrypoint.sh: line 31:    13 Killed                  ./start-worker.sh --consumer.config /uReplicator/config/consumer.properties --producer.config /uReplicator/config/producer.properties --helix.config /uReplicator/config/helix.properties --topic.mappings /uReplicator/config/topicmapping.properties
rpc error: code = Unknown desc = Error: No such container: 2a6aa6c3c227b92a8c246e8a319a1a5f4997c5748128307016617b6cca46d905
Technoboy- commented 5 years ago

If the shell had found the worker process and killed it, you would see the key words I mentioned above. So I think the shell doesn't find the worker process correctly. Did you specify the startup param app_name?

dungnt081191 commented 5 years ago

@Technoboy- No, I did not. My worker start script is:

if [ "${SERVICE_TYPE}" == "controller" ] ; then
  ./start-controller.sh \
    -port 9000 \
    -helixClusterName "${HELIX_CLUSTER_NAME}" \
    -refreshTimeInSeconds 10 \
    -enableAutoWhitelist true \
    -srcKafkaZkPath "${SRC_ZK_CONNECT}" \
    -destKafkaZkPath "${DST_ZK_CONNECT}" \
    -zookeeper "${SRC_ZK_CONNECT}"

elif [ "${SERVICE_TYPE}" == "worker" ] ; then
  ./start-worker.sh \
    --consumer.config /uReplicator/config/consumer.properties \
    --producer.config /uReplicator/config/producer.properties \
    --helix.config /uReplicator/config/helix.properties \
    --topic.mappings /uReplicator/config/topicmapping.properties
fi

with no app_name specified.

My kill-process script is:

#!/bin/sh -e

PID=$(pgrep -f "Dapp_name=uReplicator-Worker" || true)
if [ -n "$PID" ]; then
 kill -9 ${PID}
fi

If I manually check the process IDs in the pod, here is the result:

root@ureplicator-worker-7fdffbb6f5-nm5wj:/# ps -ef 
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 07:25 ?        00:00:00 /bin/bash -ex /entrypoint.sh
root          15       1  2 07:25 ?        00:00:19 /usr/local/openjdk-8/bin/java -XX:ParallelGCThreads=2 -Dapp_name=uReplicator-Worker -cp uReplicator-Distribution/target/uReplicator-Distribution-0.1-SN
root         117       0  0 07:34 ?        00:00:00 bash
root         150       0  0 07:39 ?        00:00:00 bash
root         158     150  0 07:39 ?        00:00:00 ps -ef

And I kill the Java process that has Dapp_name=uReplicator-Worker.

dungnt081191 commented 5 years ago

@Technoboy- Hey, the problem is kill vs. kill -9 on the process. I had been doing it with kill -9.

I've tried with 'kill processId' and here is the result: the commit-offsets log appears, and it works!

2019-10-29T07:45:23.181+0000: Total time for which application threads were stopped: 0.0002563 seconds, Stopping threads took: 0.0000400 seconds
2019-10-29T07:45:40.183+0000: Total time for which application threads were stopped: 0.0001666 seconds, Stopping threads took: 0.0000360 seconds
[2019-10-29 07:45:45,083] INFO Start clean shutdown. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-29 07:45:45,086] INFO Shutting down consumer thread. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-29 07:45:45,089] INFO [mirrormaker-thread-HelixMirrorMaker-1572335099430] mirrormaker-thread-HelixMirrorMaker-1572335099430 shutting down (kafka.mirrormaker.WorkerInstance$MirrorMakerThread:66)
2019-10-29T07:45:45.184+0000: Total time for which application threads were stopped: 0.0001369 seconds, Stopping threads took: 0.0000343 seconds
2019-10-29T07:45:51.185+0000: Total time for which application threads were stopped: 0.0001361 seconds, Stopping threads took: 0.0000360 seconds
[2019-10-29 07:45:51,811] INFO [mirrormaker-thread-HelixMirrorMaker-1572335099430] Mirror maker thread stopped (kafka.mirrormaker.WorkerInstance$MirrorMakerThread:66)
[2019-10-29 07:45:51,811] INFO [mirrormaker-thread-HelixMirrorMaker-1572335099430] Mirror maker thread shutdown complete (kafka.mirrormaker.WorkerInstance$MirrorMakerThread:66)
[2019-10-29 07:45:51,811] INFO Flushing last batches and commit offsets. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-29 07:45:51,812] INFO Flushing producer. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-29 07:45:51,816] INFO Committing offsets. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-29 07:45:51,823] INFO Shutting down consumer connectors. (kafka.mirrormaker.WorkerInstance:66)
[2019-10-29 07:45:51,823] INFO Connector is now shutting down! (kafka.mirrormaker.KafkaConnector:66)
[2019-10-29 07:45:51,838] INFO [CompactConsumerFetcherManager-1572335099655] Stopping leader finder thread (kafka.mirrormaker.CompactConsumerFetcherManager:66)
[2019-10-29 07:45:51,839] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Shutting down (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 07:45:51,839] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Stopped (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 07:45:51,839] INFO [kloak-mirrormaker-dc-test2_kloakmms01-dc-leader-finder-thread]: Shutdown completed (kafka.mirrormaker.CompactConsumerFetcherManager$LeaderFinderThread:66)
[2019-10-29 07:45:51,840] INFO [CompactConsumerFetcherManager-1572335099655] Stopping all fetchers (kafka.mirrormaker.CompactConsumerFetcherManager:66)
[2019-10-29 07:45:51,844] INFO [CompactConsumerFetcherManager-1572335099655] All connections stopped (kafka.mirrormaker.CompactConsumerFetcherManager:66)
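
For reference, a sketch of the adjusted preStop script using a plain kill (SIGTERM) and waiting for the worker to exit (the 60-second timeout is an arbitrary choice):

#!/bin/sh -e
# Send SIGTERM (not SIGKILL) so the worker can flush batches and commit offsets.
PID=$(pgrep -f "Dapp_name=uReplicator-Worker" || true)
if [ -n "$PID" ]; then
  kill "$PID"
  # Wait up to 60s for the clean shutdown to complete before the pod is removed.
  for i in $(seq 1 60); do
    kill -0 "$PID" 2>/dev/null || exit 0
    sleep 1
  done
fi
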
Technoboy- commented 5 years ago

Kill the process (not kill -9).

Technoboy- commented 5 years ago

Yes, perfect.

dungnt081191 commented 5 years ago

@Technoboy- In another case, should I use 2 Controllers or only 1 Controller?

Technoboy- commented 5 years ago

In non-federated mode, more than 1 controller keeps your system highly available. If one controller goes down, another will take over the leadership.

dungnt081191 commented 5 years ago

@Technoboy- But I'm hitting this trouble again: duplicate messages when I have 2 Controllers. I already stop the Controller gracefully, the same way I did with the Worker (kill processId).

Technoboy- commented 5 years ago

The worker is responsible for syncing messages; the controller is responsible for managing workers. So duplicate messages can only originate in the worker.

dungnt081191 commented 5 years ago

Hmmmm. It happens when the Controller starts replicating with autowhitelist = true. So strange: I don't even need to wait for the Controller to shut down.

dungnt081191 commented 5 years ago

@Technoboy-
Do you know what happens when we stop both the Controller and the Worker at the same time, with all pods stopped gracefully? Will the topics be replicated with the correct number of messages or not?

Technoboy- commented 5 years ago

Yes, they will.

dungnt081191 commented 5 years ago

@Technoboy- Nope; after restarting the Kubernetes node, the Controller does not work until I manually delete the Controller pod.