linkedin / databus

Source-agnostic distributed change data capture system
Apache License 2.0

Issue with Cluster Client when a new client comes up #35

Closed jagadeesh-huliyar closed 9 years ago

jagadeesh-huliyar commented 10 years ago

Hi

I am using cluster clients. I have 6 partitions. When the first client is started it receives events for all partitions. When I start the second client, client one gets a shutdown request for 3 partitions and the relay pullers for those are shut down. When I check in ZooKeeper it shows 3 partitions online for the first client.

However, when events are produced for the partitions assigned to the first client, it does not get them. The second client is not getting these events either.

It works sometimes and sometimes it does not. Sometimes when the second client comes up the partitions are divided cleanly and both clients receive the events pertaining to their partitions. However, many times I see the erroneous behavior described above.

Below are the logs from the first client that appear after the second client is started.

What I have observed is that the auto-rebalance log messages are not seen in the cases where it does not work; only the "shutdown requested" messages are seen.

01:09:56.910 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 143 o.a.helix.manager.zk.CallbackHandler - 33 START:INVOKE /pz_mysql_cluster/CONFIGS/PARTICIPANT listener:org.apache.helix.controller.GenericHelixController 01:09:56.910 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.subscribeChildChange 236 o.a.helix.manager.zk.CallbackHandler - controller_330929239 subscribes child-change. path: /pz_mysql_cluster/CONFIGS/PARTICIPANT, listener: org.apache.helix.controller.GenericHelixController@1c39bf12 01:09:56.921 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.GenericHelixController.onConfigChange 414 o.a.h.c.GenericHelixController - START: GenericClusterController.onConfigChange() 01:09:56.925 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.ReadClusterDataStage.process 41 o.a.h.c.stages.ReadClusterDataStage - START ReadClusterDataStage.process() 01:09:56.936 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.ReadClusterDataStage.process 70 o.a.h.c.stages.ReadClusterDataStage - END ReadClusterDataStage.process(). took: 11 ms 01:09:56.937 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.s.BestPossibleStateCalcStage.process 48 o.a.h.c.s.BestPossibleStateCalcStage - START BestPossibleStateCalcStage.process() 01:09:56.937 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 131 o.a.h.c.rebalancer.AutoRebalancer - currentMapping: {default-resource_5={pg-stage1-11125-416586186=ONLINE}, default-resource_0={pg-stage1-11125-416586186=ONLINE}, default-resource_4={pg-stage1-11125-416586186=ONLINE}, default-resource_3={pg-stage1-11125-416586186=ONLINE}, default-resource_2={pg-stage1-11125-416586186=ONLINE}, default-resource_1={pg-stage1-11125-416586186=ONLINE}} 01:09:56.937 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 132 o.a.h.c.rebalancer.AutoRebalancer - stateCountMap: {ONLINE=1} 01:09:56.938 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 133 o.a.h.c.rebalancer.AutoRebalancer - liveNodes: [pg-stage1-11125-416586186] 01:09:56.938 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 134 o.a.h.c.rebalancer.AutoRebalancer - allNodes: [pg-stage1-11125-416586186, pg-stage2-11125-207100785] 01:09:56.938 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 135 o.a.h.c.rebalancer.AutoRebalancer - maxPartition: 2147483647 01:09:56.938 
[ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.s.AutoRebalanceStrategy.computePartitionAssignment 127 o.a.h.c.s.AutoRebalanceStrategy - orphan = [] 01:09:56.939 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 146 o.a.h.c.rebalancer.AutoRebalancer - newMapping: default-resource, {}{default-resource_0={pg-stage1-11125-416586186=ONLINE}, default-resource_1={pg-stage1-11125-416586186=ONLINE}, default-resource_2={pg-stage1-11125-416586186=ONLINE}, default-resource_3={pg-stage1-11125-416586186=ONLINE}, default-resource_4={pg-stage1-11125-416586186=ONLINE}, default-resource_5={pg-stage1-11125-416586186=ONLINE}}{default-resource_0=[pg-stage1-11125-416586186], default-resource_1=[pg-stage1-11125-416586186], default-resource_2=[pg-stage1-11125-416586186], default-resource_3=[pg-stage1-11125-416586186], default-resource_4=[pg-stage1-11125-416586186], default-resource_5=[pg-stage1-11125-416586186]} 01:09:56.939 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.s.BestPossibleStateCalcStage.process 65 o.a.h.c.s.BestPossibleStateCalcStage - END BestPossibleStateCalcStage.process(). took: 2 ms 01:09:56.940 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.TaskAssignmentStage.process 47 o.a.h.c.stages.TaskAssignmentStage - START TaskAssignmentStage.process() 01:09:56.940 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.TaskAssignmentStage.process 78 o.a.h.c.stages.TaskAssignmentStage - END TaskAssignmentStage.process(). took: 0 ms 01:09:56.940 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.GenericHelixController.onConfigChange 420 o.a.h.c.GenericHelixController - END: GenericClusterController.onConfigChange() 01:09:56.940 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 227 o.a.helix.manager.zk.CallbackHandler - 33 END:INVOKE /pz_mysql_cluster/CONFIGS/PARTICIPANT listener:org.apache.helix.controller.GenericHelixController Took: 30ms 01:09:57.229 [ZkClient-EventThread-24-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 143 o.a.helix.manager.zk.CallbackHandler - 24 START:INVOKE /pz_mysql_cluster/LIVEINSTANCES listener:com.linkedin.databus.cluster.DatabusCluster.DatabusHelixWatcher 01:09:57.229 [ZkClient-EventThread-24-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.subscribeChildChange 236 o.a.helix.manager.zk.CallbackHandler - watcher_pz_mysql_cluster_330929239 subscribes child-change. 
path: /pz_mysql_cluster/LIVEINSTANCES, listener: com.linkedin.databus.cluster.DatabusCluster$DatabusHelixWatcher@590eb535 01:09:57.229 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 143 o.a.helix.manager.zk.CallbackHandler - 33 START:INVOKE /pz_mysql_cluster/LIVEINSTANCES listener:org.apache.helix.controller.GenericHelixController 01:09:57.230 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.subscribeChildChange 236 o.a.helix.manager.zk.CallbackHandler - controller_330929239 subscribes child-change. path: /pz_mysql_cluster/LIVEINSTANCES, listener: org.apache.helix.controller.GenericHelixController@1c39bf12 01:09:57.235 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.GenericHelixController.onLiveInstanceChange 355 o.a.h.c.GenericHelixController - START: Generic GenericClusterController.onLiveInstanceChange() 01:09:57.235 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 143 o.a.helix.manager.zk.CallbackHandler - 33 START:INVOKE /pz_mysql_cluster/INSTANCES/pg-stage2-11125-207100785/CURRENTSTATES/346cfe8bf7b0024 listener:org.apache.helix.controller.GenericHelixController 01:09:57.235 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.subscribeChildChange 236 o.a.helix.manager.zk.CallbackHandler - controller_330929239 subscribes child-change. path: /pz_mysql_cluster/INSTANCES/pg-stage2-11125-207100785/CURRENTSTATES/346cfe8bf7b0024, listener: org.apache.helix.controller.GenericHelixController@1c39bf12 01:09:57.235 [ZkClient-EventThread-24-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 227 o.a.helix.manager.zk.CallbackHandler - 24 END:INVOKE /pz_mysql_cluster/LIVEINSTANCES listener:com.linkedin.databus.cluster.DatabusCluster.DatabusHelixWatcher Took: 6ms 01:09:57.237 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.GenericHelixController.onStateChange 313 o.a.h.c.GenericHelixController - START: GenericClusterController.onStateChange() 01:09:57.238 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.ReadClusterDataStage.process 41 o.a.h.c.stages.ReadClusterDataStage - START ReadClusterDataStage.process() 01:09:57.248 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.ReadClusterDataStage.process 70 o.a.h.c.stages.ReadClusterDataStage - END ReadClusterDataStage.process(). 
took: 10 ms 01:09:57.248 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.s.BestPossibleStateCalcStage.process 48 o.a.h.c.s.BestPossibleStateCalcStage - START BestPossibleStateCalcStage.process() 01:09:57.249 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 131 o.a.h.c.rebalancer.AutoRebalancer - currentMapping: {default-resource_5={pg-stage1-11125-416586186=ONLINE}, default-resource_0={pg-stage1-11125-416586186=ONLINE}, default-resource_4={pg-stage1-11125-416586186=ONLINE}, default-resource_3={pg-stage1-11125-416586186=ONLINE}, default-resource_2={pg-stage1-11125-416586186=ONLINE}, default-resource_1={pg-stage1-11125-416586186=ONLINE}} 01:09:57.249 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 132 o.a.h.c.rebalancer.AutoRebalancer - stateCountMap: {ONLINE=1} 01:09:57.249 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 133 o.a.h.c.rebalancer.AutoRebalancer - liveNodes: [pg-stage1-11125-416586186, pg-stage2-11125-207100785] 01:09:57.249 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 134 o.a.h.c.rebalancer.AutoRebalancer - allNodes: [pg-stage1-11125-416586186, pg-stage2-11125-207100785] 01:09:57.249 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 135 o.a.h.c.rebalancer.AutoRebalancer - maxPartition: 2147483647 01:09:57.250 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.s.AutoRebalanceStrategy.computePartitionAssignment 127 o.a.h.c.s.AutoRebalanceStrategy - orphan = [] 01:09:57.250 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.rebalancer.AutoRebalancer.computeNewIdealState 146 o.a.h.c.rebalancer.AutoRebalancer - newMapping: default-resource, {}{default-resource_0={pg-stage1-11125-416586186=ONLINE}, default-resource_1={pg-stage2-11125-207100785=ONLINE}, default-resource_2={pg-stage1-11125-416586186=ONLINE}, default-resource_3={pg-stage2-11125-207100785=ONLINE}, default-resource_4={pg-stage1-11125-416586186=ONLINE}, default-resource_5={pg-stage2-11125-207100785=ONLINE}}{default-resource_0=[pg-stage1-11125-416586186], default-resource_1=[pg-stage2-11125-207100785], default-resource_2=[pg-stage1-11125-416586186], default-resource_3=[pg-stage2-11125-207100785], default-resource_4=[pg-stage1-11125-416586186], default-resource_5=[pg-stage2-11125-207100785]} 01:09:57.251 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.s.BestPossibleStateCalcStage.process 65 o.a.h.c.s.BestPossibleStateCalcStage - END BestPossibleStateCalcStage.process(). 
took: 3 ms 01:09:57.252 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.TaskAssignmentStage.process 47 o.a.h.c.stages.TaskAssignmentStage - START TaskAssignmentStage.process() 01:09:57.252 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.TaskAssignmentStage.sendMessages 134 o.a.h.c.stages.TaskAssignmentStage - Sending Message 6c1c138c-b333-4f81-921f-96321026748d to pg-stage1-11125-416586186 transit default-resource_3|[] from:ONLINE to:OFFLINE 01:09:57.252 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.TaskAssignmentStage.sendMessages 134 o.a.h.c.stages.TaskAssignmentStage - Sending Message dad1b33a-0f61-48fe-b770-72639e9439e5 to pg-stage1-11125-416586186 transit default-resource_4|[] from:ONLINE to:OFFLINE 01:09:57.252 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.TaskAssignmentStage.sendMessages 134 o.a.h.c.stages.TaskAssignmentStage - Sending Message f1878c09-bcc2-4d3a-a515-54b03f81094c to pg-stage1-11125-416586186 transit default-resource_5|[] from:ONLINE to:OFFLINE 01:09:57.268 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 143 o.a.helix.manager.zk.CallbackHandler - 28 START:INVOKE /pz_mysql_cluster/INSTANCES/pg-stage1-11125-416586186/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor 01:09:57.268 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.subscribeChildChange 236 o.a.helix.manager.zk.CallbackHandler - pg-stage1-11125-416586186 subscribes child-change. path: /pz_mysql_cluster/INSTANCES/pg-stage1-11125-416586186/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@487bd46a 01:09:57.281 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.TaskAssignmentStage.process 78 o.a.h.c.stages.TaskAssignmentStage - END TaskAssignmentStage.process(). took: 29 ms 01:09:57.281 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.s.ExternalViewComputeStage.process 57 o.a.h.c.s.ExternalViewComputeStage - START ExternalViewComputeStage.process() 01:09:57.283 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.s.ExternalViewComputeStage.process 153 o.a.h.c.s.ExternalViewComputeStage - END ExternalViewComputeStage.process(). 
took: 2 ms 01:09:57.283 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.GenericHelixController.onStateChange 320 o.a.h.c.GenericHelixController - END: GenericClusterController.onStateChange() 01:09:57.284 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 227 o.a.helix.manager.zk.CallbackHandler - 33 END:INVOKE /pz_mysql_cluster/INSTANCES/pg-stage2-11125-207100785/CURRENTSTATES/346cfe8bf7b0024 listener:org.apache.helix.controller.GenericHelixController Took: 49ms 01:09:57.284 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.ZKHelixManager.addListener 325 o.a.helix.manager.zk.ZKHelixManager - Added listener: org.apache.helix.controller.GenericHelixController@1c39bf12 for type: CURRENTSTATES to path: /pz_mysql_cluster/INSTANCES/pg-stage2-11125-207100785/CURRENTSTATES/346cfe8bf7b0024 01:09:57.284 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.GenericHelixController.checkLiveInstancesObservation 516 o.a.h.c.GenericHelixController - controller_330929239 added current-state listener for instance: pg-stage2-11125-207100785, session: 346cfe8bf7b0024, listener: org.apache.helix.controller.GenericHelixController@1c39bf12 01:09:57.284 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 143 o.a.helix.manager.zk.CallbackHandler - 33 START:INVOKE /pz_mysql_cluster/INSTANCES/pg-stage2-11125-207100785/MESSAGES listener:org.apache.helix.controller.GenericHelixController 01:09:57.284 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.subscribeChildChange 236 o.a.helix.manager.zk.CallbackHandler - controller_330929239 subscribes child-change. 
path: /pz_mysql_cluster/INSTANCES/pg-stage2-11125-207100785/MESSAGES, listener: org.apache.helix.controller.GenericHelixController@1c39bf12 01:09:57.286 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.GenericHelixController.onMessage 336 o.a.h.c.GenericHelixController - START: GenericClusterController.onMessage() 01:09:57.287 [ZkClient-EventThread-33-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.c.stages.ReadClusterDataStage.process 41 o.a.h.c.stages.ReadClusterDataStage - START ReadClusterDataStage.process() 01:09:57.328 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.m.handling.HelixTaskExecutor.scheduleTask 229 o.a.h.m.handling.HelixTaskExecutor - Scheduling message: 6c1c138c-b333-4f81-921f-96321026748d 01:09:57.329 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.m.handling.HelixTaskExecutor.scheduleTask 257 o.a.h.m.handling.HelixTaskExecutor - Message: 6c1c138c-b333-4f81-921f-96321026748d handling task scheduled 01:09:57.329 [pool-6-thread-7] INFO o.a.h.messaging.handling.HelixTask.call 78 o.a.h.messaging.handling.HelixTask - handling task: 6c1c138c-b333-4f81-921f-96321026748d begin, at: 1403725197329 01:09:57.329 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.m.handling.HelixTaskExecutor.scheduleTask 229 o.a.h.m.handling.HelixTaskExecutor - Scheduling message: dad1b33a-0f61-48fe-b770-72639e9439e5 01:09:57.330 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.m.handling.HelixTaskExecutor.scheduleTask 257 o.a.h.m.handling.HelixTaskExecutor - Message: dad1b33a-0f61-48fe-b770-72639e9439e5 handling task scheduled 01:09:57.330 [pool-6-thread-8] INFO o.a.h.messaging.handling.HelixTask.call 78 o.a.h.messaging.handling.HelixTask - handling task: dad1b33a-0f61-48fe-b770-72639e9439e5 begin, at: 1403725197330 01:09:57.330 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.m.handling.HelixTaskExecutor.scheduleTask 229 o.a.h.m.handling.HelixTaskExecutor - Scheduling message: f1878c09-bcc2-4d3a-a515-54b03f81094c 01:09:57.331 [pool-6-thread-7] INFO c.l.d.c.r.DatabusV2ClusterRegistrationImpl.onLostPartitionOwnership 865 c.l.d.c.r.D.pz_mysql_cluster - Partition (3) getting removed !! 01:09:57.331 [pool-6-thread-9] INFO o.a.h.messaging.handling.HelixTask.call 78 o.a.h.messaging.handling.HelixTask - handling task: f1878c09-bcc2-4d3a-a515-54b03f81094c begin, at: 1403725197331 01:09:57.331 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.h.m.handling.HelixTaskExecutor.scheduleTask 257 o.a.h.m.handling.HelixTaskExecutor - Message: f1878c09-bcc2-4d3a-a515-54b03f81094c handling task scheduled 01:09:57.331 [pool-6-thread-7] INFO c.l.d.c.DatabusSourcesConnection.stop 513 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_3 - Stopping ... 
:true 01:09:57.331 [ZkClient-EventThread-28-pg-stage1.nm.flipkart.com:2181,pg-stage2.nm.flipkart.com:2181,pg-stage3.nm.flipkart.com:2181] INFO o.a.helix.manager.zk.CallbackHandler.invoke 227 o.a.helix.manager.zk.CallbackHandler - 28 END:INVOKE /pz_mysql_cluster/INSTANCES/pg-stage1-11125-416586186/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor Took: 63ms 01:09:57.332 [pool-6-thread-7] INFO c.l.d.c.DatabusSourcesConnection$SourcesConnectionStatus.shutdown 749 c.l.d.c.DatabusComponentStatus_conn[AnyPPart_alltypes_AnyPPart_TransactionMaster_AnyPPart_TransactionUserDetail_AnyPPart_TransactionSaleDetail_AnyPPart_TransactionAuditSummary_AnyPPart_TransactionFingerprint_AnyPPart_TransactionCardDetail_AnyPPart_TransactionFraudStatus_AnyPPart_TransactionPGResponse]_ClusterId_3 - shutting down connection ... 01:09:57.333 [pool-6-thread-7] INFO c.l.databus.client.BasePullThread.shutdown 263 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_3 - mbean unregistered 01:09:57.333 [pool-6-thread-7] INFO c.l.d.c.a.AbstractActorMessageQueue.shutdown 375 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_3 - ClusterId_3-RelayPuller: shutdown requested. 01:09:57.334 [ClusterId_3-RelayPuller] WARN c.l.d.c.a.AbstractActorMessageQueue.enqueueMessage 343 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_3 - ClusterId_3-RelayPuller: shutdown requested: ignoring REQUEST_STREAM 01:09:57.334 [pool-6-thread-7] INFO c.l.d.c.a.AbstractActorMessageQueue.shutdown 375 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_3 - ClusterId_3-RelayDispatcher: shutdown requested. 01:09:57.334 [ClusterId_3-RelayPuller] INFO c.l.databus.client.RelayPullThread.onShutdown 176 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_3 - closing open connection during onShutdown() 01:09:57.335 [pool-6-thread-7] INFO c.l.d.c.DatabusSourcesConnection$SourcesConnectionStatus.shutdown 772 c.l.d.c.DatabusComponentStatus_conn[AnyPPart_alltypes_AnyPPart_TransactionMaster_AnyPPart_TransactionUserDetail_AnyPPart_TransactionSaleDetail_AnyPPart_TransactionAuditSummary_AnyPPart_TransactionFingerprint_AnyPPart_TransactionCardDetail_AnyPPart_TransactionFraudStatus_AnyPPart_TransactionPGResponse]_ClusterId_3 - connection shut down. 
01:09:57.335 [ClusterId_3-RelayDispatcher] WARN c.l.d.c.DbusEventBuffer$DbusEventIterator.await 720 c.l.databus.core.DbusEventBuffer - DbusEventIterator: {identifier: ClusterId_3-RelayDispatcher.iter-2048003251, currentPosition: 1629711:[GenId=0;Index=0(lim=8388608,cap=8388608);Offset=1629711], iteratorTail: 1629711:[GenId=0;Index=0(lim=8388608,cap=8388608);Offset=1629711], lockToken={ownerName:ClusterId_3-RelayDispatcher.iter-2048003251, range:Range [start=1629711, end=1629711], created:1403725050480, lastUpdated:1403725099090}}: await/refresh interrupted

jagadeesh-huliyar commented 10 years ago

Another thing I can see is that in the cases where it does not work, I am not getting the callback in onDropPartition of the PartitionListener in the first client. In the cases where it does work, I do get the onDropPartition callback in the first client's PartitionListener.

chavdar commented 10 years ago

What does ZK say about the state of the partitions that got reassigned?


jagadeesh-huliyar commented 10 years ago

ZK has the correct data. I checked ZK, and under INSTANCES it shows the correct data: for each instance it shows that 3 partitions are ONLINE. Sometimes it works and sometimes it does not. The log above is from an instance where it did not work.

chavdar commented 10 years ago

Can you try querying on the client admin port?
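For reference, a hedged example of such a query (the port 11125 and the clientState endpoints are taken from the follow-up below; adjust host and port for your deployment):

curl "http://localhost:11125/clientState/registrations?pretty"
curl "http://localhost:11125/clientState/registration/ClusterId_0?pretty"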


jagadeesh-huliyar commented 10 years ago

Below is the output of /clientState/registrations?pretty for both clients. The ids being shown are -1 and -32768.

The code in my partition listener is this.
reg.withRegId(new RegistrationId("ClusterId_" + partitionInfo.getPartitionId()));
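For context, a minimal sketch of what such a listener might look like. Only the withRegId call and the onAddPartition/onDropPartition callback names come from this thread; the package, interface, and parameter types shown are assumptions about the databus-client API and may differ:

// Assumed package locations and signatures; verify against the actual databus-client API.
import com.linkedin.databus.client.pub.DatabusRegistration;
import com.linkedin.databus.client.pub.DbusPartitionInfo;
import com.linkedin.databus.client.pub.DbusPartitionListener;
import com.linkedin.databus.client.pub.RegistrationId;

public class LoggingPartitionListener implements DbusPartitionListener
{
  @Override
  public void onAddPartition(DbusPartitionInfo partitionInfo, DatabusRegistration reg)
  {
    // Give each per-partition child registration a predictable id, as in the
    // snippet above, so it can be looked up on the admin port later.
    reg.withRegId(new RegistrationId("ClusterId_" + partitionInfo.getPartitionId()));
  }

  @Override
  public void onDropPartition(DbusPartitionInfo partitionInfo, DatabusRegistration reg)
  {
    // Log the drop; per the discussion in this thread, this callback is
    // sometimes not invoked on the first client when the problem reproduces.
    System.out.println("Dropped partition " + partitionInfo.getPartitionId());
  }
}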

So I was expecting that the ids would be ClusterId_0, ClusterId_1, and so on. Client-1's output is:

{ "pz_mysqlcluster" : [ { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionPGResponse", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionPGResponse", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionFingerprint", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionFingerprint", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionSaleDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionSaleDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionUserDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionUserDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.databus_sandbox.alltypes", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.databussandbox.alltypes", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, 
"slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionCardDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionCardDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionMaster", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionMaster", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionFraudStatus", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionFraudStatus", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "*", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionAuditSummary", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionAuditSummary", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } } ] }

Client-2's output is:

{ "pz_mysqlcluster" : [ { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionPGResponse", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionPGResponse", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionFingerprint", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionFingerprint", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionSaleDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionSaleDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionUserDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionUserDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.databus_sandbox.alltypes", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.databussandbox.alltypes", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, 
"slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionCardDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionCardDetail", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionMaster", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionMaster", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionFraudStatus", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionFraudStatus", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } }, { "physicalSource" : { "role" : { "role" : "ANY" }, "uri" : "databus:physical-source:ANY", "anySourceWildcard" : true, "wildcard" : true, "resourceKey" : "", "masterSourceWildcard" : false, "slaveSourceWildcard" : false }, "physicalPartition" : { "name" : "*", "id" : -1, "anyPartitionWildcard" : true, "wildcard" : true }, "logicalPartition" : { "id" : -1, "source" : { "name" : "pz.payment.TransactionAuditSummary", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false }, "wildcard" : true, "allPartitionsWildcard" : true }, "logicalSource" : { "name" : "pz.payment.TransactionAuditSummary", "id" : -32768, "allSourcesWildcard" : false, "wildcard" : false } } ] }

When I try to fetch a single registration (http://localhost:11125/clientState/registration/ClusterId_0?pretty) I get:

{"message":"Infinite recursion (StackOverflowError) (through reference chain: com.linkedin.databus.client.registration.DatabusV2ClusterRegistrationImpl[\"regMap\"]->java.util.HashMap[\"1\"]->com.linkedin.databus.client.registration.DatabusClusterChildRegistrationImpl[\"parent\"]->com.linkedin.databus.client.registration.DatabusV2ClusterRegistrationImpl[\"regMap\"]->java.util.HashMap[\"1\"]->com.linkedin.databus.client.registration.DatabusClusterChildRegistrationImpl[\"parent\"]->com.linkedin.databus.client.registration.DatabusV2ClusterRegistrationImpl[\"regMap\"]->java.util.HashMap[\"1\"]-
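The "Infinite recursion (StackOverflowError) (through reference chain: ...)" message looks like Jackson's serialization-cycle error: the registration's regMap contains child registrations whose parent field points back to the registration. Purely as an illustration of that kind of cycle (made-up class names, not the actual databus types or the actual fix), the usual way to break it is to keep the back-reference out of serialization:

import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;

// Made-up classes that mimic the parent/child reference cycle in the error
// above. Jackson 2 annotations are shown; Jackson 1.x has equivalents.
class ParentRegistration
{
  public ChildRegistration child = new ChildRegistration(this);
}

class ChildRegistration
{
  @JsonIgnore // do not follow the back-reference to the parent while serializing
  public ParentRegistration parent;

  ChildRegistration(ParentRegistration parent) { this.parent = parent; }
}

class CycleDemo
{
  public static void main(String[] args) throws Exception
  {
    // Without @JsonIgnore this call would fail with the same
    // "Infinite recursion (StackOverflowError)" message quoted above.
    System.out.println(new ObjectMapper().writeValueAsString(new ParentRegistration()));
  }
}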

On ZooKeeper I can see the following for client-1, which is the one not working after client-2 comes up.

[zk: localhost:2181(CONNECTED) 2] get /pz_mysql_cluster/INSTANCES/pg-stage1-11125-374018086/CURRENTSTATES/246d4d310890000/default-resource { "id":"default-resource" ,"simpleFields":{ "BATCH_MESSAGE_MODE":"false" ,"BUCKET_SIZE":"0" ,"SESSION_ID":"246d4d310890000" ,"STATE_MODEL_DEF":"OnlineOffline" ,"STATE_MODEL_FACTORY_NAME":"DEFAULT" } ,"listFields":{ } ,"mapFields":{ "default-resource_0":{ "CURRENT_STATE":"ONLINE" } ,"default-resource_1":{ "CURRENT_STATE":"ONLINE" } ,"default-resource_2":{ "CURRENT_STATE":"ONLINE" } } } cZxid = 0x4000001a57 ctime = Fri Jun 27 18:30:06 IST 2014 mZxid = 0x4000001ab5 mtime = Fri Jun 27 18:42:18 IST 2014 pZxid = 0x4000001a57 cversion = 0 dataVersion = 10 aclVersion = 0 ephemeralOwner = 0x0 dataLength = 463 numChildren = 0

jagadeesh-huliyar commented 10 years ago

There is one more very surprising thing. When I run both clients on one machine (with different ports) it works fine; I am running both clients on my local Mac. When I run both clients on my Mac but connect them to the staging relay server, it still works fine.

The issue only comes up when I run the clients on separate machines on the staging or production servers.

jagadeesh-huliyar commented 10 years ago

I went through the code, the logs, and a thread dump, and I can see the following: the rebalancer runs and tells the client to shut down certain partitions that it is running. The client tries to shut down those partitions, but the process never completes.

Some partitions are shut down, but others remain and are stuck. From various logs (from many runs in which the issue was reproduced) I feel this could be due to a thread-safety issue; for one thing, it is not always reproducible.

I can see the following in the code. In the stop() method of the DatabusSourcesConnection class, the RelayPuller and RelayDispatcher are first sent a request to shut down. Since these run in different threads, the request is sent by setting a volatile variable to a non-null value; the puller and dispatcher keep checking this variable for "non-null". After sending the request, stop() then calls the awaitShutdown() methods of the relay puller and relay dispatcher. There it basically tries to acquire a lock and waits on a condition (for shutdown being signalled by the relay puller and dispatcher).

I can see from the logs that, after the shutdown request is sent, the log messages before acquiring the lock ("waiting for shutdown") appear, but there are no log messages after acquiring the lock. So it is stuck waiting to acquire the lock.

However, in the other thread the relay puller has acquired the lock and will relinquish it only when it gets the signal to shut down, and this signal is in the form of a null/non-null setting of a volatile variable.

This is definitely not safe, especially on a multi-core/multi-processor system. An AtomicBoolean should have been used, or adequate locking or synchronization should have been done. It can happen that a non-null value is written to the variable but is never flushed to main memory and remains only in the processor cache; it is guaranteed to be written to memory only if Java determines that it can be accessed by multiple threads. So either an AtomicBoolean or synchronization/locking should have been used to make this thread safe.
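To make the hang described above easier to follow, here is a simplified, hypothetical sketch of that signaling pattern. The names are illustrative and this is not the actual databus code; it only mirrors the structure described in this comment (a volatile request flag plus a lock/condition pair used by awaitShutdown()):

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Stand-in for a puller/dispatcher actor thread.
class WorkerShutdownSketch implements Runnable
{
  private volatile Object _shutdownRequest;               // null = keep running
  private final ReentrantLock _lock = new ReentrantLock();
  private final Condition _shutdownDone = _lock.newCondition();
  private boolean _isShutdown;

  // Called by stop(): request shutdown by making the volatile field non-null.
  public void requestShutdown() { _shutdownRequest = new Object(); }

  // Called by stop() next: block until the worker signals that it has exited.
  // If the worker holds _lock and never observes the request, the caller
  // blocks here, which matches the "waiting for shutdown" hang in the logs.
  public void awaitShutdown() throws InterruptedException
  {
    _lock.lock();
    try
    {
      while (!_isShutdown) _shutdownDone.await();
    }
    finally { _lock.unlock(); }
  }

  @Override
  public void run()
  {
    while (_shutdownRequest == null)
    {
      // ... pull / dispatch work here ...
    }
    _lock.lock();
    try
    {
      _isShutdown = true;                                 // mark done and wake the waiter
      _shutdownDone.signalAll();
    }
    finally { _lock.unlock(); }
  }
}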

Also, even if I assume that the relay puller and dispatcher are no longer alive, I should still have seen the _log.info("Stopped ... "); message at the end of the stop() method, which I don't see.

I can also see from the thread dumps taken that the thread is still waiting on the condition or the lock in the awaitShutdown() method. However, sometimes the RelayPuller thread (for a partition that has not been shut down) is not in the thread dump, yet the log message at the end of the stop() method is still not there.

Pasting snippets of logs from various instances Instance-1 04:38:28.160 [pool-7-thread-9] INFO c.l.d.c.r.DatabusV2ClusterRegistrationImpl.onLostPartitionOwnership 865 c.l.d.c.r.D.pz_mysql_cluster - Partition (5) getting removed !! 04:38:28.161 [pool-7-thread-9] INFO c.l.d.c.DatabusSourcesConnection.stop 513 ClusterId_5 - Stopping ... :true 04:38:28.161 [pool-7-thread-9] INFO c.l.d.c.DatabusSourcesConnection$SourcesConnectionStatus.shutdown 749 c.l.d.c.DatabusComponentStatus_conn[AnyPPart_alltypes_AnyPPart_TransactionMaster_AnyPPart_TransactionUserDetail_AnyPPart_TransactionSaleDetail_AnyPPart_TransactionAuditSummary_AnyPPart_TransactionFingerprint_AnyPPart_TransactionCardDetail_AnyPPart_TransactionFraudStatus_AnyPPart_TransactionPGResponse]_ClusterId_5 - shutting down connection ... 04:38:28.161 [pool-7-thread-9] INFO c.l.databus.client.BasePullThread.shutdown 263 ClusterId_5 - mbean unregistered 04:38:28.161 [pool-7-thread-9] INFO c.l.d.c.a.AbstractActorMessageQueue.shutdown 375 ClusterId_5 - ClusterId_5-RelayPuller: shutdown requested. 04:38:28.162 [pool-7-thread-9] INFO c.l.d.c.a.AbstractActorMessageQueue.shutdown 375 ClusterId_5 - ClusterId_5-RelayDispatcher: shutdown requested. 04:38:28.162 [pool-7-thread-9] INFO c.l.d.c.DatabusSourcesConnection$SourcesConnectionStatus.shutdown 772 c.l.d.c.DatabusComponentStatus_conn[AnyPPart_alltypes_AnyPPart_TransactionMaster_AnyPPart_TransactionUserDetail_AnyPPart_TransactionSaleDetail_AnyPPart_TransactionAuditSummary_AnyPPart_TransactionFingerprint_AnyPPart_TransactionCardDetail_AnyPPart_TransactionFraudStatus_AnyPPart_TransactionPGResponse]_ClusterId_5 - connection shut down. 04:38:28.162 [ClusterId_5-RelayDispatcher] WARN c.l.d.c.DbusEventBuffer$DbusEventIterator.await 720 c.l.databus.core.DbusEventBuffer - DbusEventIterator: {identifier: ClusterId_5-RelayDispatcher.iter-941066641, currentPosition: 1695605:[GenId=0;Index=0(lim=8388608,cap=8388608);Offset=1695605], iteratorTail: 1695605:[GenId=0;Index=0(lim=8388608,cap=8388608);Offset=1695605], lockToken={ownerName:ClusterId_5-RelayDispatcher.iter-941066641, range:Range [start=1695605, end=1695605], created:1403910400796, lastUpdated:1403910453542}}: await/refresh interrupted 04:38:28.162 [ClusterId_5-Nanny] INFO c.l.d.c.DatabusSourcesConnection$NannyRunnable.run 704 ClusterId_5 - nanny: who woke me up? 04:38:28.163 [ClusterId_5-RelayDispatcher] INFO c.l.d.c.a.AbstractActorMessageQueue.performShutdown 118 ClusterId_5 - ClusterId_5-RelayDispatcher shutdown. 04:38:28.162 [pool-7-thread-9] INFO c.l.d.c.DatabusSourcesConnection.stop 520 ClusterId_5 - shutting down relay puller ... 
04:38:28.163 [ClusterId_5-RelayDispatcher] INFO c.l.d.c.a.AbstractActorMessageQueue.run 291 ClusterId_5 - Message Queue History (earliest first) at shutdown:[DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED, DispatcherState:CLOSED] 04:38:28.163 [pool-7-thread-9] INFO c.l.d.c.a.AbstractActorMessageQueue.awaitShutdown 381 ClusterId_5 - ClusterId_5-RelayPuller: waiting for shutdown

Instance-2 06:19:52.779 [pool-7-thread-8] INFO c.l.d.c.r.DatabusV2ClusterRegistrationImpl.onLostPartitionOwnership 865 c.l.d.c.r.D.pz_mysql_cluster - Partition (5) getting removed !! 06:19:52.780 [pool-7-thread-8] INFO c.l.d.c.DatabusSourcesConnection.stop 513 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_5 - Stopping ... :true 06:19:52.780 [pool-7-thread-8] INFO c.l.d.c.DatabusSourcesConnection$SourcesConnectionStatus.shutdown 749 c.l.d.c.DatabusComponentStatus_conn[AnyPPart_alltypes_AnyPPart_TransactionMaster_AnyPPart_TransactionUserDetail_AnyPPart_TransactionSaleDetail_AnyPPart_TransactionAuditSummary_AnyPPart_TransactionFingerprint_AnyPPart_TransactionCardDetail_AnyPPart_TransactionFraudStatus_AnyPPart_TransactionPGResponse]_ClusterId_5 - shutting down connection ... 06:19:52.781 [pool-7-thread-8] INFO c.l.databus.client.BasePullThread.shutdown 263 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_5 - mbean unregistered 06:19:52.781 [pool-7-thread-8] INFO c.l.d.c.a.AbstractActorMessageQueue.shutdown 375 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_5 - ClusterId_5-RelayPuller: shutdown requested. 06:19:52.781 [ClusterId_5-RelayPuller] WARN c.l.d.c.a.AbstractActorMessageQueue.enqueueMessage 343 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_5 - ClusterId_5-RelayPuller: shutdown requested: ignoring REQUEST_STREAM 06:19:52.781 [pool-7-thread-8] INFO c.l.d.c.a.AbstractActorMessageQueue.shutdown 375 databus_sandbox_alltypes_payment_TransactionAuditSummary_payment_TransactionCardDetail_payment_TransactionFingerprint_payment_TransactionFraudStatus_payment_TransactionMaster_payment_TransactionPGResponse_payment_TransactionSaleDetail_payment_TransactionUserDetail_ClusterId_5 - ClusterId_5-RelayDispatcher: shutdown requested.

phanindraganti commented 10 years ago

Hi Jagadish,

Thanks for describing the issue in detail and pasting in the relevant logs. I am summarizing the issues observed during the various phases of your debugging and the likely leads to look into further for each of them.

(1) Starting clients in STG/PROD on different machines runs into the issue where databus does not get shutdown notifications:

(2) A thread-synchronization issue may be a possibility, but I would first want to make sure that we are receiving the notifications from ZK. Also, can we look at the GC logs around that time and see if there were pauses long enough to cause the session to terminate? (Example GC-logging flags are sketched after this list.)

(3) Also, how is a client terminated? Is it by SIGTERM (kill), i.e., not kill -9?
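As a hedged illustration only: GC pause logging on a Java 6/7-era JVM is typically enabled with flags along these lines (exact flag names vary by JVM version and are not specific to databus):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<path-to-gc.log>

Long pauses in that log around the time of the rebalance would support the GC/session-expiry theory.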

--Phani


jagadeesh-huliyar commented 10 years ago

Hi Phani

Thanks for your reply. I have tried this and reproduced it many times over the weekend and today. I had pasted only the relevant logs last time (otherwise it would have been huge). I can clearly see that the notification is being received from Zookeeper.

Also, maxClientCnxns is 50 in Zookeeper and the number of active connections is only 5.

To keep things simple, I changed the number of partitions to 2. The notification is received and it is being processed.

The issue is that the shutdown process for a partition does not complete. I can see the following trace being called (screenshot): https://cloud.githubusercontent.com/assets/4444363/3434888/e6e8933c-008e-11e4-9df9-7c25b820a178.png

The stop method of DatabusSourcesConnection does not complete its execution. Whenever it does complete, the system works fine (like I said earlier, it works fine sometimes). Below are the places where it gets stuck in different instances of my run:

  1. Case Number 1: The stop method checks whether the RelayPuller thread is alive, and only if it is alive does it call awaitShutdown on the RelayPuller; inside awaitShutdown it waits on _shutdownCondition. From my code review I found that signalAll on this condition is called from the run method of the RelayPuller thread, so this signalAll is never being called. In the thread dump I cannot see the RelayPuller thread, so between the check to isAlive and the wait on the condition it has died. I am not able to figure out how ... really confusing.
  2. Case Number 2: The thread enters the awaitShutdown method but there are no loggers after that. The possibility is that _controlLock is not being acquired. (A simplified sketch of this shutdown handshake follows this list.)
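
To make the two cases above easier to follow, here is a minimal, hypothetical sketch of the kind of shutdown handshake being described: the caller requests shutdown by setting a flag and then blocks in awaitShutdown() on a condition that only the worker thread signals at the end of its run() method. The field and method names (_shutdownRequest, _controlLock, _shutdownCondition, awaitShutdown) mirror the ones mentioned in this thread, but this is a simplified illustration under those assumptions, not the actual Databus implementation.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Simplified actor: shutdown() only requests a stop; awaitShutdown() blocks
// until the worker thread signals that it has actually exited its loop.
class ActorShutdownSketch implements Runnable {
    private volatile Object _shutdownRequest;   // set non-null to request shutdown
    private volatile boolean _shutdown;         // set by the worker when it has exited
    private final ReentrantLock _controlLock = new ReentrantLock();
    private final Condition _shutdownCondition = _controlLock.newCondition();

    public void shutdown() {
        _shutdownRequest = new Object();        // asynchronous request only; does not wait
    }

    // Called from stop(); hangs forever if the worker never reaches signalAll() below.
    public void awaitShutdown() throws InterruptedException {
        _controlLock.lock();
        try {
            while (!_shutdown) {
                _shutdownCondition.await();
            }
        } finally {
            _controlLock.unlock();
        }
    }

    @Override
    public void run() {
        try {
            while (_shutdownRequest == null) {
                Thread.sleep(10);               // stand-in for pull/dispatch work
            }
        } catch (InterruptedException ignored) {
            // fall through and signal shutdown
        } finally {
            _controlLock.lock();
            try {
                _shutdown = true;
                _shutdownCondition.signalAll(); // wakes awaitShutdown()
            } finally {
                _controlLock.unlock();
            }
        }
    }
}
```

In a sketch like this, if the worker thread dies without ever reaching the signalAll() in its finally block, awaitShutdown() blocks indefinitely, which matches the symptom described in Case Number 1.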

Similarly there are 2 or 3 more cases where the stop method of DatabusSourcesConnection does not complete execution.

I also changed the _shutdownRequest to an AtomicBoolean ... still no effect. It acquires the lock sometimes, but later on it gets stuck somewhere else.

Also, since the locks are java.util.concurrent.locks locks and not "synchronized" blocks, we cannot figure out who is holding the lock from the thread dump.

I also checked Zookeeper logs. I cannot see any issue.

Moreover, when this happens (client 2 is started), client 1 just gets stuck. Client 2 continues to get updates only for its partitions and not for those of client 1. Since client 1 is stuck, it does not process its partitions.

Answer to your question: the client is being stopped using kill -1 (shutdown hooks are getting executed).

Some more background information: We plan to have Databus as our de facto change propagation system at Flipkart (Flipkart is an Indian e-commerce company). For this we have made changes in the Relay to support multiple sources (like HBase; we use NGDATA SEP to feed into Databus as a producer) and another producer based on Netflix Zeno. We have made changes in the Relay for this; see https://github.com/Flipkart/aesop for more info.

Do you think changes in the Relay can affect the client cluster?

One more question: do you use the client cluster in production? If yes, have you observed any issue similar to what I have described?

phanindraganti commented 10 years ago

Hi Jagadeesh, partial answers:

(1) Relay changes should not affect client cluster shutdown (at least not for the workflow we are discussing).

(2) This is a primary use case for us and is used extensively in production.

--Phani


jagadeesh-huliyar commented 10 years ago

Everything is working fine. There was an issue with the loggers. We are using slf4j. Somehow, after the second client comes up, the logs on the first client would stop working.

However, Databus processing is working fine and I can see data being written to the destination data source.

I am really sorry for the inconvenience caused to you guys.

phanindraganti commented 10 years ago

Jagadeesh -

Glad to note that you are unblocked. For my understanding:

Cheers! Phani


jagadeesh-huliyar commented 10 years ago

Hi Phani

Sorry for the delayed response. Since your questions centered around production status, etc., I thought I would reply once that status was clear.

We have been in production with the change propagation system for the past 6 days. Everything is working fine. We also got good numbers during load testing; I can share those if you need.

The issue with the logs was the following. I had set the prudent property to true. This is required if multiple JVMs write to the same file. I had just set this and was using it locally. However, in production there were the issues that I mentioned earlier in this thread. I removed this property and everything is working fine now. So I don't think you have anything to worry about unless you have set this property to true.
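
Assuming Logback as the slf4j backend (the thread only names slf4j), the property being described looks roughly like the fragment below in a logback.xml. The appender name, file path, and pattern here are made-up examples, not the actual configuration used.

```xml
<!-- Illustrative logback.xml fragment; appender name and file path are hypothetical. -->
<appender name="FILE" class="ch.qos.logback.core.FileAppender">
  <file>/var/log/databus/client.log</file>
  <!-- prudent mode lets several JVMs append safely to the same log file -->
  <prudent>true</prudent>
  <encoder>
    <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger - %msg%n</pattern>
  </encoder>
</appender>
```

In prudent mode the appender coordinates writes so that multiple JVMs can safely append to the same file, at some cost in write performance.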

Use case: Our use case is currently mostly that of change propagation. In our team we are building a data platform, and we will have different data stores for different use cases. The data first enters MySQL; from there we need to move it to different data stores for different use cases. Some of these use cases are real-time in nature, and hence we need real-time replication of data. This is the primary use case. However, once this is stable and we have run it in our team for some time, it will be used across teams in Flipkart for this use case as well as others, like DC-to-DC replication. That is why we have added an HBase producer and a Netflix Zeno-based producer.

I have a couple of questions:

  1. We now want to start handling DDL events (CREATE, ALTER TABLE, etc.) in the MySQL producer as well. I saw in the Databus code base that there are control events, and these are sent to all clients irrespective of partitions. Can these be used for this purpose? How are you sending DDL events to the destination?
  2. I have seen the wiki and your plan for Relay HA. Is there any progress on that? Are some parts of it built, etc.?