streamnative / bookkeeper-achieved

Apache Bookkeeper
https://bookkeeper.apache.org
Apache License 2.0
3 stars 2 forks source link

ISSUE-2760: Bookkeeper shutdown when zookeeper connection just timeout #398

Open sijie opened 3 years ago

sijie commented 3 years ago

Original Issue: apache/bookkeeper#2760


Describe the bug Bookkeeper begin to shutdown after a zookeeper connection timeout. Session is not expired and it was attempting reconnect. In the log, there is a ConnectionLossException when a node was deleted in org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager#releaseUnderreplicatedLedger. It can not be processed in org.apache.bookkeeper.replication.ReplicationWorker#deferLedgerLockRelease and shutdown was called. image It shoul be retried as syncCallWithRetries was called. One cause is that the maxRetries of the retry policy was setted to be 0. The zk client is initialized in org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase#initialize. And the retrying policy is created in org.apache.bookkeeper.meta.zk.ZKMetadataClientDriver#initialize. image The maxRetries is setted 0. Can we set it to be Integer.MAX_VALUE?

To Reproduce

Steps to reproduce the behavior:

  1. Run bookkeeper
  2. release ledger lock when zk timeout

LOG: 06:18:45.732 [main-SendThread(10.168.114.200:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x37b0a37639d001d for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x37b0a37639d001d, likely server has closed socket at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]

06:18:45.757 [main-SendThread(10.168.114.200:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x47b0a3221890013 for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x47b0a3221890013, likely server has closed socket at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] 06:18:45.859 [PendingReplicationTimer] ERROR org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Error deleting underreplicated ledger lock org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574 at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1] at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131] at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131] 06:18:45.867 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:604) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131] Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/ledgers/0000/0000/0010/7cc4/urL0001080516 at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2256) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2281) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:519) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1] at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1] at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1] at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1] at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1] at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:597) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] ... 4 more 06:18:45.867 [PendingReplicationTimer] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments of ledger 826574 org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:635) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1] at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131] at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131] Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574 at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3] at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1] ... 3 more 06:18:45.867 [PendingReplicationTimer] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker 06:18:45.867 [PendingReplicationTimer] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down ReplicationWorker 06:18:45.867 [main-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED event. 06:18:45.874 [ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited loop! 06:18:46.516 [AutoRecoveryDeathWatcher-3181] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,exiting the watch loop! 06:18:46.516 [AutoRecoveryDeathWatcher-3181] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: bookie-server because of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main] java.lang.RuntimeException: AutoRecovery is not running any more at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] 06:18:46.528 [component-shutdown-thread] INFO org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook. 06:18:46.530 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto recovery: 0 06:18:46.530 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery 06:18:46.532 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.Auditor - Shutting down auditor 06:18:46.532 [AuditorElector-10.90.126.13:3181] INFO org.apache.bookkeeper.replication.AuditorElector - Shutting down AuditorElector 06:18:46.532 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker

StevenLuMT commented 2 years ago

bugfix in pr https://github.com/apache/bookkeeper/pull/2958