streamnative / pulsar-archived

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org
Apache License 2.0
73 stars 26 forks source link

ISSUE-15776: bookie failed to decommission successfully #4271

Open sijie opened 2 years ago

sijie commented 2 years ago

Original Issue: apache/pulsar#15776


Describe the bug I have four bookie , one of the bookie process hangs because the disk is full, so I try to clear the data directory and take the service offline. In addition, the number of data copies is three, and the cluster can be used normally when the bookie service hangs up

To Reproduce Steps to reproduce the behavior:

  1. Delete journalDirectories and ledgerDirectories directory data
  2. Execute command:‘ bin/bookkeeper shell listunderreplicated ’
    2022-05-25 16:40:16.0906 [main] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=10.101.129.65:2181,10.101.129.68:2181,10.101.129.70:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@6c372fe6
    2022-05-25 16:40:16.0911 [main] INFO  org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
    2022-05-25 16:40:16.0917 [main] INFO  org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 1048575 Bytes
    2022-05-25 16:40:16.0923 [main] INFO  org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false
    2022-05-25 16:40:16.0932 [main-SendThread(10.101.129.70:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server kafka-pool-prd-10-101-129-70.v-sz-1.vivo.lan/10.101.129.70:2181.
    2022-05-25 16:40:16.0933 [main-SendThread(10.101.129.70:2181)] INFO  org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
    2022-05-25 16:40:16.0937 [main-SendThread(10.101.129.70:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.101.129.65:35556, server: kafka-pool-prd-10-101-129-70.v-sz-1.vivo.lan/10.101.129.70:2181
    2022-05-25 16:40:16.0941 [main-SendThread(10.101.129.70:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server kafka-pool-prd-10-101-129-70.v-sz-1.vivo.lan/10.101.129.70:2181, session id = 0x3063889d93b4263, negotiated timeout = 30000
    2022-05-25 16:40:16.0944 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
    2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
    2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -        Ctime : 1651199961381
    2022-05-25 16:40:17.0036 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 112963
    2022-05-25 16:40:17.0036 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -        Ctime : 1650363984734
    2022-05-25 16:40:17.0037 [main-SendThread(10.101.129.70:2181)] WARN  org.apache.zookeeper.ClientCnxn - An exception was thrown while closing send thread for session 0x3063889d93b4263.
    org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x3063889d93b4263, likely server has closed socket
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
    2022-05-25 16:40:17.0143 [main] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x3063889d93b4263 closed
    2022-05-25 16:40:17.0143 [main-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3063889d93b4263
  3. Execute command:‘ bin/bookkeeper shell decommissionbookie’
    2022-05-25 16:04:09.0574 [main] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=10.101.129.65:2181,10.101.129.68:2181,10.101.129.70:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@69f1a286
    2022-05-25 16:04:09.0579 [main] INFO  org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
    2022-05-25 16:04:09.0585 [main] INFO  org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 1048575 Bytes
    2022-05-25 16:04:09.0592 [main] INFO  org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false
    2022-05-25 16:04:09.0601 [main-SendThread(10.101.129.68:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server kafka-pool-prd-10-101-129-68.v-sz-1.vivo.lan/10.101.129.68:2181.
    2022-05-25 16:04:09.0601 [main-SendThread(10.101.129.68:2181)] INFO  org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
    2022-05-25 16:04:09.0605 [main-SendThread(10.101.129.68:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.101.129.65:55334, server: kafka-pool-prd-10-101-129-68.v-sz-1.vivo.lan/10.101.129.68:2181
    2022-05-25 16:04:09.0609 [main-SendThread(10.101.129.68:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server kafka-pool-prd-10-101-129-68.v-sz-1.vivo.lan/10.101.129.68:2181, session id = 0x10130f8da34426c, negotiated timeout = 30000
    2022-05-25 16:04:09.0612 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
    2022-05-25 16:04:09.0789 [main] ERROR org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to initialize DNS Resolver org.apache.bookkeeper.net.ScriptBasedMapping, used default subnet resolver 
    java.lang.RuntimeException: No network topology script is found when using script based DNS resolver.
        at org.apache.bookkeeper.net.ScriptBasedMapping$RawScriptBasedMapping.validateConf(ScriptBasedMapping.java:163) ~[bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.net.AbstractDNSToSwitchMapping.setConf(AbstractDNSToSwitchMapping.java:81) ~[bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.net.ScriptBasedMapping.setConf(ScriptBasedMapping.java:123) ~[bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.initialize(RackawareEnsemblePlacementPolicyImpl.java:265) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.initialize(RackawareEnsemblePlacementPolicyImpl.java:80) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.BookKeeper.initializeEnsemblePlacementPolicy(BookKeeper.java:581) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.BookKeeper.<init>(BookKeeper.java:505) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.BookKeeper.<init>(BookKeeper.java:344) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.BookKeeperAdmin.<init>(BookKeeperAdmin.java:164) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.tools.cli.commands.bookies.DecommissionCommand.decommission(DecommissionCommand.java:91) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.tools.cli.commands.bookies.DecommissionCommand.apply(DecommissionCommand.java:82) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.bookie.BookieShell$DecommissionBookieCmd.runCmd(BookieShell.java:1956) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.bookie.BookieShell$MyCommand.runCmd(BookieShell.java:238) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.bookie.BookieShell.run(BookieShell.java:2278) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.bookie.BookieShell.main(BookieShell.java:2369) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
    2022-05-25 16:04:09.0818 [main] INFO  org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Initialize rackaware ensemble placement policy @ <Bookie:10.101.129.65:0> @ /default-rack : org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy$DefaultResolver.
    2022-05-25 16:04:09.0818 [main] INFO  org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Not weighted
    2022-05-25 16:04:09.0827 [main] INFO  org.apache.bookkeeper.client.BookKeeper - Weighted ledger placement is not enabled
    2022-05-25 16:04:09.0903 [main-EventThread] INFO  org.apache.bookkeeper.discover.ZKRegistrationClient - Update BookieInfoCache (writable bookie) 10.101.129.68:3181 -> BookieServiceInfo{properties={}, endpoints=[EndpointInfo{id=httpserver, port=8070, host=0.0.0.0, protocol=http, auth=[], extensions=[]}, EndpointInfo{id=bookie, port=3181, host=10.101.129.68, protocol=bookie-rpc, auth=[], extensions=[]}]}
    2022-05-25 16:04:09.0903 [main-EventThread] INFO  org.apache.bookkeeper.discover.ZKRegistrationClient - Update BookieInfoCache (writable bookie) 10.101.129.65:3181 -> BookieServiceInfo{properties={}, endpoints=[EndpointInfo{id=httpserver, port=8070, host=0.0.0.0, protocol=http, auth=[], extensions=[]}, EndpointInfo{id=bookie, port=3181, host=10.101.129.65, protocol=bookie-rpc, auth=[], extensions=[]}]}
    2022-05-25 16:04:09.0904 [main-EventThread] INFO  org.apache.bookkeeper.discover.ZKRegistrationClient - Update BookieInfoCache (writable bookie) 10.101.129.70:3181 -> BookieServiceInfo{properties={}, endpoints=[EndpointInfo{id=httpserver, port=8070, host=0.0.0.0, protocol=http, auth=[], extensions=[]}, EndpointInfo{id=bookie, port=3181, host=10.101.129.70, protocol=bookie-rpc, auth=[], extensions=[]}]}
    2022-05-25 16:04:09.0908 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-rack/10.101.129.68:3181
    2022-05-25 16:04:09.0908 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-rack/10.101.129.65:3181
    2022-05-25 16:04:09.0908 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-rack/10.101.129.70:3181
    2022-05-25 16:04:09.0945 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Resetting LostBookieRecoveryDelay value: 0, to kickstart audit task
    2022-05-25 16:05:04.0446 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 10
    2022-05-25 16:06:44.0455 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:06:54.0458 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:04.0460 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:14.0462 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:24.0464 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:34.0465 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:44.0467 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:54.0471 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:04.0473 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:14.0474 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:24.0476 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:34.0478 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:44.0480 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:54.0482 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:04.0485 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:14.0487 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:24.0489 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:34.0490 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:44.0492 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:54.0494 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:10:04.0496 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:10:14.0498 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:10:24.0500 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1

    Continuously prints "Count of Ledgers which need to be rereplicated: 1" for a whole day without ending

System configuration

Pulsar version: 2.9.2 bookeeper:4.14.4

rvalenzhwx commented 1 year ago

Hello @sijie. It seems this issue is affecting a customer and we are seeing basically the same errors on logs. The current issue is stale. So does this mean that this issue will not be fixed in the near future?