vert-x3 / vertx-zookeeper

Zookeeper based cluster manager implementation

vertx-zookeeper doesn't gracefully disconnect from zookeeper server #101

Open srinskit opened 4 years ago

srinskit commented 4 years ago

Version

Vertx (and everything else): 3.9.1
Zookeeper server: checked with 3.4 and 3.6

Context

When a vertx instance using vertx-zookeeper exits, a socket exception is thrown at the zookeeper server, and the server keeps the records from that instance until the session times out. During that window, other vertx instances attempt to connect to dead services.

Reproducer

A calculator application with an API-server verticle and an adder-service verticle. Repo: https://github.com/srinskit/redesigned-enigma. The latest Zookeeper server docker image running at localhost on the default port with default configurations. Ref: https://hub.docker.com/_/zookeeper/.

  1. Start the Zookeeper server:

     ```sh
     docker run -p 2181:2181 zookeeper:latest
     ```

  2. Build the reproducer package:

     ```sh
     mvn clean package
     ```

  3. Start vertx instance 1 (API server and Adder service):

     ```sh
     # Directory needed by API server
     mkdir data
     java -jar target/calc-0.0.1-SNAPSHOT-fat.jar -m adder-service api-server --host localhost --zookeepers localhost
     ```

  4. Start vertx instance 2 (just the Adder service):

     ```sh
     java -jar target/calc-0.0.1-SNAPSHOT-fat.jar -m adder-service --host localhost --zookeepers localhost
     ```

  5. Check the zookeeper tree:

     ```sh
     zookeepercli -servers localhost -c lsr /io.vertx
     ```

     Two adder service addresses are present, as expected:

     ```
     asyncMultiMap
     asyncMultiMap/__vertx.subs
     asyncMultiMap/__vertx.subs/adder-service-address
     asyncMultiMap/__vertx.subs/adder-service-address/ab3d7101-ee73-4460-8f36-fc7a42aae813:localhost:36823
     asyncMultiMap/__vertx.subs/adder-service-address/c94b5ad8-15ec-4847-9e94-768040eed01a:localhost:38607
     cluster
     cluster/nodes
     cluster/nodes/ab3d7101-ee73-4460-8f36-fc7a42aae813
     cluster/nodes/c94b5ad8-15ec-4847-9e94-768040eed01a
     locks
     locks/__cluster_init_lock
     locks/__cluster_init_lock/leases
     locks/__cluster_init_lock/locks
     syncMap
     syncMap/__vertx.haInfo
     syncMap/__vertx.haInfo/ab3d7101-ee73-4460-8f36-fc7a42aae813
     syncMap/__vertx.haInfo/c94b5ad8-15ec-4847-9e94-768040eed01a
     ```

  6. Close instance 2 with CTRL-C.

The Zookeeper server immediately logs this warning:

```
2020-07-10 13:10:10,102 [myid:1] - WARN  [NIOWorkerThread-6:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.17.0.1:53626, session = 0x100014c6f350004
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
```

Checking the zookeeper tree again shows that the adder-service entry from instance 2 still exists:

```
asyncMultiMap
asyncMultiMap/__vertx.subs
asyncMultiMap/__vertx.subs/adder-service-address
asyncMultiMap/__vertx.subs/adder-service-address/ab3d7101-ee73-4460-8f36-fc7a42aae813:localhost:36823
asyncMultiMap/__vertx.subs/adder-service-address/c94b5ad8-15ec-4847-9e94-768040eed01a:localhost:38607
cluster
cluster/nodes
cluster/nodes/ab3d7101-ee73-4460-8f36-fc7a42aae813
cluster/nodes/c94b5ad8-15ec-4847-9e94-768040eed01a
locks
locks/__cluster_init_lock
locks/__cluster_init_lock/leases
locks/__cluster_init_lock/locks
syncMap
syncMap/__vertx.haInfo
syncMap/__vertx.haInfo/ab3d7101-ee73-4460-8f36-fc7a42aae813
syncMap/__vertx.haInfo/c94b5ad8-15ec-4847-9e94-768040eed01a
```

The Zookeeper server eventually expires the session:

```
2020-07-10 13:10:26,484 [myid:1] - INFO  [SessionTracker:ZooKeeperServer@600] - Expiring session 0x100014c6f350004, timeout of 20000ms exceeded
```

and the tree is fixed:

```
asyncMultiMap
asyncMultiMap/__vertx.subs
asyncMultiMap/__vertx.subs/adder-service-address
asyncMultiMap/__vertx.subs/adder-service-address/ab3d7101-ee73-4460-8f36-fc7a42aae813:localhost:36823
cluster
cluster/nodes
cluster/nodes/ab3d7101-ee73-4460-8f36-fc7a42aae813
locks
locks/__cluster_init_lock
locks/__cluster_init_lock/leases
locks/__cluster_init_lock/locks
syncMap
syncMap/__vertx.haInfo
syncMap/__vertx.haInfo/ab3d7101-ee73-4460-8f36-fc7a42aae813
```
stream-iori commented 4 years ago

Hi

The repo linked in the Reproducer section returns a 404.


srinskit commented 4 years ago

Sorry. Made it public.

srinskit commented 4 years ago

Shouldn't ZookeeperClusterManager.leave result in immediate session termination?

stream-iori commented 4 years ago

In fact, Ctrl+C raises a signal that does not invoke vertx.close() directly. You should use Runtime.getRuntime().addShutdownHook to catch the signal and then invoke vertx.close(). I have tried to do this, but it is still blocked by something; I will look into it in one or two weeks.
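
For reference, a minimal sketch of that workaround, assuming `vertx` is the clustered instance created at startup and that briefly blocking the hook is acceptable:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: close Vert.x (and with it the cluster manager) on SIGINT/SIGTERM.
// `vertx` is assumed to be the clustered Vertx instance created at startup.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    CountDownLatch latch = new CountDownLatch(1);
    vertx.close(ar -> latch.countDown()); // should end with ZookeeperClusterManager.leave
    try {
        // Block the hook briefly so the JVM doesn't exit before the close completes.
        latch.await(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}));
```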


srinskit commented 4 years ago

A similar setup, but with vertx-hazelcast + zookeeper, appears to handle this well. On exit, the relevant hazelcast znodes are instantly removed from zookeeper. I have this reproducer in the hazel_zoo_cluster branch if you would like to check it out.

srinskit commented 4 years ago

Can this be temporarily fixed with any configuration of Curator or Zookeeper (other than reducing the session timeout)?
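
For context, the session-timeout knob being ruled out above lives in the cluster manager's zookeeper.json config. A minimal sketch of setting it programmatically, assuming the key names from vertx-zookeeper's default config (values illustrative); shrinking it only narrows the stale window, it does not make the disconnect graceful:

```java
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.core.json.JsonObject;
import io.vertx.spi.cluster.zookeeper.ZookeeperClusterManager;

// Assumed key names per vertx-zookeeper's default-zookeeper.json; values are illustrative.
JsonObject zkConfig = new JsonObject()
    .put("zookeeperHosts", "localhost")
    .put("rootPath", "io.vertx")
    .put("sessionTimeout", 5000)   // ms; the logs above show the 20000ms default expiring
    .put("connectTimeout", 3000);

Vertx.clusteredVertx(
    new VertxOptions().setClusterManager(new ZookeeperClusterManager(zkConfig)),
    ar -> { /* clustered Vertx instance in ar.result() */ });
```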

wjw465150 commented 1 year ago

I found that using creatingParentContainersIfNeeded in place of creatingParentsIfNeeded solves this problem!

```java
/**
 * Causes any parent nodes to get created using {@link CreateMode#CONTAINER} if they haven't already been.
 * IMPORTANT NOTE: container creation is a new feature in recent versions of ZooKeeper.
 * If the ZooKeeper version you're using does not support containers, the parent nodes
 * are created as ordinary PERSISTENT nodes.
 *
 * @return this
 */
public ProtectACLCreateModeStatPathAndBytesable<String> creatingParentContainersIfNeeded();
```
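
For illustration, a hedged sketch of what that swap looks like at a Curator create call; the call site, path, and method below are assumptions for demonstration, not the actual vertx-zookeeper code:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;

// Hypothetical registration call: with creatingParentContainersIfNeeded the parent
// znodes are created as CONTAINER nodes, which the server deletes once their last
// child is gone, instead of lingering as ordinary PERSISTENT nodes.
void registerNode(CuratorFramework curator, String nodeId, byte[] data) throws Exception {
    curator.create()
        .creatingParentContainersIfNeeded()   // was: creatingParentsIfNeeded()
        .withMode(CreateMode.EPHEMERAL)
        .forPath("/io.vertx/cluster/nodes/" + nodeId, data);
}
```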