spring-cloud / spring-cloud-zookeeper

Spring Cloud Zookeeper
http://cloud.spring.io/spring-cloud-zookeeper/
Apache License 2.0
556 stars 413 forks source link

Spring cloud config client config server discovery creates a lot (hundreds) of connections to zookeeper. #322

Open geekonek opened 1 year ago

geekonek commented 1 year ago

Describe the bug Spring cloud config client config server discovery creates a lot (hundreds) of connections to zookeeper.

We are using spring config server discovery via zookeeper. After migrating to spring 6 and cloud 2022.0.x config server discovery started to create a lot of zookeeper connections (and keeping them alive).

Number of connections depend on number of services already registered in zk. And each time new service gets registered or dissapears from zk, new curator event gets published and another connection / curator instance is getting created.

Zookeeper servers are starting to reject connections due to max client connections limit, while application is trying to dos it with connection retry attempts constantly:

2023-09-08 09:45:46,161 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Opening socket connection to server /10.195.0.22:2181.
2023-09-08 09:45:46,161 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Opening socket connection to server /10.195.0.22:2181.
2023-09-08 09:45:46,161 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Socket connection established, initiating session, client: /10.195.0.22:51638, server: /10.195.0.22:2181
2023-09-08 09:45:46,162 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Socket connection established, initiating session, client: /10.195.0.22:51640, server: /10.195.0.22:2181
2023-09-08 09:45:46,162 WARN lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Session 0x0 for sever /10.195.0.22:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:348)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1262)
2023-09-08 09:45:46,162 WARN lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Session 0x0 for sever /10.195.0.22:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
2023-09-08 09:45:46,174 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Opening socket connection to server /10.195.0.22:2181.
2023-09-08 09:45:46,174 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Socket connection established, initiating session, client: /10.195.0.22:51672, server: /10.195.0.22:2181
2023-09-08 09:45:46,175 WARN lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Session 0x0 for sever /10.195.0.22:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
2023-09-08 09:45:46,250 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Opening socket connection to server /10.195.0.22:2181.
2023-09-08 09:45:46,251 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Socket connection established, initiating session, client: /10.195.0.22:51748, server: /10.195.0.22:2181
2023-09-08 09:45:46,251 WARN lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Session 0x0 for sever /10.195.0.22:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
2023-09-08 09:45:46,336 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Opening socket connection to server /10.195.0.22:2181.
2023-09-08 09:45:46,336 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Socket connection established, initiating session, client: /10.195.0.22:51832, server: /10.195.0.22:2181
2023-09-08 09:45:46,337 WARN lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Session 0x0 for sever /10.195.0.22:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
2023-09-08 09:45:46,378 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Opening socket connection to server /10.195.0.22:2181.
2023-09-08 09:45:46,379 INFO lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Socket connection established, initiating session, client: /10.195.0.22:51882, server: /10.195.0.22:2181
2023-09-08 09:45:46,379 WARN lientCnxn$SendThread [Curator-TreeCache-0-SendThread(10.195.0.22:2181)] Session 0x0 for sever /10.195.0.22:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
...

This gets called right after appplication start and creates (in my case, since zk has some stuff already registered) ~30 connections:

curatorFramework:65, CuratorFactory (org.springframework.cloud.zookeeper)
apply:176, ZookeeperConfigServerBootstrapper$ZookeeperFunction (org.springframework.cloud.zookeeper.discovery.configclient)
getConfigServerInstances:74, ConfigServerInstanceProvider (org.springframework.cloud.config.client)
refresh:90, ConfigServerInstanceMonitor (org.springframework.cloud.config.client)
heartbeat:81, ConfigServerInstanceMonitor (org.springframework.cloud.config.client)
➜ netstat -alpn  | grep 2181
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp6       0      0 10.192.1.13:49572       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34700       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34498       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34678       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34582       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34608       10.195.0.22:2181        ESTABLISHED 613337/java         
...snipo...         
tcp6       0      0 10.192.1.13:34568       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34502       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34666       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34690       10.195.0.22:2181        ESTABLISHED 613337/java         
tcp6       0      0 10.192.1.13:34486       10.195.0.22:2181        ESTABLISHED 613337/java

Multiple threads are kept alive

"Curator-ConnectionStateManager-0" #32 [617863] daemon prio=5 os_prio=0 cpu=1,28ms elapsed=26,37s tid=0x00007fb4b81d5460 nid=617863 waiting on condition  [0x00007fb4eeffc000]
"Curator-Framework-0" #35 [617866] daemon prio=5 os_prio=0 cpu=1,98ms elapsed=26,35s tid=0x00007fb4b81f7620 nid=617866 waiting on condition  [0x00007fb4eecfc000]
"Curator-ConnectionStateManager-0" #53 [617888] daemon prio=5 os_prio=0 cpu=0,36ms elapsed=25,34s tid=0x00007fb4b8518e80 nid=617888 waiting on condition  [0x00007fb4edee6000]
"Curator-Framework-0" #56 [617891] daemon prio=5 os_prio=0 cpu=0,22ms elapsed=25,34s tid=0x00007fb4b851fd20 nid=617891 waiting on condition  [0x00007fb4edbe6000]
"Curator-TreeCache-0" #138 [618030] daemon prio=5 os_prio=0 cpu=293,16ms elapsed=14,62s tid=0x00007fb454002530 nid=618030 waiting on condition  [0x00007fb3caafe000]
"Curator-ConnectionStateManager-0" #139 [618031] daemon prio=5 os_prio=0 cpu=0,27ms elapsed=14,62s tid=0x00007fb3a0004eb0 nid=618031 waiting on condition  [0x00007fb3ca9fe000]
"Curator-TreeCache-0-SendThread(10.195.0.22:2181)" #141 [618033] daemon prio=5 os_prio=0 cpu=2,45ms elapsed=14,62s tid=0x00007fb3a000a010 nid=618033 runnable  [0x00007fb3ca7fe000]
"Curator-TreeCache-0-EventThread" #142 [618034] daemon prio=5 os_prio=0 cpu=0,62ms elapsed=14,62s tid=0x00007fb3a000b900 nid=618034 waiting on condition  [0x00007fb3ca6fe000]
"Curator-Framework-0" #143 [618035] daemon prio=5 os_prio=0 cpu=0,22ms elapsed=14,62s tid=0x00007fb3a000d5f0 nid=618035 waiting on condition  [0x00007fb3ca5fe000]
"Curator-ConnectionStateManager-0" #166 [618057] daemon prio=5 os_prio=0 cpu=0,29ms elapsed=14,55s tid=0x00007fb3a0010750 nid=618057 waiting on condition  [0x00007fb4ef3fc000]
"Curator-TreeCache-0-SendThread(10.195.0.22:2181)" #167 [618058] daemon prio=5 os_prio=0 cpu=2,25ms elapsed=14,55s tid=0x00007fb3a0014640 nid=618058 runnable  [0x00007fb3c90fe000]
"Curator-TreeCache-0-EventThread" #168 [618059] daemon prio=5 os_prio=0 cpu=0,55ms elapsed=14,55s tid=0x00007fb3a0015a40 nid=618059 waiting on condition  [0x00007fb3c8ffe000]
"Curator-Framework-0" #169 [618060] daemon prio=5 os_prio=0 cpu=0,25ms elapsed=14,55s tid=0x00007fb3a0016eb0 nid=618060 waiting on condition  [0x00007fb3c8efe000]
"Curator-ConnectionStateManager-0" #172 [618139] daemon prio=5 os_prio=0 cpu=0,24ms elapsed=14,48s tid=0x00007fb3a0019d70 nid=618139 waiting on condition  [0x00007fb3c81fc000]
"Curator-TreeCache-0-SendThread(10.195.0.22:2181)" #173 [618140] daemon prio=5 os_prio=0 cpu=2,30ms elapsed=14,48s tid=0x00007fb3a001e320 nid=618140 runnable  [0x00007fb38c7f7000]
"Curator-TreeCache-0-EventThread" #174 [618141] daemon prio=5 os_prio=0 cpu=0,62ms elapsed=14,48s tid=0x00007fb3a001f7f0 nid=618141 waiting on condition  [0x00007fb38c6f7000]
"Curator-Framework-0" #175 [618142] daemon prio=5 os_prio=0 cpu=0,22ms elapsed=14,48s tid=0x00007fb3a0020c90 nid=618142 waiting on condition  [0x00007fb38c5f7000]
"Curator-ConnectionStateManager-0" #185 [618160] daemon prio=5 os_prio=0 cpu=0,31ms elapsed=14,41s tid=0x00007fb3a0022d10 nid=618160 waiting on condition  [0x00007fb38c2f7000]
"Curator-TreeCache-0-SendThread(10.195.0.22:2181)" #186 [618161] daemon prio=5 os_prio=0 cpu=2,56ms elapsed=14,41s tid=0x00007fb3a0026fa0 nid=618161 runnable  [0x00007fb38c1f7000]
"Curator-TreeCache-0-EventThread" #187 [618162] daemon prio=5 os_prio=0 cpu=0,84ms elapsed=14,41s tid=0x00007fb3a00283f0 nid=618162 waiting on condition  [0x00007fb3653bb000]
"Curator-Framework-0" #188 [618163] daemon prio=5 os_prio=0 cpu=0,27ms elapsed=14,41s tid=0x00007fb3a0029890 nid=618163 waiting on condition  [0x00007fb3652bb000]
...snip...

Full stack of what happens when new service gets registered in zk

<init>:106, ConnectionStateManager (org.apache.curator.framework.state)
<init>:140, CuratorFrameworkImpl (org.apache.curator.framework.imps)
build:161, CuratorFrameworkFactory$Builder (org.apache.curator.framework)
curatorFramework:65, CuratorFactory (org.springframework.cloud.zookeeper)
apply:176, ZookeeperConfigServerBootstrapper$ZookeeperFunction (org.springframework.cloud.zookeeper.discovery.configclient)
getConfigServerInstances:74, ConfigServerInstanceProvider (org.springframework.cloud.config.client)
refresh:90, ConfigServerInstanceMonitor (org.springframework.cloud.config.client)
heartbeat:81, ConfigServerInstanceMonitor (org.springframework.cloud.config.client)
onApplicationEvent:69, ConfigServerInstanceMonitor (org.springframework.cloud.config.client)
doInvokeListener:172, SimpleApplicationEventMulticaster (org.springframework.context.event)
invokeListener:165, SimpleApplicationEventMulticaster (org.springframework.context.event)
multicastEvent:143, SimpleApplicationEventMulticaster (org.springframework.context.event)
publishEvent:437, AbstractApplicationContext (org.springframework.context.support)
publishEvent:370, AbstractApplicationContext (org.springframework.context.support)
childEvent:97, ZookeeperServiceWatch (org.springframework.cloud.zookeeper.discovery)
lambda$callListeners$0:810, TreeCache (org.apache.curator.framework.recipes.cache)
accept:-1, TreeCache$$Lambda$1997/0x0000000801dbf528 (org.apache.curator.framework.recipes.cache)
lambda$forEach$0:92, MappingListenerManager (org.apache.curator.framework.listen)
run:-1, MappingListenerManager$$Lambda$346/0x0000000801209958 (org.apache.curator.framework.listen)
execute:-1, MappingListenerManager$$Lambda$341/0x00000008011e8c60 (org.apache.curator.framework.listen)
forEach:89, MappingListenerManager (org.apache.curator.framework.listen)
forEach:89, StandardListenerManager (org.apache.curator.framework.listen)
callListeners:806, TreeCache (org.apache.curator.framework.recipes.cache)
access$1800:78, TreeCache (org.apache.curator.framework.recipes.cache)
run:908, TreeCache$2 (org.apache.curator.framework.recipes.cache)
call:577, Executors$RunnableAdapter (java.util.concurrent)
run$$$capture:317, FutureTask (java.util.concurrent)
run:-1, FutureTask (java.util.concurrent)
 - Async stack trace
<init>:151, FutureTask (java.util.concurrent)
newTaskFor:98, AbstractExecutorService (java.util.concurrent)
submit:122, AbstractExecutorService (java.util.concurrent)
submit:786, Executors$DelegatedExecutorService (java.util.concurrent)
publishEvent:901, TreeCache (org.apache.curator.framework.recipes.cache)
publishEvent:893, TreeCache (org.apache.curator.framework.recipes.cache)
access$1100:78, TreeCache (org.apache.curator.framework.recipes.cache)
processResult:492, TreeCache$TreeNode (org.apache.curator.framework.recipes.cache)
sendToBackgroundCallback:892, CuratorFrameworkImpl (org.apache.curator.framework.imps)
processBackgroundOperation:649, CuratorFrameworkImpl (org.apache.curator.framework.imps)
processBackgroundOperation:152, WatcherRemovalFacade (org.apache.curator.framework.imps)
processResult:272, GetDataBuilderImpl$3 (org.apache.curator.framework.imps)
processEvent:630, ClientCnxn$EventThread (org.apache.zookeeper)
run:551, ClientCnxn$EventThread (org.apache.zookeeper)

Sample Simple example to replicate this https://github.com/geekonek/spring-cloud-zookeeper-config-server-discovery-dos-example

number of connections initially created depends on services already registered in zk. But new connections are creted each time some service registers or unregisters.