sofastack / sofa-dashboard-client

Apache License 2.0
16 stars 25 forks source link

[#7] Restore session when reconnect to zookeeper #8

Closed chpengzh closed 5 years ago

chpengzh commented 5 years ago

zookeeper断线重连时对session node进行一次恢复

该实现逻辑类似于 sofa-rpc/ZookeeperRegistry

Tips: 如果是macbook进行测试,可以合上盖子进入待机状态大约30s。让进程进入suspended状态然后恢复,此时进程会 LOST->RECONNECTED, 执行一次恢复:

详情参考如下日志:

019-06-02 08:50:10.458  INFO 26127 --- [(5)-10.37.129.2] o.a.c.c.C.[Tomcat].[localhost].[/]       : Initializing Spring FrameworkServlet 'dispatcherServlet'
2019-06-02 08:50:10.458  INFO 26127 --- [(5)-10.37.129.2] o.s.web.servlet.DispatcherServlet        : FrameworkServlet 'dispatcherServlet': initialization started
2019-06-02 08:50:10.475  INFO 26127 --- [(5)-10.37.129.2] o.s.web.servlet.DispatcherServlet        : FrameworkServlet 'dispatcherServlet': initialization completed in 17 ms
2019-06-02 08:50:59.008  INFO 26127 --- [127.0.0.1:2181)] org.apache.zookeeper.ClientCnxn          : Client session timed out, have not heard from server in 6571ms for sessionid 0x16b159ee79c001c, closing socket connection and attempting reconnect
2019-06-02 08:50:59.120  INFO 26127 --- [ain-EventThread] o.a.c.f.state.ConnectionStateManager     : State change: SUSPENDED
2019-06-02 08:51:00.476  INFO 26127 --- [127.0.0.1:2181)] org.apache.zookeeper.ClientCnxn          : Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2019-06-02 08:51:00.476  INFO 26127 --- [127.0.0.1:2181)] org.apache.zookeeper.ClientCnxn          : Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating session
2019-06-02 08:51:00.477  INFO 26127 --- [127.0.0.1:2181)] org.apache.zookeeper.ClientCnxn          : Unable to reconnect to ZooKeeper service, session 0x16b159ee79c001c has expired, closing socket connection
2019-06-02 08:51:00.477  WARN 26127 --- [ain-EventThread] org.apache.curator.ConnectionState       : Session expired event received
2019-06-02 08:51:00.478  INFO 26127 --- [ain-EventThread] org.apache.zookeeper.ZooKeeper           : Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=6000 watcher=org.apache.curator.ConnectionState@798cb6d9
2019-06-02 08:51:00.478  INFO 26127 --- [ain-EventThread] o.a.c.f.state.ConnectionStateManager     : State change: LOST
2019-06-02 08:51:00.478  INFO 26127 --- [ain-EventThread] org.apache.zookeeper.ClientCnxn          : EventThread shut down
2019-06-02 08:51:00.478  INFO 26127 --- [127.0.0.1:2181)] org.apache.zookeeper.ClientCnxn          : Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2019-06-02 08:51:00.479  INFO 26127 --- [127.0.0.1:2181)] org.apache.zookeeper.ClientCnxn          : Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating session
2019-06-02 08:51:00.491  INFO 26127 --- [127.0.0.1:2181)] org.apache.zookeeper.ClientCnxn          : Session establishment complete on server 127.0.0.1/127.0.0.1:2181, sessionid = 0x16b159ee79c0020, negotiated timeout = 6000
2019-06-02 08:51:00.491  INFO 26127 --- [ain-EventThread] o.a.c.f.state.ConnectionStateManager     : State change: RECONNECTED
2019-06-02 08:51:00.491  INFO 26127 --- [nStateManager-0] c.a.s.d.c.zookeeper.ZkCommandClient      : Reconnect to Zookeeper, ZkClientReconnectedEvent is called
2019-06-02 08:51:00.492  INFO 26127 --- [nStateManager-0] lientApplicationContextRefreshedListener : Recover sofa dashboard client register.
2019-06-02 08:51:00.505  INFO 26127 --- [nStateManager-0] lientApplicationContextRefreshedListener : sofa dashboard client register success.
chpengzh commented 5 years ago

刚才发现这个patch并没有完全解决问题,会概率性出现如下问题导致恢复会话失败

2019-06-08 11:51:20.312  INFO 86367 --- [k-0-EventThread] o.a.c.f.state.ConnectionStateManager     : State change: RECONNECTED
2019-06-08 11:51:20.313  INFO 86367 --- [nStateManager-0] c.a.s.d.c.zookeeper.ZkCommandClient      : Reconnect to Zookeeper, ZkClientReconnectedEvent is called
2019-06-08 11:51:20.314  INFO 86367 --- [nStateManager-0] lientApplicationContextRefreshedListener : Recover sofa dashboard client register.
2019-06-08 11:51:20.314  INFO 86367 --- [nStateManager-0] lientApplicationContextRefreshedListener : Try to register application, status=UP
2019-06-08 11:51:20.344 ERROR 86367 --- [nStateManager-0] c.a.s.d.c.r.SofaDashboardClientRegister  : Failed to register application to zookeeper.

org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /apps/instance/sample-app-3/10.37.129.2:38083
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:720) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:703) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.9.1.jar:na]
    at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:699) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:477) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:467) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44) ~[curator-framework-2.9.1.jar:na]
    at com.alipay.sofa.dashboard.client.registration.SofaDashboardClientRegister.register(SofaDashboardClientRegister.java:75) ~[sofa-dashboard-client-1.0.0.jar:1.0.0]
    at com.alipay.sofa.dashboard.client.listener.SofaDashboardClientApplicationContextRefreshedListener.doRegister(SofaDashboardClientApplicationContextRefreshedListener.java:71) [sofa-dashboard-client-1.0.0.jar:1.0.0]
    at com.alipay.sofa.dashboard.client.listener.SofaDashboardClientApplicationContextRefreshedListener.onEvent(SofaDashboardClientApplicationContextRefreshedListener.java:60) [sofa-dashboard-client-1.0.0.jar:1.0.0]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_151]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_151]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_151]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_151]
    at org.springframework.context.event.ApplicationListenerMethodAdapter.doInvoke(ApplicationListenerMethodAdapter.java:261) [spring-context-5.0.7.RELEASE.jar:5.0.7.RELEASE]
    at org.springframework.context.event.ApplicationListenerMethodAdapter.processEvent(ApplicationListenerMethodAdapter.java:180) [spring-context-5.0.7.RELEASE.jar:5.0.7.RELEASE]
    at org.springframework.context.event.ApplicationListenerMethodAdapter.onApplicationEvent(ApplicationListenerMethodAdapter.java:142) [spring-context-5.0.7.RELEASE.jar:5.0.7.RELEASE]
    at org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(SimpleApplicationEventMulticaster.java:172) [spring-context-5.0.7.RELEASE.jar:5.0.7.RELEASE]
    at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:165) [spring-context-5.0.7.RELEASE.jar:5.0.7.RELEASE]
    at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:139) [spring-context-5.0.7.RELEASE.jar:5.0.7.RELEASE]
    at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:400) [spring-context-5.0.7.RELEASE.jar:5.0.7.RELEASE]
    at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:367) [spring-context-5.0.7.RELEASE.jar:5.0.7.RELEASE]
    at com.alipay.sofa.dashboard.client.zookeeper.ZkCommandClient.lambda$afterPropertiesSet$0(ZkCommandClient.java:68) [sofa-dashboard-client-1.0.0.jar:1.0.0]
    at org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92) ~[curator-framework-2.9.1.jar:na]
    at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299) ~[guava-18.0.jar:na]
    at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:252) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43) ~[curator-framework-2.9.1.jar:na]
    at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111) ~[curator-framework-2.9.1.jar:na]
    at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) ~[na:1.8.0_151]
    at java.util.concurrent.FutureTask.run(FutureTask.java) ~[na:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]

原因在于在恢复会话那一刻,有可能原先的会话节点还存在,导致恢复失败。 现在 app instance 在 zk 会话映射的规则是:

/apps/instance/<app_name>/<ip>:<port>

当连接状态恢复的时候客户端会尝试重新注册这个会话节点,但是此时旧的会话节点可能并没有失效导致恢复失败。

相反RPC那边因为添加了timestamp参数,所以恢复的时候总是注册的一个新的会话节点, 然后等待旧会话失效。

/sofa-rpc/<pkg_name>/providers?startTime=<timestamp>&...
/sofa-rpc/<pkg_name>/consumers?startTime=<timestamp>&...

因此要彻底修复这个bug需要商量一下修改在zk上的映射规则,希望会话节点需要加上一个timestamp之类的信息

chpengzh commented 5 years ago

经过商量,这个PR暂时关闭,后续会跟随模块改造问题一同修复。