sofastack / sofa-registry

SOFARegistry is a production-level, low-latency, high-availability service registry powered by Ant Financial.
https://www.sofastack.tech/sofa-registry/docs/Home
Apache License 2.0
650 stars 244 forks

When deploying the registry (SOFARegistry) on the server, port 9622 opens successfully but the health check fails #156

Open Anlet opened 3 years ago

Anlet commented 3 years ago

Describe the bug

When the registry is deployed on the server with SOFARegistry v5.4.2, port 9622 opens successfully, but its health check fails.

Expected behavior

Check the health check endpoint of the meta role:

$ curl http://localhost:9615/health/check
{"success":true,"message":"... raftStatus:Leader"}

Check the health check endpoint of the data role:

$ curl http://localhost:9622/health/check
{"success":true,"message":"... status:WORKING"}

Check the health check endpoint of the session role:

$ curl http://localhost:9603/health/check
{"success":true,"message":"..."}

Actual behavior

Check the health check endpoint of the meta role:

$ curl http://localhost:9615/health/check
{"success":true,"message":"... raftStatus:Leader"}

Check the health check endpoint of the data role:

$ curl http://localhost:9622/health/check
{"success":false,"message":"DataServerBoot severForSession:true, severForDataSync:true, httpServer:true, schedulerStarted:true, status:INITIAL"}

Check the health check endpoint of the session role:

$ curl http://localhost:9603/health/check
curl: (7) Failed to connect to localhost port 9603: Connection refused

Steps to reproduce

  1. Download the installation package: https://github.com/sofastack/sofa-registry/releases/download/v5.4.2/registry-integration-fix.tgz
  2. Create a folder and extract the package into it:
     mkdir registry-integration
     tar -zxvf registry-integration-fix.tgz -C registry-integration
     cd registry-integration
  3. Start registry-integration:
     cd registry-integration/
     sh bin/startup.sh
  4. Check the three endpoints: one request succeeds, one fails the health check, and one is refused.

Minimal yet complete reproducer code (or GitHub URL to code)

Environment

Anlet commented 3 years ago

Error log output:

Command: java -Dregistry.integration.home=/usr/local/software/registry-integration -Dspring.config.location=/usr/local/software/registry-integration/conf/application.properties -Duser.home=/usr/local/software/registry-integration -server -Xms512m -Xmx512m -Xmn256m -Xss256k -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/usr/local/software/registry-integration/logs/registry-integration-gc.log -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/software/registry-integration/logs -XX:ErrorFile=/usr/local/software/registry-integration/logs/registry-integration-hs_err_pid%p.log -XX:-OmitStackTraceInFastThrow -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=4 -XX:+CMSClassUnloadingEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -jar /usr/local/software/registry-integration/registry-integration.jar --logging.config=/usr/local/software/registry-integration/conf/logback-spring.xml
Sofa-Middleware-Log SLF4J : Actual binding is of type [ com.alipay.remoting Logback ]
[2021-04-01 17:49:44,219][INFO][main][MetaServerBootstrap] - the configuration items are as follows: com.alipay.sofa.registry.server.meta.bootstrap.MetaServerConfigBean@7eac9008[
    sessionServerPort=9610 dataServerPort=9611 metaServerPort=9612 httpServerPort=9615 schedulerHeartbeatTimeout=3 schedulerHeartbeatFirstDelay=3 schedulerHeartbeatExpBackOffBound=10 schedulerGetDataChangeTimeout=5 schedulerGetDataChangeFirstDelay=5 schedulerGetDataChangeExpBackOffBound=5 schedulerConnectMetaServerTimeout=3 schedulerConnectMetaServerFirstDelay=3 schedulerConnectMetaServerExpBackOffBound=10 schedulerCheckNodeListChangePushTimeout=3 schedulerCheckNodeListChangePushFirstDelay=1 schedulerCheckNodeListChangePushExpBackOffBound=10 dataNodeExchangeTimeout=3000 sessionNodeExchangeTimeout=3000 metaNodeExchangeTimeout=3000 dataCenterChangeNotifyTaskRetryTimes=3 dataNodeChangePushTaskRetryTimes=1 getDataCenterChangeListTaskRetryTimes=3 receiveStatusConfirmNotifyTaskRetryTimes=3 sessionNodeChangePushTaskRetryTimes=3 enableMetrics=true decisionMode= raftDataPath=/usr/local/software/registry-integration/raftData rockDBCacheSize=64 heartbeatCheckExecutorMinSize=3 heartbeatCheckExecutorMaxSize=10 heartbeatCheckExecutorQueueSize=1024 checkDataChangeExecutorMinSize=3 checkDataChangeExecutorMaxSize=10 checkDataChangeExecutorQueueSize=1024 getOtherDataCenterChangeExecutorMinSize=3 getOtherDataCenterChangeExecutorMaxSize=10 getOtherDataCenterChangeExecutorQueueSize=1024 connectMetaServerExecutorMinSize=3 connectMetaServerExecutorMaxSize=10 connectMetaServerExecutorQueueSize=1024 checkNodeListChangePushExecutorMinSize=3 checkNodeListChangePushExecutorMaxSize=10 checkNodeListChangePushExecutorQueueSize=1024 raftClientRefreshExecutorMinSize=3 raftClientRefreshExecutorMaxSize=10 raftClientRefreshExecutorQueueSize=1024 metaSchedulerPoolSize=6
]
[2021-04-01 17:49:44,312][INFO][main][MetaServerBootstrap] - Open session node register server port 9610 success!
[2021-04-01 17:49:44,319][INFO][main][MetaServerBootstrap] - Open data node register server port 9611 success!
[2021-04-01 17:49:44,321][INFO][main][MetaServerBootstrap] - Open meta server port 9612 success!
[2021-04-01 17:49:45,148][INFO][main][MetaServerBootstrap] - Open http server port 9615 success!
[2021-04-01 17:49:45,518][INFO][main][MetaServerBootstrap] - Raft server port 9614 start success!group RegistryGroup
[2021-04-01 17:49:45,518][INFO][main][MetaServerBootstrap] - Raft client connect success!
[2021-04-01 17:49:45,521][INFO][main][MetaServerBootstrap] - Raft start CliService success!
[2021-04-01 17:49:45,522][INFO][main][MetaServerInitializerConfiguration] - Started MetaServer
[2021-04-01 17:49:46,406][INFO][main][RegistryApplication] - localhost:9615 health check success.
[2021-04-01 17:49:47,534][INFO][main][DataServerBootstrap] - begin start server
[2021-04-01 17:49:47,535][INFO][main][DataServerBootstrap] - the configuration items are as follows: com.alipay.sofa.registry.server.data.bootstrap.DataServerConfig@5b1f29fa[
    port=9620 syncDataPort=9621 metaServerPort=9611 httpServerPort=9622 queueCount=4 queueSize=10240 notifyIntervalMs=500 clientOffDelayMs=0 notifyTempDataIntervalMs=0 rpcTimeout=3000 commonConfig=com.alipay.sofa.registry.server.data.bootstrap.CommonConfig@aeab9a1 metaIps= storeNodes=3 numberOfReplicas=1000 localDataServerCleanDelay=1800000 getDataExecutorMinPoolSize=80 getDataExecutorMaxPoolSize=400 getDataExecutorQueueSize=10000 getDataExecutorKeepAliveTime=60 notifyDataSyncExecutorMinPoolSize=80 notifyDataSyncExecutorMaxPoolSize=400 notifyDataSyncExecutorQueueSize=700 notifyDataSyncExecutorKeepAliveTime=60 notifySessionRetryFirstDelay=3000 notifySessionRetryIncrementDelay=3000 notifySessionRetryTimes=5 publishExecutorMinPoolSize=200 publishExecutorMaxPoolSize=400 publishExecutorQueueSize=10000 renewDatumExecutorMinPoolSize=100 renewDatumExecutorMaxPoolSize=400 renewDatumExecutorQueueSize=100000 datumTimeToLiveSec=900 datumLeaseManagerExecutorThreadSize=1 datumLeaseManagerExecutorQueueSize=1000000 sessionServerNotifierRetryExecutorThreadSize=10 sessionServerNotifierRetryExecutorQueueSize=10000 renewEnableDelaySec=30 dataSyncDelayTimeout=1000 dataSyncNotifyRetry=3
]
[2021-04-01 17:49:47,545][INFO][main][DataServerBootstrap] - Data server for session started! port:9620
[2021-04-01 17:49:47,548][INFO][main][DataServerBootstrap] - Data server for sync started! port:9621
[2021-04-01 17:49:47,640][INFO][main][DataServerBootstrap] - Open http server port 9622 success!
[2021-04-01 17:49:47,782][INFO][main][DataServerBootstrap] - raft client started!Leader is 172.20.0.1:9614
[2021-04-01 17:49:47,788][INFO][main][DataServerBootstrap] - Fetch enableDataDatumExpire but no data existed, current config not change!
[2021-04-01 17:49:47,794][INFO][main][DataServerBootstrap] - start server success
[2021-04-01 17:49:47,829][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
javax.ws.rs.InternalServerErrorException: HTTP 500 Internal Server Error
    at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1098)
    at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:883)
    at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:767)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:229)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:414)
    at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:765)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:428)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:324)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.nodeHealthCheck(RegistryApplication.java:134)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.waitClusterStart(RegistryApplication.java:119)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.main(RegistryApplication.java:80)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:109)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
    at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)
[2021-04-01 17:49:47,830][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
[2021-04-01 17:49:48,840][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
javax.ws.rs.InternalServerErrorException: HTTP 500 Internal Server Error
    at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1098)
    at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:883)
    at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:767)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:229)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:414)
    at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:765)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:428)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:324)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.nodeHealthCheck(RegistryApplication.java:134)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.waitClusterStart(RegistryApplication.java:119)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.main(RegistryApplication.java:80)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:109)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
    at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)
[2021-04-01 17:49:48,841][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
[2021-04-01 17:49:49,846][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
javax.ws.rs.InternalServerErrorException: HTTP 500 Internal Server Error
    at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1098)
    at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:883)
    at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:767)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:229)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:414)
    at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:765)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:428)
    at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:324)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.nodeHealthCheck(RegistryApplication.java:134)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.waitClusterStart(RegistryApplication.java:119)
    at com.alipay.sofa.registry.server.integration.RegistryApplication.main(RegistryApplication.java:80)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:109)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
    at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)
[2021-04-01 17:49:49,847][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
[2021-04-01 17:49:50,853][ERROR][main][RegistryApplication] - localhost:9622 health check failed.
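The log above shows the launcher probing localhost:9622 roughly once per second and logging a failure each time. As a minimal sketch of that kind of bounded retry loop (this is not the actual RegistryApplication.waitClusterStart code; the class name, method name, and parameters are all hypothetical), something like the following could be used:

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of SOFARegistry: polls a health probe a bounded number of times. */
final class HealthCheckRetry {

    private HealthCheckRetry() {}

    /**
     * Calls the probe up to maxAttempts times, sleeping intervalMillis between failed attempts.
     * Returns the 1-based number of the attempt that succeeded, or -1 if every attempt failed.
     */
    static int waitUntilHealthy(BooleanSupplier probe, int maxAttempts, long intervalMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean healthy;
            try {
                healthy = probe.getAsBoolean();
            } catch (RuntimeException e) {
                // A thrown error (for example an HTTP 500 surfaced as an exception) counts as "not healthy yet".
                healthy = false;
            }
            if (healthy) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in probe that only reports healthy from the third call onward.
        int attempt = waitUntilHealthy(() -> ++calls[0] >= 3, 5, 10L);
        System.out.println("healthy after attempt " + attempt);
    }
}
```

In a real check, the probe would issue the HTTP request to /health/check and return the value of the "success" field.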

Anlet commented 3 years ago

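After reading the source code, searching online, and a lot of trial and error, I added a network interface setting to the SOFARegistry startup configuration (JAVA_OPTS="$JAVA_OPTS -Dnetwork_interface_binding=eth0"), and the registry finally started normally, with the health checks on all three ports passing. I happily filled in the remote registry's address in my local project, but when I started the project and opened the URL, it reported an error: the service could not be obtained. Checking the log (common-error.log), I found that my requests kept going to the server's internal network address. The local project's configuration file was indeed correct, which was puzzling. After asking around, I learned that the registry and the client must be on the same network segment. In other words, deploying the registry (SOFARegistry) on a remote server and pointing a local project at that remote registry's address does not currently work (as of v5.4.2).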

NickNYU commented 3 years ago

Good question, thanks to @Anlet

Let me explain the issue so that others can understand it:

Q1: The registry fails to start due to an address issue.
A: When a single machine has multiple NICs, SOFARegistry provides a strategy that lets users apply their own knowledge to select one NIC's IP address as the main address. So adding a system property, either with java -Dnetwork_interface_binding=eth0 or with System.setProperty() in Java code, is an effective way to deal with this problem.
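To make the idea concrete, here is a small sketch of how a NIC name supplied through that system property could be mapped to an IPv4 address. Only the property name network_interface_binding comes from this thread; the class and method names are invented for illustration, and SOFARegistry's own address-selection code may well differ.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;
import java.util.Optional;

/** Illustrative only; SOFARegistry's own address-selection code may differ. */
final class NicBindingSketch {

    // Property name taken from the thread above.
    static final String BINDING_PROPERTY = "network_interface_binding";

    private NicBindingSketch() {}

    /** Returns the first IPv4 address of the NIC named by the system property, if there is one. */
    static Optional<String> boundIpv4Address() {
        String nicName = System.getProperty(BINDING_PROPERTY);
        if (nicName == null || nicName.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            NetworkInterface nic = NetworkInterface.getByName(nicName.trim());
            if (nic == null) {
                // No interface with that name exists on this machine.
                return Optional.empty();
            }
            for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                if (address instanceof Inet4Address) {
                    return Optional.of(address.getHostAddress());
                }
            }
            return Optional.empty();
        } catch (SocketException e) {
            // The interface could not be queried; treat it as "no address".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Equivalent to passing -Dnetwork_interface_binding=eth0 on the java command line.
        System.setProperty(BINDING_PROPERTY, "eth0");
        System.out.println(boundIpv4Address().orElse("no IPv4 address found for the configured NIC"));
    }
}
```

Running this on the target machine is a quick way to confirm which address a given NIC name resolves to before editing the startup script.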

Q2: The registry client fails to connect to the registry center.
A: SOFARegistry is designed to work as follows:

  1. The client connects to a session server.
  2. Instead of publishing or subscribing to services immediately, the client retrieves the whole batch of session servers' addresses.
  3. The client picks one address as the target registry address to publish services to or subscribe to services from.

The problem lies in step 2: the session server reports a loopback address to the client as its own address, while the client sits outside the machine. For example, in a cloud environment the client is on machine ECS-1 while SOFARegistry runs in standalone mode on machine ECS-2 and uses 127.0.0.1 as the session server's address.
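A quick way to spot this situation is to test whether an advertised address is only meaningful on the server itself. The sketch below uses the standard java.net.InetAddress checks; the class and method names are hypothetical and are not part of any SOFARegistry API.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical check, not a SOFARegistry API. */
final class AdvertisedAddressCheck {

    private AdvertisedAddressCheck() {}

    /** Returns true if a client on another machine could not use this IP literal to reach the server. */
    static boolean isLocalOnly(String ipLiteral) {
        try {
            // For an IP literal, getByName only parses the text; a host name would trigger a lookup instead.
            InetAddress address = InetAddress.getByName(ipLiteral);
            return address.isLoopbackAddress() || address.isAnyLocalAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a usable address: " + ipLiteral, e);
        }
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"127.0.0.1", "0.0.0.0", "10.0.0.1"}) {
            System.out.println(candidate + " localOnly=" + isLocalOnly(candidate));
        }
    }
}
```

Note that this does not cover the case reported earlier in the thread, where the advertised address is a private network address that the client simply cannot route to.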

So, for Q2, we'd like to provide a mechanism that lets the registry client receive a well-defined address (through a -D parameter or a config file, either way is OK) as the registry center's address. In the previous case, @Anlet could declare, say, 10.0.0.1:9622,10.0.0.2:9622 as the registry session servers' address list.
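Since this mechanism is only a proposal at this point, the following is purely a sketch of what parsing such an address list might look like. The property name registry.session.addresses and the class name are made up for the example; only the comma-separated host:port format comes from the comment above.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the proposed client-side override; every name here is hypothetical. */
final class SessionAddressList {

    // Hypothetical property name, invented for this sketch.
    static final String PROPERTY = "registry.session.addresses";

    private SessionAddressList() {}

    /** Parses "host1:port1,host2:port2" into unresolved socket addresses, preserving order. */
    static List<InetSocketAddress> parse(String commaSeparated) {
        List<InetSocketAddress> result = new ArrayList<>();
        if (commaSeparated == null) {
            return result;
        }
        for (String entry : commaSeparated.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException("expected host:port but got '" + trimmed + "'");
            }
            String host = trimmed.substring(0, colon);
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            // createUnresolved avoids any DNS lookup while parsing.
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    public static void main(String[] args) {
        // Falls back to the example list from the thread when the property is not set.
        String configured = System.getProperty(PROPERTY, "10.0.0.1:9622,10.0.0.2:9622");
        for (InetSocketAddress address : parse(configured)) {
            System.out.println(address.getHostString() + ":" + address.getPort());
        }
    }
}
```

A client could then pick one entry from the parsed list instead of relying on the addresses the session servers report about themselves.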

NickNYU commented 3 years ago

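@Anlet I replied above so that others can also follow how the problem unfolded. The solution we currently propose is that the client can specify the address of the session server (that is, the entry point of the registry) through a -D parameter or a config file.

Would you be interested in working on this feature together? @Anlet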

Jiiiiiin commented 2 years ago

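I followed the method above (setting the network interface, which is the only thing I changed). The startup log shows no errors, but with the current version, version_5.4.5, health/check still reports a problem: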

Last login: Sun Oct 31 14:44:29 2021 from 10.0.2.2
[vagrant@localhost ~]$ netstat -anp|grep java
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp6       0      0 :::9610                 :::*                    LISTEN      5953/java
tcp6       0      0 :::9611                 :::*                    LISTEN      5953/java
tcp6       0      0 :::9612                 :::*                    LISTEN      5953/java
tcp6       0      0 :::9615                 :::*                    LISTEN      5953/java
tcp6       0      0 127.0.0.1:34864         127.0.0.1:9615          ESTABLISHED 5953/java
tcp6       0      0 127.0.0.1:9615          127.0.0.1:34864         ESTABLISHED 5953/java
unix  2      [ ]         STREAM     CONNECTED     33475    5953/java
unix  2      [ ]         STREAM     CONNECTED     33479    5953/java
[vagrant@localhost ~]$ curl http://localhost:9610/health/check
curl: (56) Recv failure: Connection reset by peer
[vagrant@localhost ~]$ curl http://localhost:9611/health/check
curl: (56) Recv failure: Connection reset by peer
[vagrant@localhost ~]$ curl http://localhost:9612/health/check
curl: (56) Recv failure: Connection reset by peer
[vagrant@localhost ~]$ curl http://localhost:9615/health/check
{"success":false,"message":"MetaServerBoot sessionRegisterServer:true, dataRegisterServerStart:true, otherMetaRegisterServerStart:true, httpServerStart:true, raftServerStart:false, raftClientStart:true, raftManagerStart:false, raftStatus:false"}[vagrant@localhost ~]$

The network interface information is as follows:

[vagrant@localhost ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
       valid_lft 81337sec preferred_lft 81337sec
    inet6 fe80::5054:ff:fe4d:77d3/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:a9:97:49 brd ff:ff:ff:ff:ff:ff
    inet 192.168.33.10/24 brd 192.168.33.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fea9:9749/64 scope link
       valid_lft forever preferred_lft forever

The change made to the startup script:

# set net
JAVA_OPTS="$JAVA_OPTS -Dnetwork_interface_binding=eth1"

eth1 is used because of the network configuration set up with Vagrant:


  # Create a private network, which allows host-only access to the machine
  # using a specific IP.
  config.vm.network "private_network", ip: "192.168.33.10"

I don't understand what this error means:


curl: (56) Recv failure: Connection reset by peer

nor why the health check returns false here:

[vagrant@localhost ~]$ curl http://localhost:9615/health/check
{"success":false,"message":"MetaServerBoot sessionRegisterServer:true, dataRegisterServerStart:true, otherMetaRegisterServerStart:true, httpServerStart:true, raftServerStart:false, raftClientStart:true, raftManagerStart:false, raftStatus:false"}

@NickNYU Could you please take a look?

Jiiiiiin commented 2 years ago

Startup log:

[vagrant@localhost ~]$ curl http://localhost:9615/health/check
{"success":false,"message":"MetaServerBoot sessionRegisterServer:true, dataRegisterServerStart:true, otherMetaRegisterServerStart:true, httpServerStart:true, raftServerStart:false, raftClientStart:true, raftManagerStart:false, raftStatus:false"}
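The health check messages quoted in this thread all follow the same "Name key:value, key:value" shape, so the failing components can be picked out mechanically. The sketch below assumes only that shape, as seen in the responses above; the class and method names are hypothetical, and it works on the already-extracted message string rather than the full JSON.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for reading the "message" field of a /health/check response. */
final class HealthMessage {

    private HealthMessage() {}

    /** Parses "Name key:value, key:value" into an ordered key -> value map (the leading name is skipped). */
    static Map<String, String> parse(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        String trimmed = message.trim();
        int firstSpace = trimmed.indexOf(' ');
        if (firstSpace < 0) {
            return fields;
        }
        for (String pair : trimmed.substring(firstSpace + 1).split(",")) {
            int colon = pair.indexOf(':');
            if (colon > 0) {
                fields.put(pair.substring(0, colon).trim(), pair.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    /** Returns the keys whose value is literally "false", in the order they appear. */
    static List<String> falseFields(String message) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : parse(message).entrySet()) {
            if ("false".equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String meta = "MetaServerBoot sessionRegisterServer:true, dataRegisterServerStart:true, "
                + "otherMetaRegisterServerStart:true, httpServerStart:true, raftServerStart:false, "
                + "raftClientStart:true, raftManagerStart:false, raftStatus:false";
        System.out.println(falseFields(meta));
    }
}
```

For the meta server message above, this singles out raftServerStart, raftManagerStart, and raftStatus. For the data server message from the original report, no field is false; there the difference from the expected response is status:INITIAL instead of status:WORKING.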