sofastack / sofa-rpc

SOFARPC is a high-performance, high-extensibility, production-level Java RPC framework.
https://www.sofastack.tech/sofa-rpc/docs/Home
Apache License 2.0
3.84k stars 1.18k forks source link

服务重启导致客户端无法访问服务 #270

Closed davis1214 closed 6 years ago

davis1214 commented 6 years ago

Describe the bug

服务(provider)非优雅方式重启(kill -9 $pid & startup)时,客户端再次调用,出现如下报错

ERROR 2018-08-17 10:12:22,439 http-nio-30001-exec-10 org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/].[dispatcherServlet]:181 - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is com.alipay.sofa.rpc.core.exception.SofaRouteException: RPC-02306: 没有获得服务[com.choice.balance.api.BalanceApi:1.0]的调用地址,请检查服务是否已经推送 ] with root cause com.alipay.sofa.rpc.core.exception.SofaRouteException: RPC-02306: 没有获得服务[com.lotoblock.balance.api.BalanceApi:1.0]的调用地址,请检查服务是否已经推送 at com.alipay.sofa.rpc.client.AbstractCluster.noAvailableProviderException(AbstractCluster.java:418) ~[sofa-rpc-all-5.4.3.jar!/:5.4.3]

Expected behavior

期望暴力重启情况下,服务依然可以被调用

Actual behavior

在一定时间内(20秒内),kill -9 后直接重启,出现服务不可调用

原因

1、服务端zookeeker上生成的节点为CreateMode.EPHEMERAL类型,session中断时,客户端的会监听到remove时间,并删除客户端的对失效服务的链接。 2、服务端内存中同时维护每个providerInfo的信息,且key是一个对象。客户端接收到新的链接(provider)时,从内存先查,如果没有才创建。 3、在客户端没有remove失效的provider前,根据配置信息,会先查询到服务关闭之前的链接。即会出现此现象

 //ProviderInfo类中,hash实现,没有区分是删除前或后的实例信息

 @Override
    public int hashCode() {
        int result = (protocolType != null ? protocolType.hashCode() : 0);
        result = 31 * result + (host != null ? host.hashCode() : 0);
        result = 31 * result + port;
        result = 31 * result + (path != null ? path.hashCode() : 0);
        result = 31 * result + (serializationType != null ? serializationType.hashCode() : 0);
        result = 31 * result + rpcVersion;
        // result = 31 * result + (staticAttrs != null ? staticAttrs.hashCode() : 0);
        return result;
    }

处理方式

1、服务优雅重启 2、修改代码,ProviderInfo,或者reconnect检查时,删除com.alipay.sofa.rpc.transport.bolt.BoltClientConnectionManager.urlConnectionMap中无效链接,并重新初始化。

Environment

lxl0044 commented 6 years ago

Hi @ujjboy , I am using the latest version (5.4.5), but when killing the server with kill -9, the client will still report the above error, it seems that this bug has not been fixed?

lxl0044 commented 6 years ago

sorry, add the client's error: ERROR 2018-09-10 14:17:13,895 http-nio-30001-exec-7 org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/].[dispatcherServlet]:181 - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is com.alipay.sofa.rpc.core.exception.SofaRouteException: RPC-02306: Cannot get the service address of service [com.abc.api.TestApi:1.0], please check the registry log. ] with root cause com.alipay.sofa.rpc.core.exception.SofaRouteException: RPC-02306: Cannot get the service address of service [com.abc.api.TestApi:1.0], please check the registry log. at com.alipay.sofa.rpc.client.AbstractCluster.noAvailableProviderException(AbstractCluster.java:418) ~[sofa-rpc-all-5.4.5.jar!/:5.4.5] at com.alipay.sofa.rpc.client.AbstractCluster.select(AbstractCluster.java:354) ~[sofa-rpc-all-5.4.5.jar!/:5.4.5] at com.alipay.sofa.rpc.client.FailoverCluster.doInvoke(FailoverCluster.java:64) ~[sofa-rpc-all-5.4.5.jar!/:5.4.5] at com.alipay.sofa.rpc.client.AbstractCluster.invoke(AbstractCluster.java:285) ~[sofa-rpc-all-5.4.5.jar!/:5.4.5] at com.alipay.sofa.rpc.client.ClientProxyInvoker.invoke(ClientProxyInvoker.java:83) ~[sofa-rpc-all-5.4.5.jar!/:5.4.5] at com.abc.api.TestApi_proxy_4.Test(TestApi_proxy_4.java) ~[?:?] at sun.reflect.GeneratedMethodAccessor149.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_71] at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_71] at com.alipay.sofa.runtime.service.binding.JvmBindingAdapter$JvmServiceInvoker.doInvoke(JvmBindingAdapter.java:183) ~[runtime-sofa-boot-starter-2.4.7.jar!/:2.4.7] at com.alipay.sofa.runtime.spi.service.ServiceProxy.invoke(ServiceProxy.java:39) ~[runtime-sofa-boot-starter-2.4.7.jar!/:2.4.7] at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) ~[spring-aop-4.3.4.RELEASE.jar!/:4.3.4.RELEASE] at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:213) ~[spring-aop-4.3.4.RELEASE.jar!/:4.3.4.RELEASE]

could u help me with it? tnks a lot. @ujjboy @lifesinger @popomore @khotyn

ujjboy commented 6 years ago

@lxl0044 you mean use zookeeper as registry, and kill -9 server and restart it ?

lxl0044 commented 6 years ago

@ujjboy thks for resp so fast. yes, is there a problem with this operation?

ujjboy commented 6 years ago

No, it's normal case. Can you provide your demo for reproduce this ? and more logs in ~/logs/rpc.

lxl0044 commented 6 years ago

@ujjboy Have u ever had this problem in u'r tests?

Sorry, there are some other things in my demo that are not very convenient to upload, but the logic is like this:

// client:

##### properties ##### com.alipay.sofa.rpc.registry-address=zookeeper://127.0.0.1:2181

##### Client.java ##### `public class Client { @SofaReference(binding = @SofaReferenceBinding(bindingType = "bolt", timeout = 20000)) private TestApi testApi;

public String demo() {
    return testApi.test();  
}

}`

##### pom.xml ##### `

com.abc
<artifactId>test-api</artifactId>
<version>0.0.1-SNAPSHOT</version>

`

// server:

##### TestApi.java ##### public interface TestApi { String test(); }

##### TestApiImpl.java ##### @SofaService(bindings = {@SofaServiceBinding(bindingType = "bolt", timeout = 20000)}) public class TestApiImpl implements TestApi { @Override public String test() { return "hello"; } }

============================================== Is this information enough? If not, I will try to extract the demo and upload it. thks a lot. :-)

ujjboy commented 6 years ago

@lxl0044 I need more logs in ~/logs/rpc, such as rpc-client.log & rpc-registry.log.

lxl0044 commented 6 years ago

@ujjboy hi, i upload my demo, link: https://github.com/lxl0044/test u can do like this: 1, run service; 2, run client, and call controller; 3, kill -9 ${service.pid} && restart it 4, continue calling client's controller, about some times later , it will happen;

any questions call me any time, thks a lot. :-)

lxl0044 commented 6 years ago

Hi @ujjboy @lifesinger @popomore @khotyn , Dear all, any progress? did it happen? Anything i could do? thks a lot. :-)

ujjboy commented 6 years ago

@lxl0044 I will follow this case. Can you upload the log file ~/rpc/rpc-registry.log?

liangyuanpeng commented 6 years ago

@lxl0044

I tried your code and nothing happened,

something like that: 1, run service; 2, run client, and call controller; 3, kill -9 ${service.pid} && restart it 4, continue calling client's controller, about some times later , it will happen;

I just change

com:
  alipay:
    sofa:
      rpc:
        registry:
          address: zookeeper://10.1.13.199:2181/xiao

to

com:
  alipay:
    sofa:
      rpc:
        registry:
          address: zookeeper://localhost:2181
ujjboy commented 6 years ago

After I review the code, maybe there is one possibility: the old provider(killed server) deleted event is after new provider (restarted server) added event.

@lxl0044 I need your log in ~/rpc/rpc-registry.log to confirm it.

lxl0044 commented 6 years ago

Hi @ujjboy , I push them to my test project's logs , u can see it and any needs ask me anytime, thks very much.

lxl0044 commented 6 years ago

Dear @liangyuanpeng ,

this is my detailed steps: 1, run service && run client; 2, open a new window, run a while true script, just like while true; do curl 'http://127.0.0.1:10001/client/test?param=1'; echo ; sleep 1; done 3, this script will running normal and print 1 every sec; 4, kill -9 ${service.pid}, and you can see the script will response error; 5, then rerun service; 6, you can see the script back to normal; 7, but not good for a long time, some times later , this script will response error again;

you can try like it, i'm not sure what's wrong, but really error, so sad..

ujjboy commented 6 years ago

@lxl0044 I read this log, and as I guess, the remove event is later than the add event. I will new issue and fix it.

2018-09-14 09:24:33,451 INFO  Curator-PathChildrenCache-2      com.alipay.sofa.rpc.registry.zk.ZookeeperConfigObserver - Receive add provider: path=[/xiao/sofa-rpc/com.xiao.test.api.SofaApi/providers/bolt%3A%2F%2F10.200.101.197%3A12205%3Fversion%3D1.0%26uniqueId%3D%26timeout%3D20000%26delay%3D-1%26id%3Drpc-cfg-0%26dynamic%3Dtrue%26weight%3D100%26accepts%3D100000%26startTime%3D1536888272713%26appName%3Dtest-sofa-service%26serialization%3Dhessian2%26pid%3D32634%26language%3Djava%26rpcVer%3D50405], data=[], stat=[133394535,133394535,1536888273175,1536888273175,0,0,0,72765356844127291,1,0,133394535
]
2018-09-14 09:24:52,428 INFO  Curator-PathChildrenCache-2      com.alipay.sofa.rpc.registry.zk.ZookeeperConfigObserver - Receive remove provider: path=[/xiao/sofa-rpc/com.xiao.test.api.SofaApi/providers/bolt%3A%2F%2F10.200.101.197%3A12205%3Fversion%3D1.0%26uniqueId%3D%26timeout%3D20000%26delay%3D-1%26id%3Drpc-cfg-0%26dynamic%3Dtrue%26weight%3D100%26accepts%3D100000%26startTime%3D1536888195935%26appName%3Dtest-sofa-service%26serialization%3Dhessian2%26pid%3D32318%26language%3Djava%26rpcVer%3D50405], data=[], stat=[133394000,133394000,1536888196390,1536888196390,0,0,0,72765356844127289,1,0,133394000
]
lxl0044 commented 6 years ago

@ujjboy Good, come on! Any progress call me, pls. Thks very much!! :-)