smallnest / rpcx

Best microservices framework in Go, like Alibaba Dubbo, but with more features. Scale easily. Try it. Test it. If you feel it's better, use it! Java has Dubbo, Golang has rpcx! Built for cloud!
https://rpcx.io

After the rpcx client reports an error, the custom selector's UpdateServer is never called again #795

Closed helisong427 closed 6 months ago

helisong427 commented 1 year ago

Dependencies:
github.com/rpcxio/rpcx-etcd v0.2.0
github.com/smallnest/rpcx v1.7.3
go.etcd.io/etcd/client/v3 v3.5.4
github.com/rpcxio/libkv v0.5.1-0.20210420120011-1fceaedca8a5

Client initialization code:

func InitClient(param *ClientParam) client.XClient { 
    // Enable heartbeat
    option := client.DefaultOption 
    option.Heartbeat = true 
    option.HeartbeatInterval = 3 * time.Second 
    option.MaxWaitForHeartbeat = 5 * time.Second 
    option.IdleTimeout = 5 * time.Second 
    d, _ := etcd_client.NewEtcdV3Discovery(param.BasePath, param.ServerBaseName.ToString(), param.EtcdAddr, true, &store.Config{ 
        ConnectionTimeout: 2 * time.Second, 
    }) 
    c := client.NewXClient(param.ServerBaseName.ToString(), client.Failtry, client.SelectByUser, d, option)
    // Install the custom selector
    c.SetSelector(param.Selector)
    return c
}

Problem: after the rpcx client logs the error below, it never calls the custom selector's UpdateServer again, so RPC requests cannot find the restarted server.

{"level":"warn","ts":"2023-03-29T11:34:09.621+0800","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x14002304000/192.168.18.78:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}

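From debugging, it looks like the client never reconnects after the etcd connection drops. Could you check whether my configuration parameters are wrong?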

smallnest commented 1 year ago

What do the rpcx-etcd logs show? It looks like the only explanation is that rpcx-etcd did not re-watch.

helisong427 commented 1 year ago

That one log line is all there is.

helisong427 commented 1 year ago

Debugging now shows a different problem: the updateServer function is called, but from some point on the content it receives is always empty (servers are in fact registered, and restarting the client makes it correct again). Debug log:

0329 13:19:34.992 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305","serviceId=115":"tcp@127.0.0.1:2200"}
0329 13:19:39.989 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305","serviceId=115":"tcp@127.0.0.1:2200"}
0329 13:19:44.991 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305","serviceId=115":"tcp@127.0.0.1:2200"}
0329 13:19:52.371 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305"}
0329 13:20:06.166 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305"}
0329 13:21:06.181 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305"}
0329 13:22:06.191 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305"}
0329 13:23:06.191 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305"}
0329 13:23:53.732 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305","serviceId=115":"tcp@127.0.0.1:2200"}
0329 13:23:53.778 INF ========> updateServer S=api_demo_1 service={"serviceId=1":"tcp@192.168.18.77:19305","serviceId=115":"tcp@127.0.0.1:2200"}
0329 13:23:58.685 INF ========> updateServer S=api_demo_1 service={}
0329 13:24:03.676 INF ========> updateServer S=api_demo_1 service={}
0329 13:24:06.156 INF ========> updateServer S=api_demo_1 service={}
0329 13:24:08.681 INF ========> updateServer S=api_demo_1 service={}
0329 13:24:13.679 INF ========> updateServer S=api_demo_1 service={}

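The problem I described at the start may have been caused by my breakpoint debugging; without breakpoints it has not recurred. However, after the client runs for a while, updateServer starts receiving empty data, and this reproduces every time.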

smallnest commented 1 year ago

Is there something wrong with the etcd cluster?

helisong427 commented 1 year ago

etcd is running as a single node, and it printed no logs during the test (the logs it has are all from earlier).

helisong427 commented 1 year ago

After more debugging, a change to the etcd v3 discovery initialization fixes it. Change this:

d, _ := etcd_client.NewEtcdV3Discovery(param.BasePath, param.ServerBaseName.ToString(), param.EtcdAddr, true, &store.Config{
        ConnectionTimeout: 2 * time.Second,
    })

to

d, _ := etcd_client.NewEtcdV3Discovery(param.BasePath, param.ServerBaseName.ToString(), param.EtcdAddr, true, nil)

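However, the store.Config setting was originally added to fix excessive CPU usage on the etcd cluster nodes; at the time each node was at about 600% CPU. If we switch back to nil, the high CPU usage returns. I would appreciate any advice. Thanks.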

smallnest commented 1 year ago

What about a larger ConnectionTimeout, for example 1 minute?

Is the high load on the etcd cluster caused by rpcx reconnecting endlessly?

helisong427 commented 1 year ago

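I have basically pinned down the problem: our etcd is a single node on a server, and we also pass the store.Config{} parameter, which causes the following code to take effect: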

    if options != nil {
        s.timeout = options.ConnectionTimeout
        cfg.DialTimeout = options.ConnectionTimeout
        cfg.DialKeepAliveTimeout = options.ConnectionTimeout
        cfg.TLS = options.TLS
        cfg.Username = options.Username
        cfg.Password = options.Password

        cfg.AutoSyncInterval = EtcdConfigAutoSyncInterval
    }

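This assigns cfg.AutoSyncInterval, so the client periodically syncs its endpoint list from the etcd cluster members. Because the server is a single node, the sync returns the address 127.0.0.1:2379 (which, on my machine, is my local etcd), and the client starts using that address. My local etcd holds none of the registered services, so empty data is synced. If I stop my local etcd, the client instead reports that the connection to 127.0.0.1:2379 was refused.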
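The endpoint replacement described here can be modeled without an etcd server. The sketch below is a simplified stand-in, not the etcd client code: `member` and `syncEndpoints` are hypothetical names, and the advertised URL `http://127.0.0.1:2379` is an assumed value chosen to match the address reported above.

```go
package main

import "fmt"

// member is a simplified stand-in for an etcd cluster member; only the
// client URLs it advertises matter for this illustration.
type member struct {
	Name       string
	ClientURLs []string
}

// syncEndpoints models an endpoint sync: the configured endpoints are
// discarded and replaced by whatever URLs the members advertise.
// Keeping the configured list when nothing is advertised is a
// simplification of this sketch, not a claim about the real client.
func syncEndpoints(configured []string, members []member) []string {
	var eps []string
	for _, m := range members {
		eps = append(eps, m.ClientURLs...)
	}
	if len(eps) == 0 {
		return configured
	}
	return eps
}

func main() {
	configured := []string{"192.168.18.78:2379"}
	// Assumed: the single remote node advertises a loopback client URL.
	members := []member{{Name: "default", ClientURLs: []string{"http://127.0.0.1:2379"}}}

	fmt.Println("before sync:", configured)
	fmt.Println("after sync: ", syncEndpoints(configured, members))
}
```

After the sync, the only endpoint left is a loopback address, which the client resolves on its own machine. That matches both symptoms reported: empty data when a local etcd is running, and a refused connection when it is not.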

Suggestion: when store.Config{} is specified, do not assign cfg.AutoSyncInterval, because the library cannot know whether the etcd in use is a single node or a cluster.

dickens7 commented 6 months ago

The cfg.AutoSyncInterval setting has nothing to do with single node versus cluster; what matters is etcd's ETCD_ADVERTISE_CLIENT_URLS configuration.
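One way to act on this is to check whether the URLs an etcd node advertises are reachable from other machines. The sketch below uses only the Go standard library; `isLoopbackURL` is a hypothetical helper, and the example URLs are assumptions based on the addresses mentioned in this thread.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// isLoopbackURL reports whether an advertised client URL points at the
// local machine. Such a URL is useless to remote clients once they adopt it.
// The input must include a scheme, for example "http://127.0.0.1:2379".
func isLoopbackURL(raw string) (bool, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return false, err
	}
	host := u.Hostname()
	if host == "localhost" {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback(), nil
}

func main() {
	// Assumed example values for ETCD_ADVERTISE_CLIENT_URLS.
	for _, raw := range []string{"http://127.0.0.1:2379", "http://192.168.18.78:2379"} {
		loopback, err := isLoopbackURL(raw)
		if err != nil {
			fmt.Println(raw, "could not be parsed:", err)
			continue
		}
		fmt.Printf("%s loopback=%v\n", raw, loopback)
	}
}
```

If the check flags a loopback address, setting ETCD_ADVERTISE_CLIENT_URLS on the etcd node to an address that clients can route to is probably the change to try, rather than altering the client.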