hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul connection error on port 8300 #14464

Closed HalacliSeda closed 1 year ago

HalacliSeda commented 2 years ago

Hello,

I use Consul 1.13.1. I have two servers (as an example): 10.10.10.1 and 10.10.10.2, and I set up a Consul server on both. consul.json is the same on both:

{
  "bind_addr": "10.10.10.1",
  "client_addr": "0.0.0.0",
  "datacenter": "datacenter-01",
  "bootstrap_expect": 3,
  "data_dir": "/var/lib/consul",
  "encrypt": "",
  "disable_update_check": true,
  "server": true,
  "ui": true,
  "rejoin_after_leave": true,
  "retry_join": ["10.10.10.1","10.10.10.2","......."],
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "tokens": {
      "agent": ""
    }
  }
}

{
  "bind_addr": "10.10.10.2",
  "client_addr": "0.0.0.0",
  "datacenter": "datacenter-01",
  "bootstrap_expect": 3,
  "data_dir": "/var/lib/consul",
  "encrypt": "",
  "disable_update_check": true,
  "server": true,
  "ui": true,
  "rejoin_after_leave": true,
  "retry_join": ["10.10.10.1","10.10.10.2","......."],
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "tokens": {
      "agent": ""
    }
  }
}

consul members output looks like this:

Node  Address          Status  Type    Build   Protocol  DC             Partition  Segment
ha1   10.10.10.1:8301  alive   server  1.13.1  2         datacenter-01  default
ha2   10.10.10.2:8301  alive   server  1.13.1  2         datacenter-01  default

But I get an error on both servers like this:

[WARN] agent: [core]grpc: addrConn.createTransport failed to connect to {ha1:8300 ha1.compute.internal 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 10.10.10.2:0->10.10.10.1:8300: operation was canceled". Reconnecting...

Port 8300 is used by the Consul service on both servers. I checked the ports with telnet and there is no problem:

telnet 10.10.10.1 8300
Trying 10.10.10.1...
Connected to 10.10.10.1.
Escape character is '^]'.

I did not get this error with Consul 1.12.1. Is this a bug in Consul 1.13.1?

Thanks, Seda

Serg2294 commented 2 years ago

I have the same issue.

Din-He commented 2 years ago

I ran into the same problem when deploying a Consul cluster on k8s with helm.

obourdon commented 2 years ago

Same here with consul 1.11.4

Seen in the changelog:

- Fixed in 1.12.1: rpc: Adds a deadline to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions. [GH-8504] [GH-11500]
- Fixed in 1.12.3: deps: Update go-grpc/grpc, resolving connection memory leak [GH-13051]

Not sure though that these are related

obourdon commented 2 years ago

Just to be more complete on this: after seeing a lot of these errors, even a call to localhost:8500 just fails.

obourdon commented 2 years ago

Upgrading to 1.12.3 did not help; same errors across the cluster of consul servers + clients.

obourdon commented 2 years ago

On my side, and contrary to @HalacliSeda, 1.12.1 does also seem to have the issue, even if less frequently.

obourdon commented 2 years ago

Also tried latest 1.13.2 with same results :-(

quinndiggitypolymath commented 2 years ago

yeah, seeing these as well, intermittently - as @obourdon mentioned, this seems to be related to the timeouts/aborts that were recently added; my prior clusters don't experience these disconnects

all in all, the functionality of the clusters logging these messages isn't otherwise affected, so this seems to be due to overly aggressive timeouts - there was a recent refactor around rpc timeouts + the addition of limits.rpc_client_timeout (defaulting to 60s): https://github.com/hashicorp/consul/pull/14965

hopefully easing the timeouts resolves these errors

Din-He commented 2 years ago

@quinndiggitypolymath If I understand you correctly, this [WARN] message does not affect normal use of the cluster, right?

obourdon commented 2 years ago

@quinndiggitypolymath many thanks for this very valuable info.

However, there are cases where, after quite a while, it seems that even accessing port 8500 locally just fails, as mentioned here.

Furthermore, this does not seem "recent", as the list of impacted versions seems to prove.

Could you please explain in more detail what you meant by "easing the timeouts resolves these errors"? Is there some configuration we can set to avoid these errors, like increasing limits.rpc_client_timeout to 120 or 180 seconds? What would be the (other) impact(s)/risk(s) of doing so?

Many thanks again

quinndiggitypolymath commented 2 years ago

@Din-He, at least this particular message, "operation was canceled", on its own doesn't seem to indicate a specific problem (to me); I am seeing the same message being logged and still have functional clusters (in terms of service resolution/mesh network traffic flow/key-value/distributed locks, etc.) - for me, nomad is still able to schedule services, those services are functioning correctly, vault is operational, etc.

@obourdon, that sounds like the messages may be a symptom of another issue (or multiple issues) - consul has a lot of areas where things can break if not configured exactly right, and the logging could be better in some spots when debugging. Without knowing what your configuration is like, I would recommend the following (a couple of these checks are sketched right after this list):

- Adjust the logging level https://developer.hashicorp.com/consul/docs/agent/config/config-files#log_level to debug (or trace, if you need more verbosity; remember to return to info or warn, as the log volume can be enormous) and see if that shakes out any specific errors.
- Double check that the process isn't being restarted/stopped/stalling/crashing under whatever means it is being run.
- Ensure the underlying storage volume has enough throughput/IOPS, and that all required traffic can be sent/received through the network https://developer.hashicorp.com/consul/docs/install/ports
- If you are utilizing containers, ensure that consul isn't listening only on 127.0.0.1 (unless you have an arrangement set up to make that work through DNAT, etc.).
- Check that the cluster is healthy https://developer.hashicorp.com/consul/api-docs/status#get-raft-leader and recover if not https://learn.hashicorp.com/tutorials/consul/recovery-outage
- If you are (hopefully you are) using encryption https://developer.hashicorp.com/consul/docs/agent/config/config-files#encrypt and mTLS https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_verify_outgoing (or just TLS https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_cert_file ) for everything, ensure that your certificate chains are proper and pass verification https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_verify_incoming
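For example, the log-level bump is just an agent config change, and the cluster-health check can be done against the local HTTP API. A rough sketch, assuming the agent's HTTP API is on the default 127.0.0.1:8500:

# raise agent log verbosity while debugging by setting "log_level": "debug"
# in the agent configuration (revert to "info" afterwards), then restart the agent

# ask the local agent which server currently holds Raft leadership;
# an empty reply means the cluster has no leader
curl http://127.0.0.1:8500/v1/status/leader

# list the Raft peers this agent currently knows about
curl http://127.0.0.1:8500/v1/status/peers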

Furthermore, this does not seem "recent", as the list of impacted versions seems to prove

Hashicorp (I am not affiliated) supports the last 2 releases of consul https://support.hashicorp.com/hc/en-us/articles/360021185113-Support-Period-and-End-of-Life-EOL-Policy so I've been on 1.11 due to Connect CA changes that broke my arrangement (requiring federated clusters to share a common Vault cluster is bunk https://developer.hashicorp.com/consul/docs/connect/ca/vault#rootpkipath but Cluster Peering https://developer.hashicorp.com/consul/docs/connect/cluster-peering effectively replaces the per-datacenter Vault arrangement + is how I would have preferred Connect to work in the first place :smiling_face_with_tear: ). I have not seen this error on 1.11 or before, but as the 1.14 beta is out, I will need to be on 1.12 or above soon; I am nearly done moving fully to 1.13 (or 1.14 to utilize the Peering Service Mesh setup, once that matures a little more), so I haven't as thoroughly evaluated this particular error with a production workload on the versions in between.

what you meant by easing the timeouts resolves these errors

Essentially, if the limit is being hit, slightly increase that limit (test/record metrics before + after); if you have a particularly slow request, where hitting 60s causes it to abort sometimes but it needs only ~5s more (for whatever reason, say running on an ARM device with slow storage), a jump to 90s might handle that. Going overboard with that can be bad; 60s is the status quo without overriding.
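As a rough illustration (not an official recommendation), raising that limit is an agent configuration change, assuming a build that includes the limits.rpc_client_timeout option from the PR linked above; the 90s value here is just an example:

{
  "limits": {
    "rpc_client_timeout": "90s"
  }
}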

What would be the (other) impact(s)/risk(s) of doing so ?

Increased resource usage, more sockets, more memory, more load, etc; under failure modes it could have cascading effects, all those sorts of things, on top of it taking longer to know something is wrong (if the request won't actually ever succeed, failing faster would allow retries + potentially freeing up resources). As with any change, measure before and after, and refine; if it needs 65s, 90s is overkill in that scenario, so reduce and measure again

Din-He commented 2 years ago

@quinndiggitypolymath thank you very much! Hahaha

Din-He commented 2 years ago

I'd like to ask everyone a new question. I deploy consul on k8s using helm (for convenience, the k8s cluster has only one master node), and in helm's value.yaml file I enabled consul's ACLs:

gossipEncryption:
  autoGenerate: true
acls:
  manageSystemACLs: true

The deployment works and the pods in k8s all look normal. It automatically generated some tokens (screenshot omitted). Then, in a Spring Boot application, I used the global-management token to register a microservice with the consul I deployed, and it reported an error:

token with AccessorID '00000000-0000-0000-0000-000000000002' lacks permission 'service:write' on "demo20221017"

Here demo20221017 is my service name. The message seems to say that the token with AccessorID ...002 lacks write permission, but I am not using that token - I am using the global-management token. I don't know why this happens. Does anyone know what is going on? Thanks for any answers.

quinndiggitypolymath commented 2 years ago

@Din-He, token with AccessorID '00000000-0000-0000-0000-000000000002' is the default anonymous token (https://developer.hashicorp.com/consul/docs/security/acl/acl-tokens#anonymous-token), meaning the node itself is trying to service:write on demo20221017 without a token being provided; double check that you are setting the tokens described here: https://developer.hashicorp.com/consul/docs/agent/config/config-files#acl_tokens
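For illustration, the agent-side token settings referenced above live under acl.tokens in the agent configuration; the values below are placeholders, not real tokens:

{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "<token the agent uses for its own internal operations>",
      "default": "<token used for requests that don't supply one>"
    }
  }
}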

You will need a token for the node, and a policy attached to it; your policy may look along the lines of:

service "demo20221017" {
  policy = "write"
}
service "demo20221017-sidecar-proxy" {
  policy = "write"
}
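One possible way to wire that up with the consul CLI is sketched below; the policy name, token description, and file path are made up for this example, and the commands need to be run with a token that has ACL management rights (e.g. exported via CONSUL_HTTP_TOKEN):

# save the rules above as demo20221017-policy.hcl, then create the policy
consul acl policy create -name "demo20221017-write" -rules @demo20221017-policy.hcl

# create a token carrying that policy and configure it on whatever registers the service
consul acl token create -description "demo20221017 registration" -policy-name "demo20221017-write"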

but refer to the ACL documentation for specifics.

ncode commented 1 year ago

I'm also having the same issue with 1.14.2. I'm playing with the rpc_client_timeout, but no luck so far.

jkirschner-hashicorp commented 1 year ago

so I've been on 1.11 due to Connect CA changes that broke my arrangement (requiring federated clusters to share a common Vault cluster is bunk)

@quinndiggitypolymath : Can you share more about the Connect CA change that made your 1 Vault cluster : 1 Consul cluster setup stop working? I had thought that WAN federated Consul clusters could use different Vault clusters. And if they can't in your experience, that's something I'm interested in following up on. It would be preferable from a latency and resilience perspective to have a Vault cluster in the same region as the Consul cluster it acts as the Connect CA for.

obourdon commented 1 year ago

is this somehow related to issue #10603 ???

obourdon commented 1 year ago

Seems like migrating to consul 1.14.4 fixes this issue on my side

tunguyen9889 commented 1 year ago

Seems like migrating to consul 1.14.4 fixes this issue on my side

Yes, I confirmed 1.14.4 fixed this warning message.

obourdon commented 1 year ago

In fact, after 1 night of operations, it is drastically reduced but still present. It went down from 100-150 occurrences/hour to 1 or 2 every 2-3 hours (the previously installed version was 1.14.3).

david-yu commented 1 year ago

Thanks @obourdon. This does seem to be a dupe of #10603, which was just closed. Please note that this still occurs at agent startup, which is why you likely still see the issue; that is tracked here: https://github.com/hashicorp/consul/issues/15821. I'll go ahead and close this issue as there is now a separate issue tracking the agent startup WARN logs.