Closed HalacliSeda closed 1 year ago
I have the same issue.
我在k8s中使用helm部署consul集群时遇到相同的问题
Same here with consul 1.11.4
Seen in the changelog:
Fixed in 1.12.1: rpc: Adds a deadline to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions. [GH-8504] [GH-11500]
Fixed in 1.12.3: deps: Update go-grpc/grpc, resolving connection memory leak [GH-13051]
Not sure though that these are related
Just to be more complete on this: after seeing a lot of these errors, even a call to localhost:8500 just fails.
Upgrading to 1.12.3 did not help; the same errors appear across the cluster of consul servers + clients.
On my side, and contrary to @HalacliSeda, 1.12.1 also seems to have the issue, even if less frequently.
Also tried latest 1.13.2 with same results :-(
yeah, seeing these as well, intermittently - as @obourdon mentioned, this seems to be related to the timeouts/aborts that were recently added; my prior clusters don't experience these disconnects
all in all, the functionality of the clusters logging these messages isn't otherwise affected, so this seems to be due to overly aggressive timeouts - there was a recent refactor around rpc timeouts + the addition of limits.rpc_client_timeout (defaulting to 60s): https://github.com/hashicorp/consul/pull/14965
hopefully easing the timeouts resolves these errors
@quinndiggitypolymath If I understand you correctly, this [WARN] message does not affect normal use of the cluster, right?
@quinndiggitypolymath many thanks for this very valuable info.
However, there are cases where, after quite a while, even accessing port 8500 locally just fails, as mentioned here.
Furthermore this does not seem "recent", as the list of impacted versions seems to prove.
Could you please explain in more detail what you meant by "easing the timeouts resolves these errors"?
Is there some configuration we can set to avoid these errors, like increasing limits.rpc_client_timeout to 120 or 180 seconds?
What would be the (other) impact(s)/risk(s) of doing so?
Many thanks again
@Din-He, at least this particular message, "operation was canceled", on its own doesn't seem to indicate a specific problem (to me); I am seeing the same message being logged, and still have functional clusters (in terms of service resolution/mesh network traffic flow/key-value/distributed locks, etc) - for me, nomad is still able to schedule services, and those services are functioning correctly, vault is operational, etc
@obourdon, that sounds like the messages may be a symptom of another issue (or multiple issues) - consul has a lot of areas where things can break if not configured exactly right, and the logging could be better in some spots when debugging. Without knowing what your configuration is like, I would recommend adjusting the logging level https://developer.hashicorp.com/consul/docs/agent/config/config-files#log_level to debug (or trace, if you need more verbosity; remember to return to info or warn, as the log volume can be enormous) to see if that shakes out any specific errors. A few things to double check:
- The process isn't being restarted/stopped/stalling/crashing under whatever means it is being run, and the underlying storage volume has enough throughput/IOPS.
- All network traffic can be sent/received through the network https://developer.hashicorp.com/consul/docs/install/ports - if you are utilizing containers, ensure that consul isn't listening only on 127.0.0.1 (unless you have an arrangement set up to make that work through DNAT, etc).
- The cluster is healthy https://developer.hashicorp.com/consul/api-docs/status#get-raft-leader - recover if not https://learn.hashicorp.com/tutorials/consul/recovery-outage
- If you are (hopefully you are) using encryption https://developer.hashicorp.com/consul/docs/agent/config/config-files#encrypt and mTLS https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_verify_outgoing (or just TLS https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_cert_file ) for everything, ensure that your certificate chains are proper and pass verification https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_verify_incoming
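For illustration, a minimal sketch of a few of those settings in an HCL agent configuration file (the nested tls stanza assumes Consul 1.12+; the file paths and key value are placeholders, not taken from this thread):
# temporarily raise verbosity while debugging; return to "info" or "warn" afterwards
log_level = "debug"

# gossip encryption key (generate one with `consul keygen`)
encrypt = "REPLACE_WITH_GOSSIP_KEY"

# mTLS for agent communication, with certificate verification enabled
tls {
  defaults {
    ca_file         = "/etc/consul.d/tls/ca.pem"
    cert_file       = "/etc/consul.d/tls/agent.pem"
    key_file        = "/etc/consul.d/tls/agent-key.pem"
    verify_incoming = true
    verify_outgoing = true
  }
}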
Furthermore this does not seem "recent", as the list of impacted versions seems to prove.
Hashicorp (I am not affiliated) supports the last 2 releases of consul https://support.hashicorp.com/hc/en-us/articles/360021185113-Support-Period-and-End-of-Life-EOL-Policy so I've been on 1.11 due to Connect CA changes that broke my arrangement (requiring federated clusters to share a common Vault cluster is bunk https://developer.hashicorp.com/consul/docs/connect/ca/vault#rootpkipath but Cluster Peering https://developer.hashicorp.com/consul/docs/connect/cluster-peering effectively replaces the per-datacenter Vault arrangement + is how I would have preferred Connect work in the first place :smiling_face_with_tear: ). This error I have not seen on 1.11 or before, but as the 1.14 beta is out, I will need to be on 1.12 or above soon; I am nearly done moving fully to 1.13 (or 1.14 to utilize the Peering Service Mesh setup, once that matures a little more), so I haven't as thoroughly evaluated this particular error with a production workload on the versions in between.
what you meant by "easing the timeouts resolves these errors"
Essentially, if the limit is being hit, slightly increase that limit (test/record metrics before + after); if you have a particularly slow request, where hitting 60s causes it to abort sometimes but it needs only ~5s more (for whatever reason, say running on an ARM device with slow storage), a jump to 90s might handle that. Going overboard with that can be bad; 60s is the status quo without overriding.
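For reference, the override might look along these lines in an agent configuration file (an HCL sketch; it assumes a Consul version where limits.rpc_client_timeout is available, per the PR linked above):
limits {
  # default is 60s; increase in small steps and measure before/after
  rpc_client_timeout = "90s"
}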
What would be the (other) impact(s)/risk(s) of doing so?
Increased resource usage: more sockets, more memory, more load, etc; under failure modes it could have cascading effects, all those sorts of things, on top of it taking longer to know something is wrong (if the request won't actually ever succeed, failing faster would allow retries + potentially free up resources). As with any change, measure before and after, and refine; if it needs 65s, 90s is overkill in that scenario, so reduce and measure again.
@quinndiggitypolymath thank you very much! Hahaha
Let me ask a new question. I deployed consul on k8s with helm (for convenience, the k8s cluster has only one master node). In helm's value.yaml I enabled consul's ACLs:
gossipEncryption:
  autoGenerate: true
acls:
  manageSystemACLs: true
The deployment is fine and all the pods in k8s are healthy. It auto-generated some tokens, as shown in the screenshot below. Then, in a Spring Boot application, I used the global-management token to register a microservice with the consul I deployed, and it reported an error: token with AccessorID '00000000-0000-0000-0000-000000000002' lacks permission 'service:write' on "demo20221017", where demo20221017 is my service name. This message says the token with AccessorID ...002 lacks write permission, but I am not using that token at all; I am using the global-management token. I don't know what the reason is. Does anyone know what is going on here? Thanks for any answers.
@Din-He, token with AccessorID '00000000-0000-0000-0000-000000000002' is the default anonymous token ( https://developer.hashicorp.com/consul/docs/security/acl/acl-tokens#anonymous-token ), meaning the node itself is trying to service:write on demo20221017 without a token being provided; double check that you are setting: https://developer.hashicorp.com/consul/docs/agent/config/config-files#acl_tokens
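For illustration, a minimal sketch of how those tokens might be supplied in the agent configuration (HCL; the token values are placeholders, not real tokens from this thread):
acl {
  enabled = true
  tokens {
    # token the agent uses for its own internal operations (node registration, etc)
    agent   = "REPLACE_WITH_AGENT_TOKEN"
    # token used for requests that do not supply one explicitly
    default = "REPLACE_WITH_DEFAULT_TOKEN"
  }
}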
You will need a token for the node, and a policy attached to it; your policy may look along the lines of:
service "demo20221017" {
policy = "write"
}
service "demo20221017-sidecar-proxy" {
policy = "write"
}
but refer to the following for specifics:
I'm also having the same issue with 1.14.2. I'm playing with the rpc_client_timeout, but no luck so far.
so I've been on 1.11 due to Connect CA changes that broke my arrangement (requiring federated clusters to share a common Vault cluster is bunk)
@quinndiggitypolymath : Can you share more about the Connect CA change that made your 1 Vault cluster : 1 Consul cluster setup stop working? I had thought that WAN federated Consul clusters could use different Vault clusters. And if they can't in your experience, that's something I'm interested in following up on. It would be preferable from a latency and resilience perspective to have a Vault cluster in the same region as the Consul cluster it acts as the Connect CA for.
Is this somehow related to issue #10603?
Seems like migrating to consul 1.14.4 fixes this issue on my side
Seems like migrating to consul 1.14.4 fixes this issue on my side
Yes, I confirmed 1.14.4 fixed this warning message.
In fact, after 1 night of operations, it is drastically reduced but still present. It went down from 100-150 occurrences/hour to 1 or 2 every 2-3 hours (the previously installed version was 1.14.3).
Thanks @obourdon. This does seem to be a dupe of #10603, which was just closed. Please note that it still occurs on agent startup, which is likely why you still see this issue; that is tracked here: https://github.com/hashicorp/consul/issues/15821. I'll go ahead and close this issue as there is now a separate issue tracking the agent startup WARN logs.
Hello,
I use Consul 1.13.1. I have two servers (as an example): 10.10.10.1 and 10.10.10.2, and I set up a consul server on both. The consul.json on each server is the same apart from bind_addr:
{
  "bind_addr": "10.10.10.1",
  "client_addr": "0.0.0.0",
  "datacenter": "datacenter-01",
  "bootstrap_expect": 3,
  "data_dir": "/var/lib/consul",
  "encrypt": "",
  "disable_update_check": true,
  "server": true,
  "ui": true,
  "rejoin_after_leave": true,
  "retry_join": ["10.10.10.1","10.10.10.2","......."],
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "tokens": {
      "agent": ""
    }
  }
}
{
  "bind_addr": "10.10.10.2",
  "client_addr": "0.0.0.0",
  "datacenter": "datacenter-01",
  "bootstrap_expect": 3,
  "data_dir": "/var/lib/consul",
  "encrypt": "",
  "disable_update_check": true,
  "server": true,
  "ui": true,
  "rejoin_after_leave": true,
  "retry_join": ["10.10.10.1","10.10.10.2","......."],
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "tokens": {
      "agent": ""
    }
  }
}
The consul members output looks like this:
Node  Address          Status  Type    Build   Protocol  DC             Partition  Segment
ha1   10.10.10.1:8301  alive   server  1.13.1  2         datacenter-01  default
ha2   10.10.10.2:8301  alive   server  1.13.1  2         datacenter-01  default
But I get the following error on both servers: [WARN] agent: [core]grpc: addrConn.createTransport failed to connect to {ha1:8300 ha1.compute.internal 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 10.10.10.2:0->10.10.10.1:8300: operation was canceled". Reconnecting...
Port 8300 is used by the consul service on both servers. I checked the port with telnet and there is no problem:
telnet 10.10.10.1 8300
Trying 10.10.10.1...
Connected to 10.10.10.1.
Escape character is '^]'.
I did not get this error with Consul 1.12.1. Is this a bug in Consul 1.13.1?
Thanks, Seda