hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool #10603

Closed shellfu closed 1 year ago

shellfu commented 3 years ago

Note from @lkysow: I'm moving this to hashicorp/consul because the discuss post shows a user on EC2 also saw this error.

Overview of the Issue

A new 1.10.0 install on a new K8s cluster results in [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.200.65.16:8300 0 consul-server-2.primary <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.65.16:8300: operation was canceled". Reconnecting...

These WARNs appear on both the servers and the clients.

Reproduction Steps

  1. When running helm install with the following values.yml:
    client:
      enabled: true
      grpc: true
    connectInject:
      aclBindingRuleSelector: serviceaccount.name!=default
      default: false
      enabled: true
      metrics:
        defaultEnableMerging: true
        defaultEnabled: true
        defaultMergedMetricsPort: 20100
        defaultPrometheusScrapePath: /metrics
        defaultPrometheusScrapePort: 20200
      transparentProxy:
        defaultEnabled: true
        defaultOverwriteProbes: true
    controller:
      enabled: true
    dns:
      enabled: true
    global:
      acls:
        createReplicationToken: true
        manageSystemACLs: true
      datacenter: primary
      enabled: true
      federation:
        createFederationSecret: true
        enabled: true
      gossipEncryption:
        secretKey: key
        secretName: consul-gossip-encryption-key
      image: hashicorp/consul:1.10.0
      imageEnvoy: envoyproxy/envoy-alpine:v1.18.3
      imageK8S: hashicorp/consul-k8s:0.26.0
      logJSON: true
      metrics:
        agentMetricsRetentionTime: 1m
        enableAgentMetrics: false
        enableGatewayMetrics: true
        enabled: true
      name: consul
      tls:
        enableAutoEncrypt: true
        enabled: true
        httpsOnly: true
        serverAdditionalDNSSANs:
        - '*.consul'
        - '*.svc.cluster.local'
        - '*.my.customdomain.com'
        verify: false
    meshGateway:
      enabled: true
      service:
        enabled: true
        port: 443
        type: LoadBalancer
      wanAddress:
        port: 443
        source: Service
    server:
      bootstrapExpect: 5
      connect: true
      disruptionBudget:
        enabled: true
        maxUnavailable: 2
      enabled: true
      extraConfig: "{\n  \"primary_datacenter\": \"primary\",\n  \"performance\": {\n
        \     \"raft_multiplier\": 3\n  },\n  \"dns_config\": {\n    \"allow_stale\":
        true,\n    \"cache_max_age\": \"10s\",\n    \"enable_additional_node_meta_txt\":
        false,\n    \"node_ttl\": \"1m\",\n    \"soa\": {\n        \"expire\": 86400,
        \n        \"min_ttl\": 30,\n        \"refresh\": 3600,\n        \"retry\": 600\n
        \   },\n    \"use_cache\": true\n}}"
      replicas: 5
      resources:
        limits:
          cpu: 500m
          memory: 10Gi
        requests:
          cpu: 500m
          memory: 10Gi
      storage: 10Gi
      updatePartition: 0
    syncCatalog:
      default: true
      enabled: true
      nodePortSyncType: ExternalFirst
      syncClusterIPServices: true
      toConsul: true
      toK8S: true
    ui:
      enabled: true
      metrics:
        baseURL: http://mon-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local
        enabled: true
        provider: prometheus
      service:
        enabled: true
        type: NodePort

Expected behavior

WARNs should not be flooding the log, and connections should be over 8301, not 8300.

Additional Context

It seems others are experiencing the same problem. https://discuss.hashicorp.com/t/grpc-warning-on-consul-1-10-0/26237

lkysow commented 3 years ago

Hi, based on https://discuss.hashicorp.com/t/grpc-warning-on-consul-1-10-0/26237 it sounds like this issue is not specific to Kubernetes. I'm going to move this to hashicorp/consul.

dnephin commented 3 years ago

Thank you for reporting this issue!

I was just running a Consul agent locally to debug a different issue and I noticed this problem happens at the same time as these 2 debug lines:

2021-07-13T20:26:20.707Z [DEBUG] agent.router.manager: Rebalanced servers, new active server: number_of_servers=2 active_server="a19bd98836ec.dc1 (Addr: tcp/172.20.0.2:8300) (DC: dc1)"
2021-07-13T20:26:20.707Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {172.20.0.3:8300 0 0cc9dd0254a2.dc1 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.20.0.3:8300: operation was canceled". Reconnecting...
2021-07-13T20:26:20.707Z [DEBUG] agent.router.manager: Rebalanced servers, new active server: number_of_servers=1 active_server="a19bd98836ec (Addr: tcp/172.20.0.2:8300) (DC: dc1)"

The problem seems to be that when we rebalance the servers the active transport is cancelled, which causes this error to be printed.
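
A minimal sketch of why a cancelled dial reads like a connection failure. It uses Python's asyncio as a stand-in and is not Consul's or gRPC's code; the `dial` and `rebalance_during_dial` names and the 127.0.0.1:8300 target are assumptions for illustration only. The dial task is cancelled before the event loop runs it, so nothing touches the network, yet the caller still sees a failed attempt:

```python
import asyncio


async def dial(host: str, port: int) -> str:
    """Open and immediately close a TCP connection (stand-in for a transport dial)."""
    _reader, writer = await asyncio.open_connection(host, port)
    writer.close()
    await writer.wait_closed()
    return "connected"


async def rebalance_during_dial() -> str:
    """Start a dial, then cancel it the way a pool rebalance might."""
    # Hypothetical target; no packet is sent because the task is cancelled
    # before the event loop ever runs it.
    pending = asyncio.create_task(dial("127.0.0.1", 8300))
    pending.cancel()  # the "rebalance" abandons the in-flight attempt
    try:
        return await pending
    except asyncio.CancelledError:
        # The failure is a cancellation, not a refused or timed-out connection.
        return "operation was canceled"


result = asyncio.run(rebalance_during_dial())
print(result)
```

The error in this analogy says nothing about the health of the server being dialed; the attempt was abandoned by the caller.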

mikemorris commented 3 years ago

Is the issue here that the behavior is potentially incorrect, or that a common occurrence is erroneously categorized at WARN log level?

shellfu commented 3 years ago

I've installed in a couple of other locations with the same chart and values as above, and within a single datacenter the WARN messages refer to the other consul-servers in the cluster. This occurs whether or not the cluster is WAN federated; federation doesn't appear to have an impact.

Currently, trying to track a couple of network issues I have been experiencing in consul 1.10.

I am trying to obtain more evidence but I deleted the 1.10 cluster and went back to 1.8.4 and it did not appear to have the WARN. Can this be ignored? Not sure yet.

ikonia commented 3 years ago

Seeing an exact mirror of this problem on a small development cluster running on Raspberry Pi 4s in a very basic configuration, all running Consul 1.10.1.

The errors in my case are the server talking to itself.

E.g., my 3 Raft servers are made up of 3 nodes, called:

nog wesley jake

Node                 ID                                    Address             State     Voter  RaftProtocol
wesley.no-dns.co.uk  5e8a186b-adb5-ebba-eeb4-e10656568adf  10.11.216.81:8300   leader    true   3
nog.no-dns.co.uk     086a7491-bf09-c7e2-9151-74c817ffb74c  10.11.216.182:8300  follower  true   3
jake.no-dns.co.uk    aa37ec78-d438-8726-77e5-c5619dfb054a  10.11.216.234:8300  follower  true   3

In nog's log:

Jul 21 10:21:35 nog consul[5613]: 2021-07-21T10:21:35.048Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.11.216.182:8300 0 nog.no-dns.co.uk.bathstable }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.11.216.182:8300: operation was canceled". Reconnecting...

The IP address 10.11.216.182 is actually the IP address of the host 'nog', so the error is about nog talking to itself.

On the host 'jake', the log shows the same failure to connect to the host nog:

Jul 21 10:17:17 jake consul[7781]: 2021-07-21T10:17:17.833Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.11.216.182:8300 0 nog.no-dns.co.uk.bathstable }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.11.216.182:8300: operation was canceled". Reconnecting...

On the host wesley (leader):

Jul 21 10:19:19 wesley consul[15814]: 2021-07-21T10:19:19.779Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.11.216.81:8300 0 wesley.no-dns.co.uk.bathstable }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.11.216.81:8300: operation was canceled". Reconnecting...

The 10.11.216.81 IP address it's failing to talk to is wesley, itself in this case.
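
The "is this node dialing itself?" check can be done mechanically. The sketch below pulls the dial target out of a single-line WARN and compares it with a node's own IP. The regex and the `dial_target` and `is_self_dial` helpers are assumptions written for this thread, not part of any Consul tooling, and they only handle the one-line form of the message with an IPv4 target:

```python
import re
from typing import Optional, Tuple

# Matches the dial target inside "failed to connect to {...}".
# Some log lines in this thread prefix the address with the datacenter
# name (for example "dc1-192.168.200.16:8300"), so any non-space prefix
# before the IPv4 address is skipped.
TARGET_RE = re.compile(
    r"failed to connect to \{\S*?(\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)

# Log line copied from nog's log above.
NOG_LINE = (
    '2021-07-21T10:21:35.048Z [WARN] agent: grpc: addrConn.createTransport '
    'failed to connect to {10.11.216.182:8300 0 nog.no-dns.co.uk.bathstable }. '
    'Err :connection error: desc = "transport: Error while dialing dial tcp '
    '10.11.216.182:8300: operation was canceled". Reconnecting...'
)


def dial_target(line: str) -> Optional[Tuple[str, int]]:
    """Return (ip, port) of the server being dialed, or None if not found."""
    match = TARGET_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_self_dial(line: str, own_ip: str) -> bool:
    """True when the WARN is about dialing the node's own address."""
    target = dial_target(line)
    return target is not None and target[0] == own_ip


print(dial_target(NOG_LINE))
print(is_self_dial(NOG_LINE, "10.11.216.182"))  # nog's own address
print(is_self_dial(NOG_LINE, "10.11.216.234"))  # jake's address
```

Run against each node's log with that node's address, this separates self-dial warnings (nog, wesley) from warnings about a peer (jake dialing nog).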

shellfu commented 3 years ago

Yeah, that is what I am seeing. It's the local cluster that is emitting these messages and FLOODING the logs

kornface13 commented 3 years ago

I see the same issue. Three-node cluster running on VMs (CentOS 8).

Consul v1.10.0 Revision 27de64da7, running on CentOS 8 VMs.

All three nodes sporadically spit out an error about connecting to one of the other master nodes.

agent: grpc: addrConn.createTransport failed to connect to {10.248.14.54:8300 0 consul02.c.blah.internal <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.248.14.54:8300: operation was canceled". Reconnecting...

shellfu commented 3 years ago

I went ahead and did more installation tests.

I installed Consul 1.10.1 and Chart 0.32.1, then stepped the Consul and chart versions back down all the way to 1.8.4 and 0.26.0, as I am also experiencing other problems that are not related to this issue.

The WARNS appear in the latest 1.10.x versions and they are emitted in both the local datacenter as well as a federated environment using mesh gateways.

sri4kanne commented 3 years ago

We are also seeing similar errors, but there seems to be no issue with the cluster itself. We are running Consul v1.10.1 on OL8 VMs.

therealhanlin commented 3 years ago

@ikonia I'm running Consul on two Raspberry Pis and ran into this issue a couple of days ago as well. In case you haven't found a solution, I seem to have found the cause of my issue.

I first noticed the inconsistency in the member status shown on each Pi. As you can see from the snippets below, Pi 1 seems to think that 02 is leaving the cluster, while 02 thinks it's still in it. So I restarted the consul service on 02 and that fixed the issue.

I think the problem was caused by starting the services on both Pis at the same time: the nodes didn't negotiate properly, and somehow that caused this weird bug. I use Ansible to stage the nodes and manage their configs, and whenever I changed the configs, it restarted the services on all nodes at the same time, which is not a good idea .... (duh ... ).

I'm not sure what your setup is, but maybe try to spin up the nodes one by one, which solved the problem for me.

raspi01:
Node           Address            Status   Type    Build   Protocol  DC     Segment
raspi01.raspi  192.168.1.10:8302  alive    server  1.10.1  2         raspi  <all>
raspi02.raspi  192.168.1.11:8302  leaving  server  1.10.1  2         raspi  <all>

raspi02:
Node           Address            Status  Type    Build   Protocol  DC     Segment
raspi01.raspi  192.168.1.10:8302  alive   server  1.10.1  2         raspi  <all>
raspi02.raspi  192.168.1.11:8302  alive   server  1.10.1  2         raspi  <all>

shellfu commented 3 years ago

Restarting in the way you describe unfortunately does not solve the problem here. The clusters appear to be healthy otherwise, but this is flooding the logs, and I do not think we have received a response as to whether the messages are indicative of an issue or are something that can be ignored while waiting for a patch.

dnephin commented 3 years ago

I believe these messages can be ignored. We periodically rebalance servers in the connect pool, and it looks like doing so is causing gRPC to emit these warnings. It seems like gRPC is reconnecting after the rebalance, so likely we can move these messages to INFO instead of WARN, but we'll need to do more investigation to be sure.

isality commented 3 years ago

+1 It seems like not ok.

consul v1.10.1

avoidik commented 3 years ago

we're having the same issue on ent version, will try to raise support ticket there

drawks commented 3 years ago

:100: this should be moved to an info level log, normal system behavior that doesn't result in any degradation and self heals should not be something we are warned about.

Peter-consul commented 3 years ago

I have the same issue after upgrading to 1.10.2.

weastur commented 3 years ago

@dnephin Is there any chance to fix this in an upcoming release?

jkirschner-hashicorp commented 3 years ago

To clarify our current understanding of this: this is not a bug, but instead a misclassified log message (that shouldn't be WARN).

Per @dnephin:

I believe these messages can be ignored. We periodically rebalance servers in the connect pool, and it looks like doing so is causing gRPC to emit these warnings. It seems like gRPC is reconnecting after the rebalance, so likely we can move these messages to INFO instead of WARN, but we'll need to do more investigation to be sure.

In this case, the aforementioned "need to do more investigation" is about how to make the change to reduce verbosity, not about the cause or whether there's a bug. The change requires some investigation because the message is emitted by gRPC, not Consul.

adamw-linadm commented 2 years ago

Exactly the same error on 1.11.1, bare metal, CentOS 7.

dnephin commented 2 years ago

If this log message was coming directly from Consul this would be much easier to fix. Unfortunately the log message is coming from a library (gRPC), which makes it a bit harder to fix.

I think we have two options for addressing this:

  1. change all gRPC WARN messages to INFO
  2. change how we modify the grpc ClientConnPool so that it does not warn

Option 1 is pretty safe, but I'm not sure if it fixes much. There will still be an INFO log message that is printed periodically. I guess it is slightly better to print this as an INFO than a WARN. The downside is that other gRPC WARN messages may not be visible enough in logs at INFO level.

Option 2 is much more involved, but is likely a safer long term fix. I believe the cause of this warning is this code: https://github.com/hashicorp/consul/blob/d20230fac1e89678ba6f5e26bad4d2fff99fe9f2/agent/grpc/resolver/resolver.go#L283-L287

If we trace those UpdateState calls, we'll see we can end up in the code that logs that warning. My rough understanding is that by updating the client conn state we are cancelling the dial operation, which prints this message. It may be that calling UpdateState twice like this is what triggers the warning. Under normal operation I guess the dial would complete and we wouldn't see a warning. This also shows why we can't just hide the message, because if the dial operation was failing for some other reason, we'd want to know about it. The "further investigation" would be to see if we could remove the need to call UpdateState twice, and to confirm that is sufficient to prevent the warning.
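
To illustrate the trade-off behind option 1, here is a sketch using Python's standard `logging` module as a stand-in for a log pipeline; it is not how Consul or gRPC handle log levels. It relabels only the cancelled-dial message from WARN to INFO and leaves every other dial failure at WARN. The `is_canceled_dial`, `DowngradeCanceledDial`, and `relabel` names are hypothetical, and the "connection refused" line is an illustrative sample rather than a log taken from this thread:

```python
import logging

# Both substrings must be present for a line to count as the benign case.
CANCELED_DIAL = (
    "addrConn.createTransport failed to connect",
    "operation was canceled",
)


def is_canceled_dial(message: str) -> bool:
    return all(part in message for part in CANCELED_DIAL)


class DowngradeCanceledDial(logging.Filter):
    """Relabel cancelled-dial warnings as INFO; leave everything else alone."""

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno == logging.WARNING and is_canceled_dial(record.getMessage()):
            record.levelno = logging.INFO
            record.levelname = logging.getLevelName(logging.INFO)
        return True  # never drop a record, only relabel it


def relabel(message: str) -> str:
    """Build a WARNING record, run it through the filter, return its level name."""
    record = logging.LogRecord("agent", logging.WARNING, "consul.log", 0, message, None, None)
    DowngradeCanceledDial().filter(record)
    return record.levelname


canceled = (
    'grpc: addrConn.createTransport failed to connect to {172.20.0.3:8300 0 '
    '0cc9dd0254a2.dc1 <nil>}. Err :connection error: desc = "transport: Error '
    'while dialing dial tcp 172.20.0.3:8300: operation was canceled". Reconnecting...'
)
# Illustrative sample of a dial failure that should stay visible.
refused = (
    'grpc: addrConn.createTransport failed to connect to {172.20.0.3:8300 0 '
    '0cc9dd0254a2.dc1 <nil>}. Err :connection error: desc = "transport: Error '
    'while dialing dial tcp 172.20.0.3:8300: connect: connection refused". Reconnecting...'
)

print(relabel(canceled))
print(relabel(refused))
```

A blanket WARN-to-INFO change would relabel both lines; matching on the cancellation text keeps a genuinely failing dial at WARN, which is the visibility concern raised above.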

idrennanvmware commented 2 years ago

We're seeing a pretty healthy amount of these messages as well across our clusters. Keeping our eyes on this.

Given the above, option 2 is definitely our preference. Not sure we want to even get a message in this case unless it's something to be concerned about

danlsgiga commented 2 years ago

Still happening on 1.11.2 running on my homelab.

+1 to Option 2

NagenderPulluri commented 2 years ago

Can we ignore this error if everything is working as expected? Or do we need to be concerned about this warning/error?

Amier3 commented 2 years ago

Hey @nagender1005

Yes you can ignore this warning if everything is working as expected. Per earlier in this thread:

I believe these messages can be ignored. We periodically rebalance servers in the connect pool, and it looks like doing so is causing gRPC to emit these warnings. It seems like gRPC is reconnecting after the rebalance, so likely we can move these messages to INFO instead of WARN, but we'll need to do more investigation to be sure.

Hope this helps!

chrisvanmeer commented 2 years ago

Still happening in Consul v1.11.3 as well.

scottnemes commented 2 years ago

Still happening in Consul v1.12.0 .

ikonia commented 2 years ago

We are approaching this issue being open for a year now. Is there a clear understanding of it and a plan to resolve it? Even if it's just a misclassified error, it should be easy to remediate and remove the confusion.

kisunji commented 2 years ago

Hey everyone, sorry for the inconvenience but as stated above, these WARNs are being emitted from the grpc library itself and we cannot easily suppress them without potentially hiding other valid WARN logs from grpc.

We are looking into Option 2 as outlined by @dnephin above but it is still an investigation in progress.

cr0c0dylus commented 2 years ago

Consul v1.12.0 Revision 09a8cdb4

May 17 11:22:15 master1 consul[1095]: 2022-05-17T11:22:15.647+0300 [WARN] agent: [core]grpc: addrConn.createTransport failed to connect to {fsn1-10.44.0.4:8300 master3.fsn1 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 10.44.0.2:0->10.44.0.4:8300: operation was canceled". Reconnecting...

cr0c0dylus commented 2 years ago

Consul v1.12.3 Revision 2308c75e

Jul 14 14:38:12 master2 consul[883807]: agent: [core]grpc: addrConn.createTransport failed to connect to {hel1-95.XX.XX.47:8300 master2.hel1 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 95.XX.XX.47:0->95.XX.XX.47:8300: operation was canceled". Reconnecting...

Jul 14 14:49:07 master2 consul[883807]: 2022-07-14T14:49:07.549+0300 [WARN] agent: [core]grpc: addrConn.createTransport failed to connect to {hel1-95.XX.XX.194:8300 master3.hel1 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 95.XX.XX.47:0->95.XX.XX.194:8300: operation was canceled". Reconnecting...

viniciusbmello commented 2 years ago

Consul v1.13.0 Revision 8c237209

2022-08-11T19:41:42.147Z [WARN] agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-192.168.128.4:8300 ubconsul01 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp ->192.168.128.4:8300: operation was canceled". Reconnecting...

JonasSchlaak commented 2 years ago

I understand those warning messages don't come from Consul itself, but from the gRPC library it uses. However, I would expect to be able to get rid of the warnings by turning off everything gRPC-related in the Consul config. Despite setting "use_streaming_backend" and "rpc.enable_streaming" to false, I'm still getting the warnings. Is there a way to turn off everything gRPC-related?

roman-vynar commented 2 years ago

Joining this flashmob 1.13.1 and not using grpc at all:

2022-09-05T14:51:22.164Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {us-west-2-10.0.7.132:8300 node2.us-west-2 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.0.7.132:8300: operation was canceled". Reconnecting...

alt-dima commented 2 years ago

Joining too 1.13.2

2022-10-15T07:53:46.287Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {ca-central-1-production-3-10.113.11.204:8300 ip-10-113-11-204.service.consul <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.113.11.204:8300: operation was canceled". Reconnecting...

saintmalik commented 2 years ago

Joining too 1.13.2

2022-10-15T07:53:46.287Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {ca-central-1-production-3-10.113.11.204:8300 ip-10-113-11-204.service.consul <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.113.11.204:8300: operation was canceled". Reconnecting...

also experiencing this

humrobin commented 2 years ago

Me too

consul version 1.13.1 / 1.11.10 / 1.12.3

{"@level":"warn","@message":"[core]grpc: addrConn.createTransport failed to connect to {dc1-192.168.200.16:8300 consul1.dc1 \u003cnil\u003e 0 \u003cnil\u003e}. Err: connection error: desc = \"transport: Error while dialing dial tcp 192.168.200.17:0-\u003e192.168.200.16:8300: operation was canceled\". Reconnecting...","@module":"agent","@timestamp":"2022-10-17T10:43:22.346739+08:00"}

saintmalik commented 2 years ago

It appears in all 3 consul-servers and even in the consul-client. This is on an EKS cluster, though.

2022-10-16T13:06:06.658Z [INFO]  agent: Joining cluster...: cluster=LAN
2022-10-16T13:06:06.658Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.plo.svc:8301, consul-consul-server-1.consul-consul-server.plo.svc:8301, consul-consul-server-2.consul-consul-server.plo.svc:8301]
2022-10-16T13:06:06.658Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:06.658Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2022-10-16T13:06:06.758Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve consul-consul-server-0.consul-consul-server.plo.svc:8301: lookup consul-consul-server-0.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
2022-10-16T13:06:06.774Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve consul-consul-server-1.consul-consul-server.plo.svc:8301: lookup consul-consul-server-1.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
2022-10-16T13:06:07.255Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:07.255Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:36786 error="No known Consul servers"
2022-10-16T13:06:08.780Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve consul-consul-server-2.consul-consul-server.plo.svc:8301: lookup consul-consul-server-2.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
2022-10-16T13:06:08.780Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="3 errors occurred:
        * Failed to resolve consul-consul-server-0.consul-consul-server.plo.svc:8301: lookup consul-consul-server-0.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
        * Failed to resolve consul-consul-server-1.consul-consul-server.plo.svc:8301: lookup consul-consul-server-1.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
        * Failed to resolve consul-consul-server-2.consul-consul-server.plo.svc:8301: lookup consul-consul-server-2.consul-consul-server.plo.svc on 10.100.0.10:53: no such host

"
2022-10-16T13:06:08.780Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error="3 errors occurred:
        * Failed to resolve consul-consul-server-0.consul-consul-server.plo.svc:8301: lookup consul-consul-server-0.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
        * Failed to resolve consul-consul-server-1.consul-consul-server.plo.svc:8301: lookup consul-consul-server-1.consul-consul-server.plo.svc on 10.100.0.10:53: no such host
        * Failed to resolve consul-consul-server-2.consul-consul-server.plo.svc:8301: lookup consul-consul-server-2.consul-consul-server.plo.svc on 10.100.0.10:53: no such host

"
2022-10-16T13:06:11.963Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:11.963Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:37104 error="No known Consul servers"
2022-10-16T13:06:21.933Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:21.933Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:55054 error="No known Consul servers"
2022-10-16T13:06:24.006Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:24.006Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2022-10-16T13:06:31.955Z [WARN]  agent.router.manager: No servers available
2022-10-16T13:06:31.955Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:54526 error="No known Consul servers"
2022-10-16T13:06:38.781Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.plo.svc:8301, consul-consul-server-1.consul-consul-server.plo.svc:8301, consul-consul-server-2.consul-consul-server.plo.svc:8301]
2022-10-16T13:06:38.789Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-172-31-34-164.us-west-2.compute.internal 172.31.36.203
2022-10-16T13:06:38.789Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-consul-server-2 172.31.21.131
2022-10-16T13:06:38.789Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-consul-server-0 172.31.47.142
2022-10-16T13:06:38.790Z [INFO]  agent.client: adding server: server="consul-consul-server-2 (Addr: tcp/172.31.21.131:8300) (DC: dc1)"
2022-10-16T13:06:38.790Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: consul-consul-server-1 172.31.50.93
2022-10-16T13:06:38.790Z [INFO]  agent.client: adding server: server="consul-consul-server-0 (Addr: tcp/172.31.47.142:8300) (DC: dc1)"
2022-10-16T13:06:38.790Z [INFO]  agent.client: adding server: server="consul-consul-server-1 (Addr: tcp/172.31.50.93:8300) (DC: dc1)"
2022-10-16T13:06:38.790Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.31.21.131:8300 consul-consul-server-2 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.31.21.131:8300: operation was canceled". Reconnecting...
2022-10-16T13:06:38.851Z [INFO]  agent: (LAN) joined: number_of_nodes=3
2022-10-16T13:06:38.851Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2022-10-16T13:06:38.946Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-172-31-57-221.us-west-2.compute.internal 172.31.52.120
2022-10-16T13:06:40.148Z [INFO]  agent: Synced node info
2022-10-16T13:15:57.624Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.31.50.93:8300 consul-consul-server-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.31.50.93:8300: operation was canceled". Reconnecting...
2022-10-16T13:26:17.412Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.31.21.131:8300 consul-consul-server-2 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.31.21.131:8300: operation was canceled". Reconnecting...
2022-10-16T15:20:02.196Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.31.50.93:8300 consul-consul-server-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.31.50.93:8300: operation was canceled". Reconnecting...

rasdark commented 2 years ago

1.13.3 me too

Oct 20 22:53:03 pg03 consul[2436375]: 2022-10-20T22:53:03.343+0300 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {consul01-10.161.112.45:8300 consul-srv01 <nil> 0 <nil>}>
Oct 20 22:55:21 pg03 consul[2436375]: 2022-10-20T22:55:21.945+0300 [INFO]  agent: Synced check: check=service:pgcluster/pg03

hoanbc commented 2 years ago

Same issue with on-premises k8s running version 1.13.3.

luboss79 commented 1 year ago

The same for me: 3-node cluster, CentOS 7, Consul v1.13.3.

server:

Nov 18 10:41:06 risconsul-03 consul: 2022-11-18T10:41:06.341+0100 [WARN] agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-172.16.219.54:8300 risconsul-03.dc.local <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.16.219.54:8300: operation was canceled". Reconnecting...

client:

Nov 18 12:32:02 murmcs-ft-01 consul[7105]: agent.http: Request error: method=GET url=/v1/catalog/services?wait=2s&index=42023284&token=<hidden> from=127.0.0.1:46842 error="rpc error making call: i/o deadline reached"
Nov 18 12:32:02 murmcs-ft-01 risng-mur-ui: com.ecwid.consul.v1.OperationException: OperationException(statusCode=500, statusMessage='Internal Server Error', statusContent='rpc error making call: i/o deadline reached')

luboss79 commented 1 year ago

Is there any progress on this issue, please? It makes troubleshooting apps difficult when there are numerous error messages in the log files. Thanks.

weastur commented 1 year ago

With all respect, it's quite frustrating to see that kind of issue here for so long.

yangds2016 commented 1 year ago

Still happening in Consul v1.14.1 .


[WARN]  agent: [core][Channel #1 SubChannel #370] grpc: addrConn.createTransport failed to connect to {
"Addr": "dc1-12.2.100.12:8300",
"ServerName": "consul-3.dc1",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp 12.x.x.10:0->12.x.x.12:8300: operation was canceled"

edupr91 commented 1 year ago

Still happening in Consul v1.14.1

Dec 08 09:57:12 s1.mobydick.local consul[70543]: 2022-12-08T09:57:12.918Z [WARN]  agent: [core][Channel #1 SubChannel #71] grpc: addrConn.createTransport failed to connect to {
Dec 08 09:57:12 s1.mobydick.local consul[70543]:   "Addr": "dc1-10.xx.xx.12:8300",
Dec 08 09:57:12 s1.mobydick.local consul[70543]:   "ServerName": "s1-mobydick-local.dc1",
Dec 08 09:57:12 s1.mobydick.local consul[70543]:   "Attributes": null,
Dec 08 09:57:12 s1.mobydick.local consul[70543]:   "BalancerAttributes": null,
Dec 08 09:57:12 s1.mobydick.local consul[70543]:   "Type": 0,
Dec 08 09:57:12 s1.mobydick.local consul[70543]:   "Metadata": null
Dec 08 09:57:12 s1.mobydick.local consul[70543]: }. Err: connection error: desc = "transport: Error while dialing dial tcp 10.xx.xx.11:0->10.xx.xx.12:8300: operation was canceled"

jlrcontegix commented 1 year ago

We have the same "issue" on v1.14.2.

jkirschner-hashicorp commented 1 year ago

Hi all,

I wanted to acknowledge that we're aware that this issue still exists in the latest versions of Consul, and that it's troublesome to have a WARN appear in your logs that doesn't need attention and may distract from unrelated troubleshooting efforts.

We're still looking into approaches similar to option 2 above. Our first attempt wasn't viable. Unfortunately, this issue is much less straightforward to resolve than it sounds.

We are now assessing a different approach, but it is still an investigation in progress and we don't yet know whether it will be viable. We'll update you here as we learn more about the viability of this different approach.

jlrcontegix commented 1 year ago

Per @jkirschner-hashicorp:

Hi all,

I wanted to acknowledge that we're aware that this issue still exists in the latest versions of Consul, and that it's troublesome to have a WARN appear in your logs that doesn't need attention and may distract from unrelated troubleshooting efforts.

We're still looking into approaches similar to option 2 above. Our first attempt wasn't viable. Unfortunately, this issue is much less straightforward to resolve than it sounds.

We are now assessing a different approach, but it is still an investigation in progress and we don't yet know whether it will be viable. We'll update you here as we learn more about the viability of this different approach.

It may be worth mentioning somewhere under https://developer.hashicorp.com/consul/docs/upgrading since the issue has been around a while, and sounds like it may be for some time. We upgraded from 1.9.3 and while we didn't experience any problems these messages gave us quite a bit of panic until this thread was found.

kisunji commented 1 year ago

PR https://github.com/hashicorp/consul/pull/15701 has been merged and should land in the next patch versions for 1.14.x, 1.13.x and 1.12.x. Keeping this issue open for now to wait for feedback from the community.

My PR should fix the periodic WARN logs during server shuffling which occurs every ~2 mins by default.

Note that you may continue to encounter some WARNs on agent startup and on very infrequent occasions. This is a related but separate issue https://github.com/hashicorp/consul/issues/15821

kong62 commented 1 year ago

same on consul 1.14.2

2023-01-06T05:49:31.453Z [WARN]  agent: [core][Channel #1 SubChannel #1017602] grpc: addrConn.createTransport failed to connect to {
  "Addr": "dc1-192.168.48.74:8300",
  "ServerName": "consul-server-2.dc1",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->192.168.48.74:8300: operation was canceled"

jkirschner-hashicorp commented 1 year ago

Hi @kong62,

The issue is expected to still be present in 1.14.2. It will be fixed as of the next set of patch releases (1.14.4, 1.13.6, and 1.12.9).