hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool #10603

Closed · shellfu closed this issue 1 year ago

shellfu commented 3 years ago

Note from @lkysow: I'm moving this to hashicorp/consul because the discuss post shows a user on EC2 also saw this error.

Overview of the Issue

A fresh 1.10.0 install on a new K8s cluster results in [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.200.65.16:8300 0 consul-server-2.primary <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.65.16:8300: operation was canceled". Reconnecting...

These WARNs appear on both the servers and the clients.

Reproduction Steps

  1. When running helm install with the following values.yml (a sketch of the install command is shown after the values):
    client:
      enabled: true
      grpc: true
    connectInject:
      aclBindingRuleSelector: serviceaccount.name!=default
      default: false
      enabled: true
      metrics:
        defaultEnableMerging: true
        defaultEnabled: true
        defaultMergedMetricsPort: 20100
        defaultPrometheusScrapePath: /metrics
        defaultPrometheusScrapePort: 20200
      transparentProxy:
        defaultEnabled: true
        defaultOverwriteProbes: true
    controller:
      enabled: true
    dns:
      enabled: true
    global:
      acls:
        createReplicationToken: true
        manageSystemACLs: true
      datacenter: primary
      enabled: true
      federation:
        createFederationSecret: true
        enabled: true
      gossipEncryption:
        secretKey: key
        secretName: consul-gossip-encryption-key
      image: hashicorp/consul:1.10.0
      imageEnvoy: envoyproxy/envoy-alpine:v1.18.3
      imageK8S: hashicorp/consul-k8s:0.26.0
      logJSON: true
      metrics:
        agentMetricsRetentionTime: 1m
        enableAgentMetrics: false
        enableGatewayMetrics: true
        enabled: true
      name: consul
      tls:
        enableAutoEncrypt: true
        enabled: true
        httpsOnly: true
        serverAdditionalDNSSANs:
        - '*.consul'
        - '*.svc.cluster.local'
        - '*.my.customdomain.com'
        verify: false
    meshGateway:
      enabled: true
      service:
        enabled: true
        port: 443
        type: LoadBalancer
      wanAddress:
        port: 443
        source: Service
    server:
      bootstrapExpect: 5
      connect: true
      disruptionBudget:
        enabled: true
        maxUnavailable: 2
      enabled: true
      extraConfig: |
        {
          "primary_datacenter": "primary",
          "performance": {
            "raft_multiplier": 3
          },
          "dns_config": {
            "allow_stale": true,
            "cache_max_age": "10s",
            "enable_additional_node_meta_txt": false,
            "node_ttl": "1m",
            "soa": {
              "expire": 86400,
              "min_ttl": 30,
              "refresh": 3600,
              "retry": 600
            },
            "use_cache": true
          }
        }
      replicas: 5
      resources:
        limits:
          cpu: 500m
          memory: 10Gi
        requests:
          cpu: 500m
          memory: 10Gi
      storage: 10Gi
      updatePartition: 0
    syncCatalog:
      default: true
      enabled: true
      nodePortSyncType: ExternalFirst
      syncClusterIPServices: true
      toConsul: true
      toK8S: true
    ui:
      enabled: true
      metrics:
        baseURL: http://mon-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local
        enabled: true
        provider: prometheus
      service:
        enabled: true
        type: NodePort
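
For reference, a minimal sketch of how these values might be applied; the release name, namespace, and repository setup are assumptions, not taken from the original report:

    # Sketch only: release name and namespace are assumed
    helm repo add hashicorp https://helm.releases.hashicorp.com
    helm install consul hashicorp/consul \
      --namespace consul --create-namespace \
      --values values.yml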

Expected behavior

WARNs should not flood the log, and connections should be made over port 8301, not 8300.

Environment details

Additional Context

It seems others are experiencing the same problem. https://discuss.hashicorp.com/t/grpc-warning-on-consul-1-10-0/26237

kong62 commented 1 year ago

Hi @kong62,

The issue is expected to still be present in 1.14.2. It will be fixed as of the next set of patch releases (1.14.4, 1.13.6, and 1.12.9).

Thank you.

chrisvanmeer commented 1 year ago

Can confirm that after upgrading to v1.14.4 I have not seen these messages during normal operation for a whole day now. Will continue monitoring.

david-yu commented 1 year ago

Thanks, I will go ahead and close this issue as it looks like this is fixed in 1.14.4 and later. Note that you may continue to encounter some WARNs on agent startup and on very infrequent occasions; that is a related but separate issue: https://github.com/hashicorp/consul/issues/15821

urosgruber commented 1 year ago

I've installed 1.12.9 and I can still see these messages. Not that frequent, but still:

2023-02-06T05:50:55.844+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.5:8300 consul-3 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.5:8300: operation was canceled". Reconnecting...
2023-02-06T05:57:43.781+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.3:8300 consul-1.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.3:8300: operation was canceled". Reconnecting...
2023-02-06T05:58:10.466+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-06T06:50:09.474+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-06T07:13:16.970+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-06T07:18:54.830+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-06T07:36:53.408+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.5:8300 consul-3 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.5:8300: operation was canceled". Reconnecting...
webant commented 1 year ago

On v1.14.4 this is still reproducible: agent: [core][Channel #1 SubChannel #21] grpc: addrConn.createTransport failed to connect to { "Addr": "xx.xx.xx.xx", "ServerName": "xx.xx.xx.xx", "Attributes": null, "BalancerAttributes": null, "Type": 0, "Metadata": null }. Err: connection error: desc = "transport: Error while dialing dial tcp ->xx.xx.xx.xx:8300: operation was canceled"

marianatkach commented 1 year ago

I have the same problem on my Consul cluster v1.14.4:

[WARN] agent: [core][Channel #1 SubChannel #15] grpc: addrConn.createTransport failed to connect to { "Addr": "xx.xx.xx.xx", "ServerName": "xx.xx", "Attributes": null, "BalancerAttributes": null, "Type": 0, "Metadata": null }. Err: connection error: desc = "transport: Error while dialing dial tcp ->xx.xx.xx.xx: operation was canceled"

kisunji commented 1 year ago

@urosgruber @webant @mexdevops Do those warns occur regularly? We still expect to see those logs as an agent starts up but they should become infrequent over time. https://github.com/hashicorp/consul/issues/15821 for reference

marianatkach commented 1 year ago

@urosgruber @webant @mexdevops Do those warns occur regularly? We still expect to see those logs as an agent starts up but they should become infrequent over time. https://github.com/hashicorp/consul/issues/15821 for reference

Yep, I get these messages from time to time and I can't get rid of them. I also don't have any further information about this problem...

nhenneaux commented 1 year ago

It appears every 2-3 days for each agent. It is not related to any startup; with several hundred agents we see a fairly constant flow of roughly 100 of these errors every day.

urosgruber commented 1 year ago

@urosgruber @webant @mexdevops Do those warns occur regularly? We still expect to see those logs as an agent starts up but they should become infrequent over time. #15821 for reference

With 3 servers, no agents connected, and a fresh install, I see about 10 messages a day at random.

kisunji commented 1 year ago

Could you provide log fragments with lines before and after you see these WARNs?

I'd like to confirm that they are not caused by this issue (server shuffling every ~2 mins) and would like to move the investigation to either #15821 or a new issue if necessary.
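
If it helps, one way to pull such fragments out of an existing log file is a context grep; the log path here is only an assumption about the local setup:

    # Sketch: show 10 lines of context around each occurrence (log path is assumed)
    grep -B 10 -A 10 "operation was canceled" /var/log/consul/consul.log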

urosgruber commented 1 year ago

Do you need INFO level? I believe there is nothing else if I set it to WARN.

kisunji commented 1 year ago

Do you need INFO level? I believe there is nothing else if I set it to WARN.

Yes, INFO would be appreciated.
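
For completeness, one way to get INFO-level output is to run the agent with the level set explicitly; the config directory below is an assumption about the local setup, and the same can also be set via log_level = "INFO" in the agent configuration:

    # Sketch: run the agent with INFO-level logging (config directory is assumed)
    consul agent -config-dir=/etc/consul.d -log-level=info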

urosgruber commented 1 year ago

2023-02-22T11:14:22.693+0100 [INFO]  agent: (LAN) joining: lan_addresses=[10.192.0.3, 10.192.0.4, 10.192.0.5]
2023-02-22T11:14:22.698+0100 [INFO]  agent: started state syncer
2023-02-22T11:14:22.699+0100 [INFO]  agent: Consul agent running!
2023-02-22T11:14:22.699+0100 [INFO]  agent: (LAN) joined: number_of_nodes=3
2023-02-22T11:14:22.699+0100 [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2023-02-22T11:14:22.720+0100 [WARN]  agent.server.raft: failed to get previous log: previous-index=48302 last-index=48300 error="log not found"
2023-02-22T11:14:23.024+0100 [INFO]  agent: Synced node info
2023-02-22T11:16:22.675+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.3:8300 consul-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.3:8300: operation was canceled". Reconnecting...
2023-02-22T12:21:01.609+0100 [WARN]  agent.server.raft: skipping application of old log: index=48795
2023-02-22T12:21:47.875+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.3:8300 consul-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.3:8300: operation was canceled". Reconnecting...
2023-02-22T13:09:58.024+0100 [INFO]  agent.fsm: snapshot created: duration=236.985µs
2023-02-22T13:09:58.025+0100 [INFO]  agent.server.raft: starting snapshot up to: index=49157
2023-02-22T13:09:58.025+0100 [INFO]  agent.server.snapshot: creating new snapshot: path=/var/db/consul/raft/snapshots/2-49157-1677067798025.tmp
2023-02-22T13:09:58.030+0100 [INFO]  agent.server.snapshot: reaping snapshot: path=/var/db/consul/raft/snapshots/2-16387-1676800266639
2023-02-22T13:09:58.031+0100 [INFO]  agent.server.raft: compacting logs: from=22534 to=38917
2023-02-22T13:09:58.056+0100 [INFO]  agent.server.raft: snapshot complete up to: index=49157
2023-02-22T13:46:17.188+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-22T14:15:21.288+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.3:8300 consul-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.3:8300: operation was canceled". Reconnecting...
2023-02-22T15:04:57.369+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
nhenneaux commented 1 year ago

On my side, there are no INFO logs around those anywhere in the cluster. Only a single agent is impacted at a time.

ikonia commented 1 year ago

Can we confirm whether this is fixed? It's marked as fixed, but there are still reports of the problem in this thread.

kisunji commented 1 year ago

Hello everyone,

I created issue #17842 to track the problem, which I consider separate from this one. Although they are related, it is helpful for us to triage them separately, given that the WARN logs are no longer frequent or regular.

If you are still seeing WARN logs ending with "operation was canceled", please help us investigate by providing DEBUG logs in #17842.
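
For anyone collecting those, DEBUG output can be streamed from a running agent without a restart; the output file name is just an example:

    # Sketch: capture DEBUG-level logs from a running agent while the WARN reproduces
    consul monitor -log-level=debug | tee consul-debug.log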