hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul streaming leaking rpc connections #17369

Open weichuliu opened 1 year ago

weichuliu commented 1 year ago

Overview of the Issue

The Consul client agent leaks gRPC connections (roughly one every few minutes) when serving streaming requests.

On 1.10+, with the streaming backend (use_streaming_backend) enabled, the Consul client agent handles blocking queries with streaming: it maintains a cache and establishes a "subscription" to changes over a gRPC connection to the Consul servers.
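
For reference, the blocking queries in question are ordinary HTTP long polls against the local client agent; a minimal illustrative example (the service name and index value are made up) is:

# Illustrative blocking query against the local client agent. With
# use_streaming_backend enabled, the agent answers it from its materialized
# view and keeps a gRPC subscription to the servers open behind the scenes.
curl 'http://127.0.0.1:8500/v1/health/service/web?index=1234&wait=5m'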

In our deployment, the Consul client agents constantly receive a lot of blocking queries. All subscriptions are supposed to share a single gRPC connection. However, we see consul.grpc.client.connections grow over time.

In the TRACE log, we found that the gRPC router manager constantly (every few minutes) refreshes the gRPC connection, i.e. creates a new gRPC connection to replace the old one.

2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1] Resolver state updated: {
   ...
} ()
2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1 SubChannel #1442] Subchannel created
2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1] Channel Connectivity change to IDLE
2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1 SubChannel #1442] addrConn: tryUpdateAddrs curAddr: {
  ...
}
2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1 SubChannel #1440] Subchannel Connectivity change to SHUTDOWN
2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1 SubChannel #1440] Subchannel deleted
2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1 SubChannel #1442] Subchannel Connectivity change to CONNECTING
2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1 SubChannel #1442] Subchannel picks a new address "server1:8300" to connect
2023-05-10T06:30:48.552Z [TRACE] agent: [core][Channel #1] Channel Connectivity change to CONNECTING
2023-05-10T06:30:48.553Z [DEBUG] agent.router.manager: Rebalanced servers, new active server: ...
2023-05-10T06:30:48.554Z [TRACE] agent: [core][Channel #1 SubChannel #1442] Subchannel Connectivity change to READY
2023-05-10T06:30:48.554Z [TRACE] agent: [core][Channel #1] Channel Connectivity change to READY

During such an event, consul.grpc.client.connections usually goes from x to x+1 and back to x, which is the new connection replacing the old one.

However, when there are active blocking queries, the old subscriptions keep holding the old gRPC connection and prevent it from closing, so consul.grpc.client.connections goes from x to x+1 and keeps growing. As long as the old subscription is still valid, in other words the corresponding cache entry has not been evicted, the old gRPC connection is held open.

In our deployment, a single client agent can easily have more than 300 gRPC connections.
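
A rough way to cross-check the metric from the OS side is to count established TCP connections from the client agent to the servers' RPC port; something like this, assuming lsof is available and the default port 8300:

# The gRPC subscriptions are multiplexed over the server RPC port (8300 by
# default, as seen in the "server1:8300" TRACE line above).
lsof -nP -iTCP -sTCP:ESTABLISHED | grep consul | grep -c ':8300'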


Reproduction Steps

  1. Set up a Consul cluster with 3 servers and 1 client. Nothing special. I tested on 1.14.6 and 1.15.2; both reproduce the issue.
  2. Configure telemetry for the Consul client and monitor the consul.grpc.client.connections metric (see the metrics-polling sketch after this list).
  3. Run consul monitor -log-level trace | grep -i channel on the client side to catch the gRPC rebalance events.
  4. Create 100 fake services on the client: for i in $(seq 100); do consul services register -name test$i; done.
  5. Wait for the "Subchannel Connectivity change to READY" TRACE log line, then start a new watch: consul watch -type=service -service=test1 cat.
  6. Repeat step 5: start a new consul watch after each gRPC rebalance happens.
  7. Check the telemetry and observe the growth of connections.
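
A minimal sketch for step 2, assuming the client exposes the default HTTP API on 127.0.0.1:8500 and jq is installed; it polls the agent's built-in metrics endpoint instead of setting up a full statsd/prometheus pipeline:

# Print the gRPC client connection gauge every 10 seconds; its name contains
# "grpc.client.connections".
while true; do
  curl -s 'http://127.0.0.1:8500/v1/agent/metrics' \
    | jq '.Gauges[] | select(.Name | contains("grpc.client.connections"))'
  sleep 10
done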

Expected behavior

The old gRPC connection should be released and closed, and the overall consul.grpc.client.connections count should stay low.

Consul info for both Client and Server

Any Consul server/client versions that support streaming.

Operating system and Environment details

This bug is OS agnostic.

weichuliu commented 1 year ago

When we stop sending blocking queries to a Consul client, it takes 20 minutes for the materialized view (submatview) caches to be gradually evicted, and we can see grpc.client.connections drop from 400 to fewer than 10 during that period.

huikang commented 1 year ago

@weichuliu, thanks for taking the time to investigate this issue and for the deep analysis. We will look into it ASAP.

weichuliu commented 1 year ago

Thanks @huikang, I'm looking forward to the fix.

This one is pretty annoying for us, and I believe it's annoying for anyone that uses the streaming feature.

jmurret commented 1 year ago

Hi @weichuliu, I cannot recreate this on the latest Consul or on Consul 1.15.2. I created watches after each rebalancing and also waited for the rebalancing to get back to the original server. The grpc.client.connection.count never got over 2 for any server or client, including the client I was running watches from.

I used the following configuration on Kubernetes:

global:
  enabled: true
  name: consul
  datacenter: dc1
  metrics:
    enabled: true
    enableAgentMetrics: true
    agentMetricsRetentionTime: "1m"
  image: hashicorp/consul:1.15.2
  imageK8S: hashicorp/consul-k8s-control-plane:1.1
server:
  replicas: 3
client:
  enabled: true
  extraConfig: |
    {"use_streaming_backend": true}
ui:
  enabled: true
  metrics:
    enabled: true
    provider: "prometheus"
    baseURL: http://prometheus-server.default.svc.cluster.local
connectInject:
  enabled: true
  default: true
controller:
  enabled: true

Could you provide the simplest configuration that you used for the recreation steps?

weichuliu commented 1 year ago

@jmurret

I didn't use k8s for testing, so I can't debug your deployment. Here is how I ran it on my local machine (an M1 MacBook, but that shouldn't matter here).

consul1.hcl:

node_name = "node1"
server         = true
client_addr    = "0.0.0.0"
bind_addr      = "127.0.0.1"

datacenter         = "test"
primary_datacenter = "test"
bootstrap_expect   = 3
encrypt            = "pEqoFn2jJjX1wLxkSbzdD+j8IccoRW9Cwn3uUA9Wfqs="
data_dir           = "/tmp/consul-data-1"
rejoin_after_leave = true

acl {
  enabled                  = false
}

ports {
  dns  = -1
  http = 7400
  https = -1
  grpc = 7402
  grpc_tls = 7403
  server = 7340
  serf_lan = 7341
  serf_wan = -1
}

retry_join = [
  "127.0.0.1:7341",
  "127.0.0.1:7351",
  "127.0.0.1:7361",
]

rpc {
  enable_streaming = true
}
use_streaming_backend = true

consul2.hcl:

node_name = "node2"
server         = true
client_addr    = "0.0.0.0"
bind_addr      = "127.0.0.1"

datacenter         = "test"
primary_datacenter = "test"
bootstrap_expect   = 3
encrypt            = "pEqoFn2jJjX1wLxkSbzdD+j8IccoRW9Cwn3uUA9Wfqs="
data_dir           = "/tmp/consul-data-2"
rejoin_after_leave = true

acl {
  enabled                  = false
}

ports {
  dns  = -1
  http = 7500
  https = -1
  grpc = 7502
  grpc_tls = 7503
  server = 7350
  serf_lan = 7351
  serf_wan = -1
}

retry_join = [
  "127.0.0.1:7341",
  "127.0.0.1:7351",
  "127.0.0.1:7361",
]

rpc {
  enable_streaming = true
}
use_streaming_backend = true

consul3.hcl:

node_name = "node3"
server         = true
client_addr    = "0.0.0.0"
bind_addr      = "127.0.0.1"

datacenter         = "test"
primary_datacenter = "test"
bootstrap_expect   = 3
encrypt            = "pEqoFn2jJjX1wLxkSbzdD+j8IccoRW9Cwn3uUA9Wfqs="
data_dir           = "/tmp/consul-data-3"
rejoin_after_leave = true

acl {
  enabled                  = false
}

ports {
  dns  = -1
  http = 7600
  https = -1
  grpc = 7602
  grpc_tls = 7603
  server = 7360
  serf_lan = 7361
  serf_wan = -1
}

retry_join = [
  "127.0.0.1:7341",
  "127.0.0.1:7351",
  "127.0.0.1:7361",
]

rpc {
  enable_streaming = true
}
use_streaming_backend = true

client.hcl:

node_name = "client"
server         = false
client_addr    = "0.0.0.0"
bind_addr      = "127.0.0.1"

datacenter         = "test"
encrypt            = "pEqoFn2jJjX1wLxkSbzdD+j8IccoRW9Cwn3uUA9Wfqs="
data_dir           = "/tmp/consul-data-client"
rejoin_after_leave = true

acl {
  enabled                  = false
}

telemetry {
  disable_hostname = true
  dogstatsd_addr   = "127.0.0.1:8125"
}

retry_join = [
  "127.0.0.1:7341",
  "127.0.0.1:7351",
  "127.0.0.1:7361",
]

log_level            = "TRACE"

rpc {
  enable_streaming = true
}
use_streaming_backend = true

The Consul version I used:

$ consul version
Consul v1.15.2
Revision 5e08e229
Build Date 2023-03-30T17:51:19Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

To start the cluster, I ran these commands (each in a separate terminal).

consul agent -server -config-file=./consul1.hcl
consul agent -server -config-file=./consul2.hcl
consul agent -server -config-file=./consul3.hcl
consul agent -config-file=./client.hcl
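
Optionally, confirm the cluster formed before going further (server1 exposes its HTTP API on port 7400 per consul1.hcl above; the client uses the default 8500):

# Both commands should list node1, node2, node3 and the client as alive.
consul members -http-addr=http://127.0.0.1:7400   # via server1
consul members                                    # via the client agent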

To catch the client metrics on UDP 8125, I personally use datadog-mock.

$ cd src/datadog-mock
$ go build
$ ./datadog-mock | grep grpc.client.connections
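
If you don't want to build datadog-mock, any UDP listener works as a rough substitute, e.g. with a BSD-style netcat (as shipped on macOS):

# Dump the raw dogstatsd packets from the client agent and keep only the
# connection gauge.
nc -u -l 8125 | grep --line-buffered grpc.client.connections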

I then registered 100 services to the client.

Note that the Consul client listens on the default localhost:8500, so the consul command talks to it directly.

for i in $(seq 100); do consul services register -name test$i; done
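
A quick sanity check that the registrations landed:

# Should print test1..test100 plus the built-in consul service (101 lines).
consul catalog services | wc -l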

As the last step, I ran this Python script (python3 many-consul-watches.py) to create a new watch every 20 seconds:

import time
import signal
import subprocess
import random

consul_subprocesses = []

# On Ctrl-C, forward SIGINT to every spawned `consul watch` process, then exit.
def signal_handler(sig, frame):
    for p in consul_subprocesses:
        p.send_signal(signal.SIGINT)
    time.sleep(1)
    exit()
signal.signal(signal.SIGINT, signal_handler)

# Every 20 seconds, start a new `consul watch` against a random test service.
while True:
    time.sleep(20)
    print(f'{time.ctime()} new watch')
    p = subprocess.Popen(['consul', 'watch', '-type=service', f'-service=test{random.randint(1,100)}', 'cat'])
    consul_subprocesses.append(p)

After 20 minutes, here are the metrics:

[screenshot: consul.grpc.client.connections metrics after 20 minutes]

weichuliu commented 1 year ago

I admit that I tried to follow the steps I wrote at the beginning and couldn't reproduce the issue (the connections hit at most 3); I'm not sure what step I missed there.

However, by brute-force spawning a new consul watch every 20 seconds against many services and letting it run for a while, the issue is reproducible on 1.15.2.

weichuliu commented 1 year ago

Tested on 1.16.0 and can reproduce it, though it takes longer.

[screenshot: consul.grpc.client.connections metrics on 1.16.0]

weichuliu commented 1 year ago

@jmurret It's been a while since you asked about reproducing the issue.

I wonder if you were able to reproduce it with the steps above?