weichuliu opened 1 year ago
When we stop sending blocking queries to a Consul client, it takes 20 minutes to gradually evict the streaming materialized-view (submatview) caches, and we can see `grpc.client.connections` gradually drop from 400 to fewer than 10 during that period.
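For anyone who wants to watch this drain without a statsd pipeline: the same gauge is also exposed by the agent's HTTP metrics endpoint (`GET /v1/agent/metrics`), which returns a JSON object with a `Gauges` list. A minimal sketch of pulling one gauge out of that payload; the `gauge_value` helper and the sample payload are mine, not part of Consul:

```python
import json

def gauge_value(metrics_json, name):
    # /v1/agent/metrics returns {"Gauges": [{"Name": ..., "Value": ...}, ...]};
    # scan the list for the gauge we care about.
    for gauge in metrics_json.get('Gauges', []):
        if gauge['Name'] == name:
            return gauge['Value']
    return None

# Sample payload shaped like the agent's output (the value is invented).
sample = json.loads(
    '{"Gauges": [{"Name": "consul.grpc.client.connections", "Value": 400}]}')
print(gauge_value(sample, 'consul.grpc.client.connections'))  # → 400
```

In a live setup you would fetch the JSON from `http://127.0.0.1:8500/v1/agent/metrics` in a loop and log the value over time.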
@weichuliu , thanks for your time investigating this issue and the deep analysis. We will look into this issue ASAP.
Thanks @huikang, I'm looking forward to the fix.
This one is pretty annoying for us, and I believe it's annoying for anyone that uses the streaming feature.
Hi @weichuliu, I cannot recreate this on the latest Consul or on Consul 1.15.2. I created watches after each rebalancing and also waited for the rebalancing to get back to the original server. The `grpc.client.connection.count` never got over 2 for any server or client, including the client I was running watches from.
I used the following configuration on Kubernetes:
```yaml
global:
  enabled: true
  name: consul
  datacenter: dc1
  metrics:
    enabled: true
    enableAgentMetrics: true
    agentMetricsRetentionTime: "1m"
  image: hashicorp/consul:1.15.2
  imageK8S: hashicorp/consul-k8s-control-plane:1.1
server:
  replicas: 3
client:
  enabled: true
  extraConfig: |
    {"use_streaming_backend": true}
ui:
  enabled: true
  metrics:
    enabled: true
    provider: "prometheus"
    baseURL: http://prometheus-server.default.svc.cluster.local
connectInject:
  enabled: true
  default: true
controller:
  enabled: true
```
Could you provide the simplest configuration that you used for the recreation steps?
@jmurret I didn't use k8s for testing, so I can't debug your deployment. Here is how I ran it on my local machine (an M1 MacBook, but that shouldn't matter here).
`consul1.hcl`:

```hcl
node_name = "node1"
server = true
client_addr = "0.0.0.0"
bind_addr = "127.0.0.1"
datacenter = "test"
primary_datacenter = "test"
bootstrap_expect = 3
encrypt = "pEqoFn2jJjX1wLxkSbzdD+j8IccoRW9Cwn3uUA9Wfqs="
data_dir = "/tmp/consul-data-1"
rejoin_after_leave = true

acl {
  enabled = false
}

ports {
  dns = -1
  http = 7400
  https = -1
  grpc = 7402
  grpc_tls = 7403
  server = 7340
  serf_lan = 7341
  serf_wan = -1
}

retry_join = [
  "127.0.0.1:7341",
  "127.0.0.1:7351",
  "127.0.0.1:7361",
]

rpc {
  enable_streaming = true
}

use_streaming_backend = true
```
`consul2.hcl`:

```hcl
node_name = "node2"
server = true
client_addr = "0.0.0.0"
bind_addr = "127.0.0.1"
datacenter = "test"
primary_datacenter = "test"
bootstrap_expect = 3
encrypt = "pEqoFn2jJjX1wLxkSbzdD+j8IccoRW9Cwn3uUA9Wfqs="
data_dir = "/tmp/consul-data-2"
rejoin_after_leave = true

acl {
  enabled = false
}

ports {
  dns = -1
  http = 7500
  https = -1
  grpc = 7502
  grpc_tls = 7503
  server = 7350
  serf_lan = 7351
  serf_wan = -1
}

retry_join = [
  "127.0.0.1:7341",
  "127.0.0.1:7351",
  "127.0.0.1:7361",
]

rpc {
  enable_streaming = true
}

use_streaming_backend = true
```
`consul3.hcl`:

```hcl
node_name = "node3"
server = true
client_addr = "0.0.0.0"
bind_addr = "127.0.0.1"
datacenter = "test"
primary_datacenter = "test"
bootstrap_expect = 3
encrypt = "pEqoFn2jJjX1wLxkSbzdD+j8IccoRW9Cwn3uUA9Wfqs="
data_dir = "/tmp/consul-data-3"
rejoin_after_leave = true

acl {
  enabled = false
}

ports {
  dns = -1
  http = 7600
  https = -1
  grpc = 7602
  grpc_tls = 7603
  server = 7360
  serf_lan = 7361
  serf_wan = -1
}

retry_join = [
  "127.0.0.1:7341",
  "127.0.0.1:7351",
  "127.0.0.1:7361",
]

rpc {
  enable_streaming = true
}

use_streaming_backend = true
```
`client.hcl`:

```hcl
node_name = "client"
server = false
client_addr = "0.0.0.0"
bind_addr = "127.0.0.1"
datacenter = "test"
encrypt = "pEqoFn2jJjX1wLxkSbzdD+j8IccoRW9Cwn3uUA9Wfqs="
data_dir = "/tmp/consul-data-client"
rejoin_after_leave = true

acl {
  enabled = false
}

telemetry {
  disable_hostname = true
  dogstatsd_addr = "127.0.0.1:8125"
}

retry_join = [
  "127.0.0.1:7341",
  "127.0.0.1:7351",
  "127.0.0.1:7361",
]

log_level = "TRACE"

rpc {
  enable_streaming = true
}

use_streaming_backend = true
```
The Consul version I use:

```
$ consul version
Consul v1.15.2
Revision 5e08e229
Build Date 2023-03-30T17:51:19Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
```
To start the cluster, I ran these (each line in a separate terminal):

```shell
consul agent -server -config-file=./consul1.hcl
consul agent -server -config-file=./consul2.hcl
consul agent -server -config-file=./consul3.hcl
consul agent -config-file=./client.hcl
```
To catch the client metrics on UDP 8125, I personally use datadog-mock:

```shell
$ cd src/datadog-mock
$ go build
$ ./datadog-mock | grep grpc.client.connections
```
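If you don't want to build datadog-mock, a few lines of Python can stand in for it. DogStatsD gauges arrive over UDP as plain text datagrams of the form `metric.name:value|g[|@rate][|#tags]`, so a tiny listener plus parser is enough; this sketch (the function names and the filtering are mine) assumes only that wire format:

```python
import socket

def parse_gauge(line):
    # A DogStatsD line looks like "metric.name:value|type[|@rate][|#tags]".
    name, rest = line.split(':', 1)
    value, kind = rest.split('|')[:2]
    return name, float(value), kind

def listen(port=8125, wanted='grpc.client.connections'):
    # Bind the same UDP port the telemetry block in client.hcl points at,
    # and print every matching gauge that arrives.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('127.0.0.1', port))
    while True:
        data, _ = sock.recvfrom(4096)
        for line in data.decode().splitlines():
            name, value, kind = parse_gauge(line)
            if wanted in name and kind == 'g':
                print(name, value)

# Example datagram, parsed offline:
print(parse_gauge('consul.grpc.client.connections:400|g'))
```

Run `listen()` instead of datadog-mock while the cluster is up to see the gauge move.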
I then registered 100 services on the client. Note that the Consul client listens on the default localhost:8500, so the `consul` command talks directly to it.

```shell
for i in $(seq 100); do consul services register -name test$i; done
```
As the last step, I ran this Python script (`python3 many-consul-watches.py`) to create a new watch every 20s:

```python
import time
import signal
import subprocess
import random

consul_subprocesses = []

def signal_handler(sig, frame):
    # On Ctrl-C, stop all spawned `consul watch` processes before exiting.
    for p in consul_subprocesses:
        p.send_signal(signal.SIGINT)
    time.sleep(1)
    exit()

signal.signal(signal.SIGINT, signal_handler)

while True:
    time.sleep(20)
    print(f'{time.ctime()} new watch')
    p = subprocess.Popen(['consul', 'watch', '-type=service',
                          f'-service=test{random.randint(1, 100)}', 'cat'])
    consul_subprocesses.append(p)
```
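Since the streaming backend kicks in for ordinary blocking queries, the same load can also be generated without `consul watch`, by issuing health reads with `index`/`wait` parameters against the client agent. A sketch of building such a request URL; the helper name is mine, the endpoint and parameters are the standard blocking-query API:

```python
import urllib.parse

def blocking_query_url(service, index=0, wait='5m',
                       base='http://127.0.0.1:8500'):
    # A blocking query is an ordinary health read plus index/wait params;
    # with use_streaming_backend the client serves it from a streaming view.
    params = urllib.parse.urlencode({'index': index, 'wait': wait})
    return f'{base}/v1/health/service/{service}?{params}'

print(blocking_query_url('test1', index=42))
# → http://127.0.0.1:8500/v1/health/service/test1?index=42&wait=5m
```

Fetching such a URL in a loop (feeding each response's `X-Consul-Index` back as `index`) keeps a subscription alive the same way a watch does.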
After 20 minutes, here are the metrics.
I admit that I tried to follow the steps I wrote in the beginning and couldn't reproduce the issue (the connections hit 3 at most); I forget which step I missed there.
However, with the brute-force approach of spawning a new `consul watch` every 20 seconds, plus many services, plus letting the whole thing run for a while, the issue looks reproducible on 1.15.2.
Tested on 1.16.0 and I can reproduce it there too, though it takes longer.
@jmurret It's been a while since you asked about reproducing the issue.
I wonder if you were able to reproduce it with the steps above?
Overview of the Issue

The Consul client is leaking gRPC connections (one every few minutes) when serving streaming requests.

On 1.10+, with the streaming backend enabled, the Consul client agent handles blocking queries with streaming: it maintains a cache and establishes a "subscription" to changes, using gRPC to talk to the Consul servers.

In our deployment, our Consul client agents constantly receive a lot of blocking queries. All subscriptions are supposed to use a single gRPC connection. However, we see `consul.grpc.client.connections` growing over time.

In the TRACE log, we found that the gRPC router manager is constantly (every few minutes) refreshing the gRPC connection, i.e. creating a new gRPC connection to replace the old one. During such an event, `consul.grpc.client.connections` is usually `x`, then `x+1`, then back to `x`, which is the new connection replacing the old one. However, when there are active blocking queries, the old subscriptions hold the old gRPC connection and prevent it from closing, so `consul.grpc.client.connections` goes from `x` to `x+1` and keeps growing. As long as the old subscription is still valid, in other words the corresponding cache entry is not evicted, the old gRPC connection is held open. In our deployment, a single client agent can easily have more than 300 gRPC connections.

Reproduction Steps

1. Tested on `1.14.6` and `1.15.2`; both can reproduce the issue.
2. Enable `telemetry` for the Consul client, and monitor the `consul.grpc.client.connections` metric.
3. Run `consul monitor -log-level trace | grep -i channel` on the client side to catch the gRPC rebalance events.
4. Register some services: `for i in $(seq 100); do consul services register -name test$i; done`.
5. Wait for a `Subchannel Connectivity change to READY` TRACE log, then start a new watch: `consul watch -type=service -service=test1 cat`.
6. Start a new `consul watch` after each gRPC rebalance happens.

Expected behavior

The old gRPC connection should be released and closed, and the overall `grpc.client.connections` count should always stay low.

Consul info for both Client and Server

Any version, as long as the Consul server/client support streaming.

Operating system and Environment details

This bug is OS agnostic.
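To make the accounting in the overview concrete, here is a toy model (my own illustration, not Consul code) of why the connection count only ratchets upward while active subscriptions pin the old connection at each rebalance:

```python
class ClientModel:
    # Toy model of the gRPC router behaviour described above.
    def __init__(self):
        self.connections = []   # open gRPC connections
        self.current = None     # connection new subscriptions use

    def open_connection(self):
        conn = {'subscribers': 0}
        self.connections.append(conn)
        self.current = conn
        return conn

    def subscribe(self):
        # A streaming blocking query pins whichever connection is current.
        self.current['subscribers'] += 1

    def rebalance(self):
        # The router picks a new connection; the old one closes only if
        # no subscription is still holding it.
        old = self.current
        self.open_connection()
        if old is not None and old['subscribers'] == 0:
            self.connections.remove(old)

client = ClientModel()
client.open_connection()
for _ in range(5):
    client.subscribe()        # an active blocking query on the current conn
    client.rebalance()        # periodic gRPC rebalance
print(len(client.connections))  # → 6: each old connection stays pinned
```

With no active subscriptions, every `rebalance()` removes the old connection and the count stays at 1; with one subscription per rebalance interval, the count grows by one each time, which matches the `x` → `x+1` pattern in the TRACE logs.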