fredwangwang opened this issue 2 years ago
Hey @fredwangwang
Thanks for bringing this to our attention & for providing the heap dumps without us asking (that's always helpful 😅). We'll look into it and keep this issue updated when we learn more.
Thanks again for the report @fredwangwang 🙇🏻♂️
I think we've tracked this down to a bug in gRPC that's fixed by https://github.com/grpc/grpc-go/pull/4347 (available in v1.37.1 onwards).
Unfortunately, we're not yet able to upgrade to this version because of a related dependency (see: https://github.com/hashicorp/consul/issues/10471) — but we're working on making the upgrade possible!
@boxofrad Thanks for the update! Hope this can be resolved soon 🙂
Hi @boxofrad, are there any updates on this? We're tracking down a memory leak in our environment and I think it may also be related to this. I see the dependency issue you mentioned is now closed, but checking the latest version I still see grpc-go 1.36.0 referenced. Thanks.
We are seeing Consul agent memory leaks as well on 1.11.5. I'm going to try to gather more specific information, but we have definitely identified the leak as coming from the Consul agent.
The following is a high-level Memory Utilization graph of a few of our services (sorry, it's from Cloudwatch).
Unfortunately I haven't been in a position to generate a heap dump yet, but I'm going to try. The only observations I can offer: the terminator has only a few services registered to it and few or no healthchecks, while the "vhost" services have a number of upstreams configured via Connect and actively validate them. The red one, vhost-proxy-c, has the largest set.
I did look into generating a dev build based on 1.12.0 that used grpc/grpc-go@v1.37.1, but the effect of updating that appeared to spiral out pretty badly, and it's not clear how much of that is me messing it up (Envoy gRPC updates, deprecated modules in go-discover, etc.). If there's a dev build to test, please let me know; I'd be happy to try testing it where I can.
So that was indeed operator error on my part in updating the dependency. I still wasn't able to update 1.11 to use grpc/grpc-go@v1.37.1, as it triggered a cascade into go-discover, but I've deployed this development build to our dev stack to get some wider burn-in, and I'll be submitting a PR with the changes today.
Unfortunately for me, my EC2 instances are stuck on 1.11, and I'm not sure how to backport this or whether it's feasible.
Overview of the Issue
Noticed a suspected memory leak in our clusters
After looking into the heap dumps (using `consul debug`) from two nodes with different boot times (several hours vs. ~a month), here is one part where we noticed a big difference: it looks like the underlying gRPC load balancer is not releasing the stale connections.
Reproduction Steps
The real way to reproduce is to let a node run for a long time and watch its memory usage increase.
But here is a test case trying to reproduce the issue:
Added to `agent/grpc/client_test.go`:
```go
func TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000(t *testing.T) {
	count := 5
	res := resolver.NewServerResolverBuilder(newConfig(t))
	registerWithGRPC(t, res)
	pool := NewClientConnPool(ClientConnPoolConfig{
		Servers:               res,
		UseTLSForDC:           useTLSForDcAlwaysTrue,
		DialingFromServer:     true,
		DialingFromDatacenter: "dc1",
	})

	for i := 0; i < count; i++ {
		name := fmt.Sprintf("server-%d", i)
		srv := newTestServer(t, name, "dc1", nil)
		res.AddServer(srv.Metadata())
		t.Cleanup(srv.shutdown)
	}

	conn, err := pool.ClientConn("dc1")
	require.NoError(t, err)
	client := testservice.NewSimpleClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	t.Cleanup(cancel)

	_, err = client.Something(ctx, &testservice.Req{})
	require.NoError(t, err)

	t.Run("rebalance the dc", func(t *testing.T) {
		// Rebalance is random, but if we repeat it a few times it should give us a
		// new server.
		attempts := 5000
		for i := 0; i < attempts; i++ {
			res.NewRebalancer("dc1")()
			_, err := client.Something(ctx, &testservice.Req{})
			require.NoError(t, err)
		}

		t.Log("*** number of connections in mem:",
			reflect.ValueOf(client).Elem().
				FieldByName("cc").Elem().
				FieldByName("conns").Len(),
			"***")

		time.Sleep(10 * time.Second)
	})
}
```

Run with:

```
cd agent/grpc && go test -c && ./grpc.test -test.v -test.run '^TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000$'
```

Output:

```
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000/rebalance_the_dc
    client_test.go:345: *** number of connections in mem: 7946 ***
--- PASS: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000 (11.87s)
```
Changing `attempts` from 1 to 5000 increases the memory use of the test program from 4 MB to 28 MB. 5000 was chosen to simulate the number of rebalances in 10 days, ~(60m × 24h × 10d / 2.5m), where 2.5 minutes is the expected rebalance interval.

Consul info for both Client and Server
v1.10.3 linux amd64