hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

memory leak in grpc rebalancer #12288

Open fredwangwang opened 2 years ago

fredwangwang commented 2 years ago

Overview of the Issue

We noticed a suspected memory leak in our clusters:

[memory usage graph]

After looking into the heap dumps (taken with `consul debug`) from two nodes with different boot times (several hours vs. ~a month), here is one area where we noticed a big difference: [heap profile screenshots]

It looks like the underlying gRPC load balancer is not releasing the stale connections.
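For context, here is a rough conceptual sketch (not Consul's actual code) of what a rebalance looks like from gRPC's point of view: the resolver pushes a freshly shuffled server list to the ClientConn, the balancer connects to the newly preferred address, and the subchannels for addresses that are no longer preferred are supposed to be torn down. The `rebalance` function and its parameters below are invented for illustration; only the `google.golang.org/grpc/resolver` types are real.

```go
package sketch

import (
	"math/rand"

	"google.golang.org/grpc/resolver"
)

// rebalance shuffles the known server addresses and reports the new ordering
// to gRPC. In Consul the servers and the resolver.ClientConn come from the
// ServerResolverBuilder; here they are just parameters for illustration.
func rebalance(cc resolver.ClientConn, servers []string) {
	shuffled := append([]string(nil), servers...)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})

	addrs := make([]resolver.Address, 0, len(shuffled))
	for _, s := range shuffled {
		addrs = append(addrs, resolver.Address{Addr: s})
	}

	// Each update can cause the balancer to dial the newly preferred address;
	// the subchannels for the previous addresses should be shut down and freed.
	cc.UpdateState(resolver.State{Addresses: addrs})
}
```

If the balancer fails to release those stale subchannels, every rebalance leaves another connection object behind, which would match the growing `conns` count in the test below.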

Reproduction Steps

The real way to reproduce this is to let a node run for a long time and watch the memory usage increase.

But here is a test case trying to reproduce the issue:

Added to `agent/grpc/client_test.go`:

```go
func TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000(t *testing.T) {
	count := 5
	res := resolver.NewServerResolverBuilder(newConfig(t))
	registerWithGRPC(t, res)
	pool := NewClientConnPool(ClientConnPoolConfig{
		Servers:               res,
		UseTLSForDC:           useTLSForDcAlwaysTrue,
		DialingFromServer:     true,
		DialingFromDatacenter: "dc1",
	})

	for i := 0; i < count; i++ {
		name := fmt.Sprintf("server-%d", i)
		srv := newTestServer(t, name, "dc1", nil)
		res.AddServer(srv.Metadata())
		t.Cleanup(srv.shutdown)
	}

	conn, err := pool.ClientConn("dc1")
	require.NoError(t, err)
	client := testservice.NewSimpleClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	t.Cleanup(cancel)

	_, err = client.Something(ctx, &testservice.Req{})
	require.NoError(t, err)

	t.Run("rebalance the dc", func(t *testing.T) {
		// Rebalance is random, but if we repeat it a few times it should give us a
		// new server.
		attempts := 5000
		for i := 0; i < attempts; i++ {
			res.NewRebalancer("dc1")()
			_, err := client.Something(ctx, &testservice.Req{})
			require.NoError(t, err)
		}

		t.Log("*** number of connections in mem:", reflect.ValueOf(client).Elem().
			FieldByName("cc").Elem().
			FieldByName("conns").Len(), "***")

		time.Sleep(10 * time.Second)
	})
}
```

Output:

```
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000/rebalance_the_dc
    client_test.go:345: *** number of connections in mem: 7946 ***
--- PASS: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000 (11.87s)
```

cd agent/grpc && go test -c && ./grpc.test -test.v -test.run '^TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000$'

Changing the attempts from 1 to 5000 increases the memory usage of the test program from ~4 MB to ~28 MB. 5000 was chosen to roughly simulate the number of rebalances in 10 days (60 min × 24 h × 10 d / 2.5 min ≈ 5760), where 2.5 minutes is the expected rebalance interval.
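As a side note, here is a minimal sketch of how the heap growth could also be logged from inside the test using the standard `runtime.MemStats` counters, instead of watching the process from outside. `logHeap` is a hypothetical helper, not part of the original test, and assumes it lives next to the test in the same package:

```go
package grpc

import (
	"runtime"
	"testing"
)

// logHeap logs live-heap usage so the growth shows up directly in the test
// output instead of only in external process metrics.
func logHeap(t *testing.T, label string) {
	runtime.GC() // settle the heap so HeapAlloc mostly reflects live objects
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	t.Logf("%s: HeapAlloc=%d MiB HeapObjects=%d",
		label, m.HeapAlloc/(1024*1024), m.HeapObjects)
}
```

Calling it once before and once after the rebalance loop should show `HeapAlloc` and `HeapObjects` growing roughly in step with the leaked connections.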

Consul info for both Client and Server

v1.10.3 linux amd64

Amier3 commented 2 years ago

Hey @fredwangwang

Thanks for bringing this to our attention & for providing the heap dumps without us asking (that's always helpful 😅). We'll look into it and keep this issue updated when we learn more.

boxofrad commented 2 years ago

Thanks again for the report @fredwangwang 🙇🏻‍♂️

I think we've tracked this down to a bug in gRPC that's fixed by https://github.com/grpc/grpc-go/pull/4347 (available in v1.37.1 onwards).

Unfortunately, we're not yet able to upgrade to this version because of a related dependency (see: https://github.com/hashicorp/consul/issues/10471) — but we're working on making the upgrade possible!

fredwangwang commented 2 years ago

@boxofrad Thanks for the update! Hope this can be resolved soon 🙂

mochery commented 2 years ago

Hi @boxofrad, are there any updates on this? We're tracking down a memory leak in our environment and I think it may also be related to this. I see the dependency issue you mentioned is now closed. I checked the latest version and see grpc-go 1.36.0 referenced. Thanks.
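(In case it's useful for anyone else checking: besides reading `go.mod`, `go version -m <path-to-consul-binary>` prints the module versions a binary was actually built with. The little program below does the same thing programmatically; it's just a convenience sketch, assuming a Go 1.18+ toolchain, with the binary path passed as the first argument.)

```go
package main

import (
	"debug/buildinfo"
	"fmt"
	"log"
	"os"
)

func main() {
	// os.Args[1] is the path to a consul binary, e.g. /usr/local/bin/consul.
	info, err := buildinfo.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	// Print the grpc-go version the binary was built against.
	for _, dep := range info.Deps {
		if dep.Path == "google.golang.org/grpc" {
			fmt.Println("grpc-go:", dep.Version)
		}
	}
}
```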

dekimsey commented 2 years ago

We are seeing Consul agent memory leaks as well on 1.11.5. I'm going to try to gather more specific information, but we have definitely identified the leak as coming from the consul agent.

The following is a high-level memory utilization graph of a few of our services (sorry, it's from CloudWatch): [memory utilization graph]

Unfortunately I haven't been in a position to generate a heap dump yet; I'm going to try to do that. The only observations I can offer are that the terminator has only a few services registered to it and few or no health checks, while the "vhost" services have a number of upstreams configured via Connect and actively validate them. The red one, vhost-proxy-c, has the largest set.
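(In case it helps anyone in the same spot: `consul debug`, mentioned earlier in the thread, captures pprof data, and the agent also serves the Go pprof endpoints over its HTTP API when debugging is enabled. The snippet below is a rough sketch of grabbing a heap profile that way; the `enable_debug = true` requirement and the default `127.0.0.1:8500` address are assumptions about the agent's configuration.)

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Assumes the agent's pprof endpoints are reachable, e.g. with
	// enable_debug = true and the default HTTP address.
	resp, err := http.Get("http://127.0.0.1:8500/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Save the heap profile to a local file for later inspection.
	out, err := os.Create("consul-heap.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote consul-heap.prof")
}
```

The resulting file can then be opened with `go tool pprof consul-heap.prof` to see which allocations are being retained.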

I did look into generating a dev build based on 1.12.0 that used grpc/grpc-go@v1.37.1, but the effect of updating that appeared to spiral out pretty badly, and it's not clear how much of that is me messing it up (Envoy gRPC updates, deprecated modules in go-discover, etc.). If there's a dev build to test, please let me know; I'd be happy to try testing it where I can.

dekimsey commented 2 years ago

So that was indeed operator error on my part in updating the dependency. I wasn't able to update 1.11 to use grpc/grpc-go@v1.37.1, as it triggered a cascade into go-discover. I've deployed this development build to our dev stack to get some wider burn-in, but I'll be submitting a PR with the changes today.

Unfortunately for me, my EC2 instances are stuck on 1.11 and I'm not sure how to backport this or if it's feasible.