liftbridge-io / go-liftbridge

Go client for Liftbridge. https://github.com/liftbridge-io/liftbridge
Apache License 2.0

Fetching metadata on subscribe causing high memory on client and server #100

Open cruickshankpg opened 3 years ago

cruickshankpg commented 3 years ago

When subscribing to a stream that is not in the metadataCache, the client refreshes the entire cache, which means the FetchMetadata RPC is called. This can happen if the stream is new or does not exist.

https://github.com/liftbridge-io/go-liftbridge/blob/c842e2f19749a5aa5a7664b1a6d4a366034d8f92/v2/client.go#L1518

We only create a stream when a message is published, to save on unnecessary creates. This means that if multiple subscribers attempt to subscribe before the publisher has published, a lot of FetchMetadata RPCs are made. When there are thousands of streams in the Liftbridge cluster (I had 3000 when I hit this), the metadata gets very big, and marshalling it so frequently caused one of my Liftbridge servers to become unresponsive and used up all the memory on my client.
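
One possible way to blunt the burst described above is to let subscribers that miss the cache at the same time share a single in-flight fetch. The following is a minimal sketch of that idea, not code from go-liftbridge: `refreshCoalescer`, `call`, and `simulate` are hypothetical names, and the `fetch` field merely stands in for the FetchMetadata RPC.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// call tracks one in-flight fetch and the callers waiting on it.
type call struct {
	done    chan struct{}
	err     error
	waiters int // callers that joined instead of starting their own fetch
}

// refreshCoalescer is a hypothetical helper (not part of go-liftbridge) that
// lets overlapping callers share a single metadata fetch.
type refreshCoalescer struct {
	mu       sync.Mutex
	inflight *call
	fetch    func() error // stands in for the expensive FetchMetadata RPC
}

// Refresh runs fetch at most once for all callers that overlap in time.
func (r *refreshCoalescer) Refresh() error {
	r.mu.Lock()
	if c := r.inflight; c != nil {
		c.waiters++
		r.mu.Unlock()
		<-c.done // wait for the fetch that is already running
		return c.err
	}
	c := &call{done: make(chan struct{})}
	r.inflight = c
	r.mu.Unlock()

	c.err = r.fetch()

	r.mu.Lock()
	r.inflight = nil
	r.mu.Unlock()
	close(c.done)
	return c.err
}

// simulate starts n concurrent subscribers that all miss the cache and
// returns how many fetches were actually issued.
func simulate(n int) int {
	if n <= 0 {
		return 0
	}
	fetches := 0
	release := make(chan struct{})
	r := &refreshCoalescer{}
	r.fetch = func() error {
		fetches++ // fetch calls never overlap, so a plain counter is safe
		<-release // keep the fetch open, like a slow RPC
		return nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = r.Refresh()
		}()
	}
	// Hold the fetch open until the other n-1 callers have joined it.
	for {
		r.mu.Lock()
		joined := r.inflight != nil && r.inflight.waiters == n-1
		r.mu.Unlock()
		if joined {
			break
		}
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return fetches
}

func main() {
	fmt.Println("fetches issued for 100 subscribers:", simulate(100))
}
```

Because `simulate` holds the fake fetch open until every other caller has joined it, all 100 subscribers are served by one fetch. Callers that arrive after a fetch has completed start a new one, so this only reduces the number of full-metadata responses; it does not make each response any smaller.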

Our client service holds Liftbridge client connections to multiple Liftbridge clusters, so even storing the full metadata for each cluster takes more memory than we would like. Keeping track of the brokers for a cluster is obviously necessary but do we need all the streams? Could individual stream partitions be fetched into the cache on demand?

cruickshankpg commented 3 years ago

I captured some alloc_space pprof profiles to work out why my servers were broken:

liftbridge allocs profile

File: liftbridge
Type: alloc_space
Time: Dec 8, 2020 at 12:36pm (GMT)
Showing nodes accounting for 80565.66MB, 100% of 80565.66MB total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context          
----------------------------------------------------------+-------------
                                        16492.49MB   100% |   google.golang.org/grpc/encoding/proto.codec.Marshal /root/go/pkg/mod/google.golang.org/grpc@v1.33.1/encoding/proto/proto.go:70
16492.49MB 20.47% 20.47% 16492.49MB 20.47%                | github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Marshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:4250
----------------------------------------------------------+-------------
                                         9612.19MB   100% |   syscall.BytePtrFromString /src/toolchain/_go/1.15.3/go/src/syscall/syscall.go:69
 9612.19MB 11.93% 32.40%  9612.19MB 11.93%                | syscall.ByteSliceFromString /src/toolchain/_go/1.15.3/go/src/syscall/syscall.go:53
----------------------------------------------------------+-------------
                                         7251.11MB   100% |   github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:277
 7251.11MB  9.00% 41.40%  7251.11MB  9.00%                | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1337
----------------------------------------------------------+-------------
                                         2230.10MB 33.76% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1348
                                         2195.60MB 33.24% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1347
                                         2180.10MB 33.00% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1346
 6605.80MB  8.20% 49.60%  6605.80MB  8.20%                | github.com/liftbridge-io/liftbridge/server.eventTimestampsToProto /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1320
----------------------------------------------------------+-------------
                                        20480.48MB   100% |   github.com/liftbridge-io/liftbridge/server.(*metadataAPI).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:123
 5148.55MB  6.39% 55.99% 20480.48MB 25.42%                | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:277
                                         7251.11MB 35.40% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1337
                                         2230.10MB 10.89% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1348
                                         2195.60MB 10.72% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1347
                                         2180.10MB 10.64% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1346
                                          758.01MB  3.70% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1340
                                          717.01MB  3.50% |   github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1341
----------------------------------------------------------+-------------
                                         4391.40MB   100% |   github.com/liftbridge-io/liftbridge/server.(*metadataAPI).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:123
 4391.40MB  5.45% 61.44%  4391.40MB  5.45%                | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:279
----------------------------------------------------------+-------------
                                         3173.56MB   100% |   strings.(*Builder).Grow /src/toolchain/_go/1.15.3/go/src/strings/builder.go:82 (inline)
 3173.56MB  3.94% 65.38%  3173.56MB  3.94%                | strings.(*Builder).grow /src/toolchain/_go/1.15.3/go/src/strings/builder.go:68
----------------------------------------------------------+-------------
                                         2706.48MB   100% |   github.com/liftbridge-io/liftbridge/server.(*metadataAPI).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:123
 2706.48MB  3.36% 68.74%  2706.48MB  3.36%                | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:260
----------------------------------------------------------+-------------
                                         2212.10MB   100% |   github.com/liftbridge-io/liftbridge/server.(*metadataAPI).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:123
 2212.10MB  2.75% 71.49%  2212.10MB  2.75%                | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:275

client profile:

File: ems-coordinator
Type: alloc_space
Time: Dec 7, 2020 at 6:17pm (GMT)
Showing nodes accounting for 118732.17MB, 100% of 118732.17MB total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context          
----------------------------------------------------------+-------------
                                        25840.43MB   100% |   google.golang.org/grpc.recvAndDecompress /root/go/pkg/mod/google.golang.org/grpc@v1.33.1/rpc_util.go:689
25840.43MB 21.76% 21.76% 25840.43MB 21.76%                | google.golang.org/grpc.(*parser).recvMsg /root/go/pkg/mod/google.golang.org/grpc@v1.33.1/rpc_util.go:576
----------------------------------------------------------+-------------
                                        15631.98MB 79.33% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1518
                                         4013.77MB 20.37% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1241
                                           58.51MB   0.3% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1543
19704.26MB 16.60% 38.36% 19704.26MB 16.60%                | github.com/liftbridge-io/go-liftbridge/v2.(*metadataCache).update /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/metadata.go:294
----------------------------------------------------------+-------------
                                         8983.87MB   100% |   github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8604
 8983.87MB  7.57% 45.93%  8983.87MB  7.57%                | github.com/liftbridge-io/liftbridge-api/go.(*StreamMetadata).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:10692
----------------------------------------------------------+-------------
                                         7240.88MB   100% |   github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8604
 7240.88MB  6.10% 52.02%  7240.88MB  6.10%                | github.com/liftbridge-io/liftbridge-api/go.(*StreamMetadata).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:10578
----------------------------------------------------------+-------------
                                         7120.87MB   100% |   github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8604
 7120.87MB  6.00% 58.02%  7120.87MB  6.00%                | github.com/liftbridge-io/liftbridge-api/go.(*StreamMetadata).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:10546
----------------------------------------------------------+-------------
                                         7089.10MB   100% |   google.golang.org/grpc/encoding/proto.codec.Unmarshal /root/go/pkg/mod/google.golang.org/grpc@v1.33.1/encoding/proto/proto.go:88
 7089.10MB  5.97% 63.99%  7089.10MB  5.97%                | github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8603
----------------------------------------------------------+-------------
                                         6355.18MB   100% |   github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8604
 6355.18MB  5.35% 69.34%  6355.18MB  5.35%                | github.com/liftbridge-io/liftbridge-api/go.(*StreamMetadata).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:10712
----------------------------------------------------------+-------------
                                         4334.02MB 79.97% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1518
                                         1071.64MB 19.77% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1241
                                           13.81MB  0.25% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1543
 5419.47MB  4.56% 73.91%  5419.47MB  4.56%                | github.com/liftbridge-io/go-liftbridge/v2.(*metadataCache).update /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/metadata.go:303
----------------------------------------------------------+-------------
                                         2789.17MB 79.04% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1518
                                          724.54MB 20.53% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1241
                                              15MB  0.43% |   github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1543
 3528.72MB  2.97% 76.88%  3528.72MB  2.97%                | github.com/liftbridge-io/go-liftbridge/v2.(*metadataCache).update /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/metadata.go:279

tylertreat commented 3 years ago

> Keeping track of the brokers for a cluster is obviously necessary but do we need all the streams? Could individual stream partitions be fetched into the cache on demand?

Yes, this is an area for improvement I've had in mind. The client should only fetch the streams it needs. Also, the FetchMetadata RPC already supports this. It just defaults to fetching everything if no streams are specified, so it should be a fairly simple change.
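
To make the suggested change concrete, here is a rough sketch of a cache that only holds metadata for streams a caller has actually asked for. It is an illustration under stated assumptions, not the client's implementation: `streamInfo`, `fetchFunc`, `lazyCache`, and `demo` are hypothetical names, and `fetchFunc` stands in for a FetchMetadata call scoped to specific streams, with a signature invented for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// streamInfo is a hypothetical, trimmed-down stand-in for per-stream metadata.
type streamInfo struct {
	Name       string
	Partitions int
}

// fetchFunc stands in for a metadata RPC scoped to the named streams.
// Its signature is an assumption for this sketch, not the client's real API.
type fetchFunc func(ctx context.Context, streams []string) (map[string]streamInfo, error)

var errNoSuchStream = errors.New("no such stream")

// lazyCache holds metadata only for streams that have been asked for.
type lazyCache struct {
	mu      sync.RWMutex
	streams map[string]streamInfo
	fetch   fetchFunc
}

func newLazyCache(fetch fetchFunc) *lazyCache {
	return &lazyCache{streams: make(map[string]streamInfo), fetch: fetch}
}

// get returns cached metadata for name, fetching just that stream on a miss.
func (c *lazyCache) get(ctx context.Context, name string) (streamInfo, error) {
	c.mu.RLock()
	info, ok := c.streams[name]
	c.mu.RUnlock()
	if ok {
		return info, nil
	}
	resp, err := c.fetch(ctx, []string{name})
	if err != nil {
		return streamInfo{}, err
	}
	info, ok = resp[name]
	if !ok {
		return streamInfo{}, errNoSuchStream
	}
	c.mu.Lock()
	c.streams[name] = info
	c.mu.Unlock()
	return info, nil
}

// demo runs the cache against a fake 3000-stream cluster and reports how many
// streams ended up cached, how many were requested, and the last error.
func demo() (cached, requested int, err error) {
	cluster := make(map[string]streamInfo)
	for i := 0; i < 3000; i++ {
		name := fmt.Sprintf("stream-%d", i)
		cluster[name] = streamInfo{Name: name, Partitions: 1}
	}
	cache := newLazyCache(func(_ context.Context, names []string) (map[string]streamInfo, error) {
		requested += len(names)
		out := make(map[string]streamInfo)
		for _, n := range names {
			if s, ok := cluster[n]; ok {
				out[n] = s
			}
		}
		return out, nil
	})
	ctx := context.Background()
	_, _ = cache.get(ctx, "stream-42")
	_, _ = cache.get(ctx, "stream-42") // second lookup is served from the cache
	_, err = cache.get(ctx, "missing") // unknown streams are not cached
	return len(cache.streams), requested, err
}

func main() {
	fmt.Println(demo())
}
```

In this sketch a subscribe to a stream that does not exist yet costs one single-stream lookup rather than a response covering the whole cluster. Concurrent misses for the same stream would still each issue a fetch, and nothing here handles invalidation when partition leadership changes, so a real change would need to address both.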