Open cruickshankpg opened 3 years ago
I got some alloc_space pprofs to work out why my servers were broken:
liftbridge allocs profile:
File: liftbridge
Type: alloc_space
Time: Dec 8, 2020 at 12:36pm (GMT)
Showing nodes accounting for 80565.66MB, 100% of 80565.66MB total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
16492.49MB 100% | google.golang.org/grpc/encoding/proto.codec.Marshal /root/go/pkg/mod/google.golang.org/grpc@v1.33.1/encoding/proto/proto.go:70
16492.49MB 20.47% 20.47% 16492.49MB 20.47% | github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Marshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:4250
----------------------------------------------------------+-------------
9612.19MB 100% | syscall.BytePtrFromString /src/toolchain/_go/1.15.3/go/src/syscall/syscall.go:69
9612.19MB 11.93% 32.40% 9612.19MB 11.93% | syscall.ByteSliceFromString /src/toolchain/_go/1.15.3/go/src/syscall/syscall.go:53
----------------------------------------------------------+-------------
7251.11MB 100% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:277
7251.11MB 9.00% 41.40% 7251.11MB 9.00% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1337
----------------------------------------------------------+-------------
2230.10MB 33.76% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1348
2195.60MB 33.24% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1347
2180.10MB 33.00% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1346
6605.80MB 8.20% 49.60% 6605.80MB 8.20% | github.com/liftbridge-io/liftbridge/server.eventTimestampsToProto /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1320
----------------------------------------------------------+-------------
20480.48MB 100% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:123
5148.55MB 6.39% 55.99% 20480.48MB 25.42% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:277
7251.11MB 35.40% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1337
2230.10MB 10.89% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1348
2195.60MB 10.72% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1347
2180.10MB 10.64% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1346
758.01MB 3.70% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1340
717.01MB 3.50% | github.com/liftbridge-io/liftbridge/server.getPartitionMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:1341
----------------------------------------------------------+-------------
4391.40MB 100% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:123
4391.40MB 5.45% 61.44% 4391.40MB 5.45% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:279
----------------------------------------------------------+-------------
3173.56MB 100% | strings.(*Builder).Grow /src/toolchain/_go/1.15.3/go/src/strings/builder.go:82 (inline)
3173.56MB 3.94% 65.38% 3173.56MB 3.94% | strings.(*Builder).grow /src/toolchain/_go/1.15.3/go/src/strings/builder.go:68
----------------------------------------------------------+-------------
2706.48MB 100% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:123
2706.48MB 3.36% 68.74% 2706.48MB 3.36% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:260
----------------------------------------------------------+-------------
2212.10MB 100% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:123
2212.10MB 2.75% 71.49% 2212.10MB 2.75% | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).createMetadataResponse /root/go/pkg/mod/github.com/liftbridge-io/liftbridge@v1.3.1-0.20201103213614-27994688a544/server/metadata.go:275
client profile:
File: ems-coordinator
Type: alloc_space
Time: Dec 7, 2020 at 6:17pm (GMT)
Showing nodes accounting for 118732.17MB, 100% of 118732.17MB total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
25840.43MB 100% | google.golang.org/grpc.recvAndDecompress /root/go/pkg/mod/google.golang.org/grpc@v1.33.1/rpc_util.go:689
25840.43MB 21.76% 21.76% 25840.43MB 21.76% | google.golang.org/grpc.(*parser).recvMsg /root/go/pkg/mod/google.golang.org/grpc@v1.33.1/rpc_util.go:576
----------------------------------------------------------+-------------
15631.98MB 79.33% | github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1518
4013.77MB 20.37% | github.com/liftbridge-io/go-liftbridge/v2.(*client).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1241
58.51MB 0.3% | github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1543
19704.26MB 16.60% 38.36% 19704.26MB 16.60% | github.com/liftbridge-io/go-liftbridge/v2.(*metadataCache).update /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/metadata.go:294
----------------------------------------------------------+-------------
8983.87MB 100% | github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8604
8983.87MB 7.57% 45.93% 8983.87MB 7.57% | github.com/liftbridge-io/liftbridge-api/go.(*StreamMetadata).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:10692
----------------------------------------------------------+-------------
7240.88MB 100% | github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8604
7240.88MB 6.10% 52.02% 7240.88MB 6.10% | github.com/liftbridge-io/liftbridge-api/go.(*StreamMetadata).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:10578
----------------------------------------------------------+-------------
7120.87MB 100% | github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8604
7120.87MB 6.00% 58.02% 7120.87MB 6.00% | github.com/liftbridge-io/liftbridge-api/go.(*StreamMetadata).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:10546
----------------------------------------------------------+-------------
7089.10MB 100% | google.golang.org/grpc/encoding/proto.codec.Unmarshal /root/go/pkg/mod/google.golang.org/grpc@v1.33.1/encoding/proto/proto.go:88
7089.10MB 5.97% 63.99% 7089.10MB 5.97% | github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8603
----------------------------------------------------------+-------------
6355.18MB 100% | github.com/liftbridge-io/liftbridge-api/go.(*FetchMetadataResponse).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:8604
6355.18MB 5.35% 69.34% 6355.18MB 5.35% | github.com/liftbridge-io/liftbridge-api/go.(*StreamMetadata).Unmarshal /root/go/pkg/mod/github.com/liftbridge-io/liftbridge-api@v1.1.1-0.20201029165056-10f2aa65f256/go/api.pb.go:10712
----------------------------------------------------------+-------------
4334.02MB 79.97% | github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1518
1071.64MB 19.77% | github.com/liftbridge-io/go-liftbridge/v2.(*client).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1241
13.81MB 0.25% | github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1543
5419.47MB 4.56% 73.91% 5419.47MB 4.56% | github.com/liftbridge-io/go-liftbridge/v2.(*metadataCache).update /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/metadata.go:303
----------------------------------------------------------+-------------
2789.17MB 79.04% | github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1518
724.54MB 20.53% | github.com/liftbridge-io/go-liftbridge/v2.(*client).FetchMetadata /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1241
15MB 0.43% | github.com/liftbridge-io/go-liftbridge/v2.(*client).subscribe /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/client.go:1543
3528.72MB 2.97% 76.88% 3528.72MB 2.97% | github.com/liftbridge-io/go-liftbridge/v2.(*metadataCache).update /root/go/pkg/mod/github.com/liftbridge-io/go-liftbridge/v2@v2.0.2-0.20201119170214-c842e2f19749/metadata.go:279
Keeping track of the brokers for a cluster is obviously necessary but do we need all the streams? Could individual stream partitions be fetched into the cache on demand?
Yes, this is an area for improvement I've had in mind. The client should only fetch the streams it needs. Also, the FetchMetadata RPC already supports this. It just defaults to fetching everything if no streams are specified, so it should be a fairly simple change.
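As a rough illustration of the on-demand approach, here is a minimal Go sketch of a cache that fetches only the stream it was asked for. Everything in it is hypothetical: `streamInfo`, `metadataFetcher`, `onDemandCache` and `fakeCluster` are made-up names and not the go-liftbridge API. The only behaviour taken from the discussion is that a request with no streams specified returns everything.

```go
package main

import (
	"fmt"
	"sync"
)

// streamInfo is a hypothetical stand-in for the per-stream metadata the
// real client caches; it is not the go-liftbridge type.
type streamInfo struct {
	Name       string
	Partitions int
}

// metadataFetcher is an assumed abstraction over the FetchMetadata RPC.
// Mirroring the behaviour described above, an empty streams slice means
// "return everything".
type metadataFetcher interface {
	Fetch(streams []string) ([]streamInfo, error)
}

// onDemandCache stores only the streams that callers have asked for.
type onDemandCache struct {
	mu      sync.Mutex
	fetcher metadataFetcher
	streams map[string]streamInfo
}

func newOnDemandCache(f metadataFetcher) *onDemandCache {
	return &onDemandCache{fetcher: f, streams: make(map[string]streamInfo)}
}

// get returns cached metadata for one stream, fetching just that stream on
// a miss instead of refreshing the metadata for the whole cluster.
func (c *onDemandCache) get(stream string) (streamInfo, bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if info, ok := c.streams[stream]; ok {
		return info, true, nil
	}
	infos, err := c.fetcher.Fetch([]string{stream})
	if err != nil {
		return streamInfo{}, false, err
	}
	for _, info := range infos {
		c.streams[info.Name] = info
	}
	info, ok := c.streams[stream]
	return info, ok, nil
}

// fakeCluster is an in-memory fetcher used only for this example.
type fakeCluster struct {
	all      map[string]streamInfo
	returned int // total stream entries sent back across all calls
}

func (f *fakeCluster) Fetch(streams []string) ([]streamInfo, error) {
	var out []streamInfo
	if len(streams) == 0 {
		for _, info := range f.all {
			out = append(out, info)
		}
	} else {
		for _, name := range streams {
			if info, ok := f.all[name]; ok {
				out = append(out, info)
			}
		}
	}
	f.returned += len(out)
	return out, nil
}

func newFakeCluster(n int) *fakeCluster {
	f := &fakeCluster{all: make(map[string]streamInfo, n)}
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("stream-%d", i)
		f.all[name] = streamInfo{Name: name, Partitions: 1}
	}
	return f
}

func main() {
	cluster := newFakeCluster(3000)
	cache := newOnDemandCache(cluster)
	_, found, _ := cache.get("stream-42")
	fmt.Println(found, len(cache.streams), cluster.returned)
}
```

With 3000 simulated streams, a single lookup leaves one entry in the cache and transfers one entry from the fake server. The sketch holds the mutex across the fetch for simplicity; a real client would probably not want to block every lookup behind a network call.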
When subscribing to a stream, if the stream is not in the metadataCache, the cache is completely updated, resulting in the FetchMetadata RPC being called. This can happen if the stream is new or does not exist.
https://github.com/liftbridge-io/go-liftbridge/blob/c842e2f19749a5aa5a7664b1a6d4a366034d8f92/v2/client.go#L1518
We only create a stream when a message is published, to save on unnecessary creates, but this means that if multiple subscribers attempt to subscribe before the publisher, a lot of FetchMetadata RPCs are made. When there are 1000s of streams in the liftbridge cluster (I had 3000 when I hit this), the metadata gets very big, and marshalling it so frequently caused one of my liftbridge servers to become unresponsive and used up all the memory on my client.
Our liftbridge client service has connections to multiple liftbridge clusters, so even storing the full metadata for each cluster takes more memory than we would like.