stolostron / multicluster-global-hub

the main repository for the multicluster global hub
Apache License 2.0

Investigate the manager footprint #848

Closed: yanmxa closed this issue 4 months ago

yanmxa commented 5 months ago

Initialize the manager

[screenshot: manager memory usage after initialization]

The manager memory is about 43 MB (measured at 2024-03-20 11:40 AM).

yanmxa commented 5 months ago

Use the pprof to analyze the manager

[screenshot: manager memory usage during profiling]

Even though the manager's memory stays below 60 MB, we can still see a small fluctuation around 50 MB.
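
For context, here is a minimal sketch of how a heap-profiling endpoint like the one queried below (localhost:6062) can be exposed in a Go process. The port and the use of net/http/pprof are assumptions about the setup, not necessarily how the manager actually wires it up; a controller-runtime based manager could also use the PprofBindAddress option.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Expose the profiling endpoints on the port used by the pprof commands below.
	go func() {
		log.Println(http.ListenAndServe("localhost:6062", nil))
	}()

	// ... start the manager / controllers here ...
	select {}
}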

go tool pprof --alloc_space --source_path $(go env GOPATH)/pkg http://localhost:6062/debug/pprof/heap
...
File: manager
Build ID: 7a0e85034a2264a9abb8354fd7a827a55383ec15
Type: alloc_space
Time: Mar 20, 2024 at 4:00pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 337.87MB, 69.21% of 488.18MB total
Dropped 289 nodes (cum <= 2.44MB)
Showing top 10 nodes out of 179
      flat  flat%   sum%        cum   cum%
  194.80MB 39.90% 39.90%   239.78MB 49.12%  compress/flate.NewWriter
   40.58MB  8.31% 48.22%    40.58MB  8.31%  compress/flate.(*compressor).initDeflate (inline)
   23.31MB  4.77% 52.99%    37.31MB  7.64%  github.com/prometheus/client_golang/prometheus.(*Registry).Gather
   22.90MB  4.69% 57.68%    22.90MB  4.69%  github.com/confluentinc/confluent-kafka-go/v2/kafka.NewProducer
      21MB  4.30% 61.98%    24.50MB  5.02%  github.com/confluentinc/confluent-kafka-go/v2/kafka.(*handle).eventPoll
    8.07MB  1.65% 63.64%     8.07MB  1.65%  strings.Fields
       8MB  1.64% 65.27%    13.50MB  2.77%  github.com/prometheus/client_golang/prometheus.processMetric
    7.21MB  1.48% 66.75%     7.21MB  1.48%  regexp.(*bitState).reset
    6.01MB  1.23% 67.98%     6.01MB  1.23%  io.ReadAll
       6MB  1.23% 69.21%        6MB  1.23%  path.Join
(pprof)
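
Most of the allocation volume above is attributed to compress/flate.NewWriter. The full call graph (not shown here) would confirm where it is reached from; a common source in controller processes is gzip-compressing HTTP responses such as /metrics scrapes, which would also match the prometheus Registry.Gather entry. Purely as an illustration (not necessarily what was changed for this issue), that churn can be avoided by serving metrics without compression:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serving /metrics without gzip avoids allocating a new flate writer for
	// every scrape. The registry and listen address below are illustrative.
	reg := prometheus.NewRegistry()
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		DisableCompression: true,
	}))
	log.Fatal(http.ListenAndServe(":8384", nil))
}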
go tool pprof --inuse_space --source_path $(go env GOPATH)/pkg http://localhost:6062/debug/pprof/heap
...
File: manager
Build ID: 7a0e85034a2264a9abb8354fd7a827a55383ec15
Type: inuse_space
Time: Mar 20, 2024 at 4:02pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 31702.91kB, 91.15% of 34779.53kB total
Showing top 10 nodes out of 131
      flat  flat%   sum%        cum   cum%
   23448kB 67.42% 67.42%    23448kB 67.42%  github.com/confluentinc/confluent-kafka-go/v2/kafka.NewProducer
    1634kB  4.70% 72.12%     1634kB  4.70%  compress/flate.(*compressor).initDeflate (inline)
 1536.51kB  4.42% 76.54%  1536.51kB  4.42%  go.uber.org/zap/zapcore.newCounters
 1042.12kB  3.00% 79.53%  1042.12kB  3.00%  k8s.io/apimachinery/pkg/conversion.ConversionFuncs.AddUntyped
 1025.94kB  2.95% 82.48%  1025.94kB  2.95%  regexp/syntax.(*compiler).inst
  902.59kB  2.60% 85.08%  2536.58kB  7.29%  compress/flate.NewWriter
  553.04kB  1.59% 86.67%   553.04kB  1.59%  github.com/gogo/protobuf/proto.RegisterType
  525.43kB  1.51% 88.18%   525.43kB  1.51%  github.com/google/gnostic/openapiv3.init
  518.65kB  1.49% 89.67%   518.65kB  1.49%  k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
  516.64kB  1.49% 91.15%   516.64kB  1.49%  runtime.procresize
(pprof)

(pprof) list NewProducer
...
   15.27MB    15.27MB    583:   p.events = make(chan Event, eventsChanSize)
    7.63MB     7.63MB    584:   p.produceChannel = make(chan *Message, produceChannelSize)
         .          .    585:   p.pollerTermChan = make(chan bool)
         .          .    586:   p.isClosed = 0
         .          .    587:
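
The list output shows that the producer's retained memory is dominated by the pre-allocated events and produce channels (roughly 1,000,000 slots each by default). Below is a minimal sketch of shrinking them through confluent-kafka-go's Go-specific config keys; the values are illustrative, not the ones used in the PR:

package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	// The ~23 MB retained by kafka.NewProducer corresponds to the events and
	// produce channels. Smaller buffers shrink that allocation; the exact
	// values here are guesses for illustration only.
	cfg := &kafka.ConfigMap{
		"bootstrap.servers":       "localhost:9092", // placeholder broker address
		"go.events.channel.size":  1000,             // default 1000000
		"go.produce.channel.size": 1000,             // default 1000000
	}

	producer, err := kafka.NewProducer(cfg)
	if err != nil {
		log.Fatalf("failed to create producer: %v", err)
	}
	defer producer.Close()
}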

Reduce the manager footprint: https://github.com/stolostron/multicluster-global-hub/pull/850

[screenshot: manager memory after the footprint reduction]
yanmxa commented 5 months ago
[screenshot: manager memory trend without workload]

From the memory trend in the chart above, there are no obvious memory issues with the manager when it is running without any workload, and the overall memory stays below 60 MB.

yanmxa commented 5 months ago

Reduce the memory by fixing the data loss issue: https://github.com/stolostron/multicluster-global-hub/pull/864

[screenshot: manager memory after the data-loss fix]

yanmxa commented 5 months ago

After a few days at this scale, the manager's memory consumption grew from 80 MB to 180 MB, so there must be a memory leak in the current code.

[screenshots: manager memory growth over several days]

It seems this bug also exists in the agent: https://github.com/stolostron/multicluster-global-hub/issues/743#issuecomment-2029027167

reference: https://github.com/cloudevents/sdk-go/issues/1030

yanmxa commented 5 months ago

The transport consumer still takes up a relatively large amount of resources; this will be tracked with the Confluent community: https://github.com/confluentinc/confluent-kafka-go/issues/1171
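
For reference, a hedged sketch of consumer-side settings that influence that footprint; librdkafka's local pre-fetch queue is capped by queued.max.messages.kbytes (65536 kB by default), and none of the values below are confirmed as the eventual fix:

package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	// Illustrative settings only: queued.max.messages.kbytes caps librdkafka's
	// local pre-fetch buffer, which accounts for much of the consumer's
	// steady-state memory use. The group id is a placeholder.
	cfg := &kafka.ConfigMap{
		"bootstrap.servers":          "localhost:9092",
		"group.id":                   "global-hub-consumer",
		"queued.max.messages.kbytes": 16384,
	}

	consumer, err := kafka.NewConsumer(cfg)
	if err != nil {
		log.Fatalf("failed to create consumer: %v", err)
	}
	defer consumer.Close()
}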