redhat-cop / namespace-configuration-operator

The namespace-configuration-operator helps keep configurations related to Users, Groups, and Namespaces aligned with one or more policies specified as CRs
Apache License 2.0

Memory leak #34

Closed sboschman closed 4 years ago

sboschman commented 4 years ago

It seems the namespace-configuration-operator keeps using more and more memory. After a node suffered some OOM issues, the log showed a namespace-configuration-operator process being OOM-killed by the kernel, releasing 150 GB of memory.

As the chart currently does not support setting resource limits/requests for the operator process, I used a k8s LimitRange resource to cap the operator at 2 GiB of memory. The following graph shows the operator's memory usage growing over time until it reaches the 2 GiB limit and the container gets terminated.

[graph: namespace-configuration-operator memory usage over time]
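For reference, a rough sketch of such a LimitRange expressed with the client-go types; the object name, namespace, and 2Gi default are illustrative, not taken from the chart:

```go
// Hypothetical sketch of the LimitRange workaround described above; the
// name, namespace and values are illustrative only.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// operatorMemoryCap applies a default 2Gi memory limit to every container
// in the operator's namespace, since the chart itself sets no limits.
var operatorMemoryCap = corev1.LimitRange{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "operator-memory-cap",              // hypothetical name
		Namespace: "namespace-configuration-operator", // hypothetical namespace
	},
	Spec: corev1.LimitRangeSpec{
		Limits: []corev1.LimitRangeItem{{
			Type: corev1.LimitTypeContainer,
			Default: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("2Gi"),
			},
		}},
	},
}
```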

The CPU resource limit is 2, and this is what docker stats reports (it is burning CPU cycles: 204.75%):

CONTAINER ID        NAME                                                                                                                                                             CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
f4a81a066fb9        k8s_namespace-configuration-operator_namespace-configuration-operator-68c8b55c4c-s2sg5_namespace-configuration-operator_b8d3e76b-5e9f-492f-9459-ebd08737e72c_0   204.75%             1.977GiB / 2GiB     98.84%              0B / 0B             807GB / 943kB       23
ghost commented 4 years ago

FWIW I'm seeing the same thing on a 3.11 cluster.

sboschman commented 4 years ago

The leak seems to be in the metrics part:

Just after start:

Showing nodes accounting for 31415.49kB, 100% of 31415.49kB total
      flat  flat%   sum%        cum   cum%
10754.25kB 34.23% 34.23% 12290.33kB 39.12%  encoding/json.(*decodeState).objectInterface
 4620.32kB 14.71% 48.94%  4620.32kB 14.71%  k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
 2640.83kB  8.41% 57.35%  2640.83kB  8.41%  k8s.io/apimachinery/pkg/watch.(*Broadcaster).Watch.func1
 2590.11kB  8.24% 65.59%  2590.11kB  8.24%  github.com/beorn7/perks/quantile.newStream
 1536.08kB  4.89% 70.48%  1536.08kB  4.89%  encoding/json.unquote
 1030.01kB  3.28% 73.76%  7186.39kB 22.88%  k8s.io/client-go/restmapper.NewDiscoveryRESTMapper

A few minutes later:

Showing nodes accounting for 188.79MB, 96.66% of 195.32MB total
Dropped 75 nodes (cum <= 0.98MB)
      flat  flat%   sum%        cum   cum%
  116.85MB 59.82% 59.82%   116.85MB 59.82%  github.com/beorn7/perks/quantile.newStream
   27.34MB 14.00% 73.82%    27.34MB 14.00%  k8s.io/apimachinery/pkg/watch.(*Broadcaster).Watch.func1
      13MB  6.66% 80.48%    15.50MB  7.94%  encoding/json.(*decodeState).objectInterface
   12.55MB  6.42% 86.90%   129.90MB 66.50%  github.com/prometheus/client_golang/prometheus.newSummary
    8.53MB  4.37% 91.27%     8.53MB  4.37%  k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
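For completeness, the dumps above are Go heap profiles. A minimal, purely illustrative way to expose them from the operator binary, in case it does not already do so, would be something like:

```go
// Hypothetical sketch: expose Go's built-in profiling endpoints so heap
// profiles like the ones above can be pulled from the running process.
package sketch

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// startProfiler serves the pprof handlers on a side port; the address is
// illustrative. A heap profile can then be fetched with:
//   go tool pprof http://localhost:6060/debug/pprof/heap
func startProfiler() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}
```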
sboschman commented 4 years ago

Upgraded the dependencies to Operator SDK 0.15, pinning k8s to 1.16.2. After fixing a few compile errors I was able to start the operator locally:

{"level":"info","ts":1579777403.3433406,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"0.0.0.0:8383"}
{"level":"info","ts":1579777403.9105785,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}

{"level":"info","ts":1579777406.6541378,"logger":"controller_namespaceconfig","msg":"adding managed object for","object": ... }
{"level":"info","ts":1579777407.2253845,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1579777407.225979,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}

{"level":"info","ts":1579777407.2702403,"logger":"controller_namespaceconfig","msg":"adding managed object for","object":
{"level":"info","ts":1579777407.8414943,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"error","ts":1579777407.8416977,"logger":"controller-runtime.metrics","msg":"metrics server failed to listen. You may want to disable the metrics server or use another port if it is due to conflicts","error":"error listening on :8080: listen tcp :8080: bind: address already in use" ...}

After disabling the creation of a manager for each resource (https://github.com/redhat-cop/namespace-configuration-operator/blob/master/pkg/controller/namespaceconfig/namespaceconfig_controller.go#L518, introduced with #11), the memory usage seems stable (see the sketch after the profile below):

Showing nodes accounting for 2820.76kB, 100% of 2820.76kB total
      flat  flat%   sum%        cum   cum%
  768.26kB 27.24% 27.24%   768.26kB 27.24%  go.uber.org/zap/zapcore.newCounters
  516.01kB 18.29% 45.53%   516.01kB 18.29%  sync.(*Map).dirtyLocked
  512.28kB 18.16% 63.69%   512.28kB 18.16%  k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
  512.19kB 18.16% 81.85%   512.19kB 18.16%  runtime.malg
  512.02kB 18.15%   100%  1024.21kB 36.31%  runtime.systemstack
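A hedged sketch of the pattern behind this (not the operator's actual code): constructing a controller-runtime manager per watched resource registers a fresh cache, workqueue, and set of Prometheus summary metrics each time, which lines up with the prometheus.newSummary / quantile.newStream growth in the profiles above; reusing one manager created at startup avoids that growth.

```go
// Illustrative only; function names and options are assumptions, not the
// operator's real code.
package sketch

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// Leaky pattern: called once per watched resource (or per reconcile), every
// manager brings its own cache, workqueue and metrics registrations that are
// never torn down.
func newManagerPerResource() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
}

// Stable pattern: a single manager created at startup, shared by all
// controllers, serving one metrics endpoint.
func run() error {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		MetricsBindAddress: ":8383", // one metrics listener instead of many
	})
	if err != nil {
		return err
	}
	// Controllers for the operator's CRs would be registered against mgr here.
	return mgr.Start(ctrl.SetupSignalHandler())
}
```

The repeated "metrics server is starting to listen" / "address already in use" log lines above are consistent with several managers each trying to start their own metrics server.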