solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

High CPU consumption when having multiple gloo pods and disableLeaderElection: false #8195

Open edubonifs opened 1 year ago

edubonifs commented 1 year ago

Gloo Edge Version

1.13.x

Kubernetes Version

None

Describe the bug

I have installed GE 1.13.15 with multiple replicas of the gloo pod and gloo.gloo.disableLeaderElection: false.

I am seeing that the pod which wasn't chosen as the leader is consuming a lot of CPU:

gloo-5f6dd5959f-n4g7q 10m 53Mi
gloo-5f6dd5959f-sx2bb 1945m 46Mi

This pod doesn't report any repeated logs:

edubonilla@Solo-System-EBonilla edge-eks % k logs -n gloo-system gloo-5f6dd5959f-kspr8
{"level":"info","ts":"2023-05-05T07:40:49.151Z","caller":"probes/healthz.go:23","msg":"healthz server starting at :8765"}
{"level":"info","ts":"2023-05-05T07:40:49.152Z","caller":"stats/stats.go:96","msg":"Stats server starting at :9091"}
{"level":"warn","ts":"2023-05-05T07:40:49.153Z","caller":"setup/setup.go:86","msg":"LICENSE WARNING: license expired"}
{"level":"info","ts":"2023-05-05T07:40:49.177Z","logger":"gloo-ee.v1.event_loop","caller":"v1/setup_event_loop.sk.go:79","msg":"event loop started","version":"1.13.15"}
I0505 07:40:49.185726       1 leaderelection.go:248] attempting to acquire leader lease gloo-system/gloo-ee...
{"level":"info","ts":"2023-05-05T07:40:49.212Z","caller":"setup/setup.go:44","msg":"new leader elected with ID: gloo-55c6cf4dfd-hcq8l_15ba9ecb-f25d-44a4-812e-79d4b0501a2c"}
{"level":"info","ts":"2023-05-05T07:40:49.725Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop","caller":"v1/eds_event_loop.sk.go:79","msg":"event loop started","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.760Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer","caller":"discovery/run.go:35","msg":"begin sync 12654921833842216521 (25 upstreams)","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.762Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer","caller":"discovery/discovery.go:193","msg":"Received first EDS update from plugin: *ec2.plugin","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.861Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer","caller":"discovery/run.go:65","msg":"end sync 12654921833842216521","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.862Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer","caller":"discovery/discovery.go:193","msg":"Received first EDS update from plugin: *kubernetes.plugin","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.876Z","logger":"gloo-ee.v1.event_loop.setup.gloosnapshot.event_loop","caller":"gloosnapshot/api_event_loop.sk.go:79","msg":"event loop started","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.887Z","logger":"gloo-ee.v1.event_loop.setup","caller":"setup/setup_syncer.go:900","msg":"starting gateway validation server","version":"1.13.15","port":8443,"cert":"/etc/gateway/validation-certs/tls.crt","key":"/etc/gateway/validation-certs/tls.key"}
{"level":"info","ts":"2023-05-05T07:40:49.935Z","logger":"gloo-ee.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:92","msg":"begin sync 11181178096221000641 (0 virtual services, 2 gateways, 0 route tables)","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.946Z","logger":"gloo-ee.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:104","msg":"end sync 11181178096221000641","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.948Z","logger":"gloo-ee.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:80","msg":"begin sync 11181178096221000641 (0 proxies, 25 upstreams, 24 endpoints, 6 secrets, 20 artifacts, 0 auth configs, 0 rate limit configs, 0 graphql apis)","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:40:49.949Z","logger":"gloo-ee.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:175","msg":"end sync 11181178096221000641","version":"1.13.15"}
{"level":"info","ts":"2023-05-05T07:41:20.126Z","caller":"setup/setup.go:44","msg":"new leader elected with ID: gloo-5f6dd5959f-k6cxm_28b4d008-1f89-4f48-a18f-76409d9d7b13"}
{"level":"info","ts":"2023-05-05T07:41:22.383Z","caller":"cache/simple.go:276","msg":"open watch Priority Index 1 and Element Index 0 for type.googleapis.com/enterprise.gloo.solo.io.ExtAuthConfig[] from nodeID \"extauth\", version \"empty\""}

I would like to understand the difference in CPU consumption between these two pods.

Steps to reproduce the bug

Install GE enterprise 1.13.15 with the following values.yaml:

gloo:
  gloo:
    disableLeaderElection: false

Scale the gloo deployment up to more than one replica:

k scale deploy -n gloo-system gloo --replicas=2

You will see that the CPU consumption of the pod that is not the leader is much higher:

gloo-5f6dd5959f-n4g7q                                 10m          53Mi            
gloo-5f6dd5959f-sx2bb                                 1945m        46Mi
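
The comparison above can be made repeatable with a small filter over `kubectl top pods` output. This is a sketch only: `cpu_by_pod` is a hypothetical helper, and it assumes rows of the form NAME CPU MEMORY with CPU reported in millicores (for example `1945m`), as in the figures above.

```shell
# Hypothetical helper: read rows of "NAME CPU MEMORY" on stdin and print
# "<millicores> <pod>" sorted from highest to lowest CPU.
# Assumption: CPU is reported in millicores with an "m" suffix.
cpu_by_pod() {
  awk '$2 ~ /^[0-9]+m$/ { sub(/m$/, "", $2); print $2, $1 }' | sort -rn
}

# Sample rows copied from this report. Against a live cluster this would be:
#   kubectl top pods -n gloo-system --no-headers | cpu_by_pod
printf '%s\n' \
  'gloo-5f6dd5959f-n4g7q 10m 53Mi' \
  'gloo-5f6dd5959f-sx2bb 1945m 46Mi' | cpu_by_pod
```

The first line of output is the pod using the most CPU, which makes it easy to check over time whether it is consistently the non-leader.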

Expected Behavior

We expect the two pods to have similar CPU consumption.

Additional Context

No response

nfuden commented 1 year ago

@edubonifs what are you deploying this onto? I was unable to replicate on Kind

edubonifs commented 1 year ago

I am deploying on EKS.

andrzej-talarek commented 1 year ago

Hello. I'm also seeing similar problems on EKS.

github-actions[bot] commented 4 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.