nginxinc / nginx-ingress-operator

WARNING - DEPRECATION NOTICE: The NGINX Ingress Operator has been updated to be a Helm based operator. This repo has been deprecated and will soon be archived - the new NGINX Ingress Operator repo can be found at https://github.com/nginxinc/nginx-ingress-helm-operator.

Nginx Ingress Operator v0.3.0 OOMKILL #139

Closed vknemanavar closed 3 years ago

vknemanavar commented 3 years ago

Describe the bug: The NGINX Ingress Operator v0.3.0 pod keeps restarting with OOMKilled.

To Reproduce: Steps to reproduce the behavior:

  1. Go to Operator Hub
  2. Search for and select the NGINX certified operator, then click Install
  3. oc describe pod nginx-ingress-operator-controller-manager-7c9d8899f8-dkt6b -n openshift-operators

Name:          nginx-ingress-operator-controller-manager-7c9d8899f8-dkt6b
Namespace:     openshift-operators
Priority:      0
Node:          10.73.184.244/10.73.184.244
Start Time:    Fri, 16 Jul 2021 15:16:02 +0530
Labels:        control-plane=controller-manager
               pod-template-hash=7c9d8899f8
Annotations:   alm-examples: [ { "apiVersion": "k8s.nginx.org/v1alpha1", "kind": "NginxIngressController", "metadata": { "name": "my-nginx-ingress-controller" }, "spec": { "image": { "pullPolicy": "Always", "repository": "docker.io/nginx/nginx-ingress", "tag": "1.12.0-ubi" }, "ingressClass": "nginx", "nginxPlus": false, "serviceType": "NodePort", "type": "deployment" } } ]
               capabilities: Basic Install
               cni.projectcalico.org/podIP: 172.30.162.209/32
               cni.projectcalico.org/podIPs: 172.30.162.209/32
               k8s.v1.cni.cncf.io/network-status: [{ "name": "", "ips": [ "172.30.162.209" ], "default": true, "dns": {} }]
               k8s.v1.cni.cncf.io/networks-status: [{ "name": "", "ips": [ "172.30.162.209" ], "default": true, "dns": {} }]
               olm.operatorGroup: global-operators
               olm.operatorNamespace: openshift-operators
               olm.targetNamespaces:
               openshift.io/scc: restricted
               operatorframework.io/properties: {"properties":[{"type":"olm.gvk","value":{"group":"k8s.nginx.org","kind":"NginxIngressController","version":"v1alpha1"}},{"type":"olm.pack...
               operators.operatorframework.io/builder: operator-sdk-v1.8.0
               operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
Status:        Running
IP:            172.30.162.209
IPs:
  IP:          172.30.162.209
Controlled By: ReplicaSet/nginx-ingress-operator-controller-manager-7c9d8899f8
Containers:
  kube-rbac-proxy:
    Container ID:  cri-o://a3b2029f7b667244b6689da14e15cb6ce10ae085579e7912f715197d1727d559
    Image:         registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:6d0286b8a8f6f3cd9d6cd8319400acf27b70fbb52df5808ec6fe2d9849be7d8c
    Image ID:      registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:6d0286b8a8f6f3cd9d6cd8319400acf27b70fbb52df5808ec6fe2d9849be7d8c
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=10
    State:          Running
      Started:      Fri, 16 Jul 2021 15:16:20 +0530
    Ready:          True
    Restart Count:  0
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-operator-controller-manager-token-rpmpw (ro)
  manager:
    Container ID:  cri-o://58cb0c4b2705fda2a97f39368c2b48c593c0ad339a35d82d28a5b99decdd4316
    Image:         registry.connect.redhat.com/nginx/nginx-ingress-operator@sha256:519b5ebc20fa938dab50842a053cedea7dffeec07360ee66c4aac43f1bc63f9f
    Image ID:      registry.connect.redhat.com/nginx/nginx-ingress-operator@sha256:519b5ebc20fa938dab50842a053cedea7dffeec07360ee66c4aac43f1bc63f9f
    Port:
    Host Port:
    Command:
      /manager
    Args:
      --health-probe-bind-address=:8081
      --metrics-bind-address=127.0.0.1:8080
      --leader-elect
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 16 Jul 2021 15:20:01 +0530
      Finished:     Fri, 16 Jul 2021 15:20:27 +0530
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:     250m
      memory:  64Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-operator-controller-manager-token-rpmpw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nginx-ingress-operator-controller-manager-token-rpmpw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-operator-controller-manager-token-rpmpw
    Optional:    false
QoS Class:       Burstable
Node-Selectors:
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason          Age                    From                    Message
  Normal   Scheduled                                                      Successfully assigned openshift-operators/nginx-ingress-operator-controller-manager-7c9d8899f8-dkt6b to 10.73.184.244
  Normal   AddedInterface  4m46s                  multus                  Add eth0 [172.30.162.209/32]
  Normal   Pulled          4m46s                  kubelet, 10.73.184.244  Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:6d0286b8a8f6f3cd9d6cd8319400acf27b70fbb52df5808ec6fe2d9849be7d8c" already present on machine
  Normal   Created         4m45s                  kubelet, 10.73.184.244  Created container kube-rbac-proxy
  Normal   Started         4m45s                  kubelet, 10.73.184.244  Started container kube-rbac-proxy
  Normal   Pulled          2m18s (x4 over 4m45s)  kubelet, 10.73.184.244  Container image "registry.connect.redhat.com/nginx/nginx-ingress-operator@sha256:519b5ebc20fa938dab50842a053cedea7dffeec07360ee66c4aac43f1bc63f9f" already present on machine
  Normal   Created         2m18s (x4 over 4m45s)  kubelet, 10.73.184.244  Created container manager
  Normal   Started         2m18s (x4 over 4m45s)  kubelet, 10.73.184.244  Started container manager
  Warning  Unhealthy       2m13s (x4 over 4m13s)  kubelet, 10.73.184.244  Readiness probe failed: Get "http://172.30.162.209:8081/readyz": dial tcp 172.30.162.209:8081: connect: connection refused
  Warning  BackOff         112s (x6 over 3m36s)   kubelet, 10.73.184.244  Back-off restarting failed container

Expected behavior: The NGINX Ingress Operator pod should run without restarting.

Your environment: IBM ROKS 4.6

Additional context: I even tried increasing the memory limit to 512Mi, but the pod still keeps restarting with OOMKilled.
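One way to confirm whether the limit is actually being exhausted is to compare it against the pod's live usage. A minimal sketch, assuming the cluster metrics API is available and using the pod name from the report above:

    # Show current CPU/memory consumption of the operator pod (requires cluster metrics)
    oc adm top pod -n openshift-operators \
      nginx-ingress-operator-controller-manager-7c9d8899f8-dkt6b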

soneillf5 commented 3 years ago

Hi @vknemanavar

Thanks for logging this issue. We are investigating it now.

Based on the logs you provided and our investigation, the resource limits are the cause of the issue. The pod's QoS class is Burstable. According to this guide, https://docs.openshift.com/container-platform/3.6/dev_guide/compute_resources.html#quality-of-service-tiers, it is classified as Burstable because the resource limits are not the same as the requests. Quoting the guide:

... If there is an out of memory event on the node, `Burstable` containers are killed after `BestEffort` containers when attempting to recover memory.

So if anything on the node triggers an out of memory event, the operator's Burstable containers are candidates to be killed (after BestEffort ones) to recover memory.
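To double-check which tier a pod landed in, the assigned QoS class is recorded in the pod status. A quick sketch, using the pod name from the report above:

    # Print the QoS class Kubernetes assigned to the pod
    # (expected to print "Burstable" here, since limits != requests)
    oc get pod nginx-ingress-operator-controller-manager-7c9d8899f8-dkt6b \
      -n openshift-operators -o jsonpath='{.status.qosClass}'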

While we continue to investigate, as a workaround you can edit the operator's manifest YAML and remove those resource limits.
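As a rough sketch of that workaround (not an official fix): for an OLM-installed operator the limits live in its ClusterServiceVersion, so they need to be removed there rather than on the Deployment, which OLM typically reconciles back to match the CSV. The CSV name, deployment index, and container index below are assumptions, so verify them against your cluster first:

    # Find the exact CSV name for the operator (the name used below is assumed)
    oc get csv -n openshift-operators

    # Remove the resource limits from the "manager" container embedded in the
    # CSV's install strategy. Deployment index 0 and container index 1 are
    # assumptions; check the CSV spec before patching.
    oc patch csv nginx-ingress-operator.v0.3.0 -n openshift-operators --type=json \
      -p='[{"op": "remove", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/1/resources/limits"}]'

Alternatively, running oc edit csv <csv-name> -n openshift-operators and deleting the limits block by hand achieves the same thing.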

vknemanavar commented 3 years ago

I could do that by editing the YAML, but it should be handled as part of the operator itself; why does this step need to be done manually on top of the install?

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 3 years ago

This issue was closed because it has been stalled for 7 days with no activity.