[BUG] DNS pod goes unready

cooktheryan commented 2 years ago

What happened:

On smaller systems (2 vCPU, 4 GB RAM), the DNS container within the dns-default daemonset fails its health check and then becomes ready again after a short time.

What you expected to happen:

No containers restart

How to reproduce it (as minimally and precisely as possible):

Deploy a MicroShift amd64 system with 2 vCPU and 4 GB memory
Follow the output of kubectl get po -w -n openshift-dns

Anything else we need to know?:

I need to report back on whether similar issues occur on different operating systems. This could potentially be a RHEL-only bug.

Environment:

Microshift version (use microshift version): MicroShift Version: 4.8.0-0.microshift-unknown Base OKD Version: 4.8.0-0.okd-2021-10-10-030117
Hardware configuration: amd64
OS (e.g: cat /etc/os-release): RHEL 8.4
Kernel (e.g. uname -a): 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Others:

Relevant Logs

[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 20s

cooktheryan commented 2 years ago

Bumping the resource requests seems to keep the container online

cooktheryan commented 2 years ago

Instance type is t2.medium

The issue does not appear to occur with a default installation when running Fedora, but on AWS the daemonset has to be modified

cooktheryan commented 2 years ago

Going to leave clusters running overnight on the t3.medium sized AWS instance with the request change on RHEL and default with Fedora

cooktheryan commented 2 years ago

This appears to only be an issue with RHEL. I'm going to try to get a RHEL 8.5 beta and see if the issue continues

cooktheryan commented 2 years ago

Seems to be resolved in 8.5

@oglok @fzdarsky should we just identify this as a known issue? If a user sees these events, they can modify the daemonset to increase the resource requests.

fzdarsky commented 2 years ago

Resource requests are 1:1 from OpenShift and RHCOS should be derived from RHEL8.4, so I'm wondering whether this is really the root cause.

You're testing on a t2.medium (resp. the newer t3 generation) burstable performance instance. The t2.medium performance baseline is 24 CPU credits / hour, which means it only gets 24/60 = 40% of one vCPU (i.e. 20% of each of its 2 vCPUs), unless it has saved up enough credits to burst higher (which makes performance unpredictable). Can you maybe repeat your test with a fixed performance instance, e.g. the equivalent c5.large?
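The credit arithmetic above can be sketched as follows. This assumes the AWS definition that one CPU credit equals one vCPU running at 100% for one minute; the function name `burstable_baseline` is hypothetical and only serves the illustration.

```python
from typing import Tuple


def burstable_baseline(credits_per_hour: float, vcpus: int) -> Tuple[float, float]:
    """Return (share of one vCPU, share per vCPU) sustainable without bursting.

    Assumption: one CPU credit = one vCPU at 100% utilization for one minute,
    so an hour offers 60 credit-minutes per fully used vCPU.
    """
    share_of_one_vcpu = credits_per_hour / 60.0
    share_per_vcpu = share_of_one_vcpu / vcpus
    return share_of_one_vcpu, share_per_vcpu


# t2.medium figures quoted in the comment: 24 credits/hour, 2 vCPUs.
total, per_vcpu = burstable_baseline(24, 2)
print(f"{total:.0%} of one vCPU, {per_vcpu:.0%} per vCPU")
```

With the t2.medium numbers this gives a sustained baseline of 40% of a single vCPU, spread as 20% across each of the two vCPUs, which is why a fixed performance instance is a cleaner test bed.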

cooktheryan commented 2 years ago

@fzdarsky I was able to mimic the results within PSI with a system that was 2cpu and 3GB memory

fzdarsky commented 2 years ago

I.e. the problem exists on RHEL8.4, but not Fedora? Are you bumping up the CPU or memory resource requests and by how much?

cooktheryan commented 2 years ago

RHEL is the only one showing the issue

Change on smaller system

        resources:
          requests:
            cpu: 80m
            memory: 90Mi

Vs base install

        resources:
          requests:
            cpu: 50m
            memory: 70Mi

I would probably want to tune the requests to find the bare minimum increase needed to fix this, but it depends on how we want to handle it.
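For anyone wanting to apply the same change without editing the daemonset by hand, here is a minimal sketch that builds a strategic merge patch carrying the bumped requests. The container name `dns` is an assumption (the thread does not name the container), and `build_requests_patch` is a hypothetical helper, not part of MicroShift.

```python
import json


def build_requests_patch(container: str, cpu: str, memory: str) -> str:
    """Build a strategic merge patch that sets resource requests on one container.

    Containers are matched by name in a strategic merge patch, so the name
    passed here must match the container in the daemonset's pod template.
    """
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ]
                }
            }
        }
    }
    return json.dumps(patch)


# Values from the "smaller system" change above; "dns" is an assumed container name.
print(build_requests_patch("dns", "80m", "90Mi"))
```

The printed JSON could then be passed to `kubectl patch daemonset dns-default -n openshift-dns --type strategic -p '<json>'`. Check the container name against the daemonset first.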

cooktheryan commented 2 years ago

@fzdarsky to confirm, it is only RHEL 8.4. This goes away in 8.5, as does the other known 8.4 issue.

kubectl get po -A
NAMESPACE                       NAME                                  READY   STATUS    RESTARTS   AGE
kube-system                     kube-flannel-ds-kmm69                 1/1     Running   0          12m
kubevirt-hostpath-provisioner   kubevirt-hostpath-provisioner-tbc2d   1/1     Running   0          12m
openshift-dns                   dns-default-g7l6s                     2/2     Running   0          12m
openshift-dns                   node-resolver-mjb42                   1/1     Running   0          12m
openshift-ingress               router-default-6c96f6bc66-nznkr       1/1     Running   0          12m
openshift-service-ca            service-ca-84b44986cb-tfq9v           1/1     Running   0          12m
[ec2-user@ip-172-31-17-235 ~]$ cat /proc/meminfo 
MemTotal:        3823500 kB
MemFree:          306772 kB

cooktheryan commented 2 years ago

Resolved in RHEL 8.5
