Closed: cooktheryan closed this issue 2 years ago
Bumping the resource requests seems to keep the container online
Instance type is t2.medium
The issue does not appear with a default installation when running Fedora, but on AWS the daemonset requires modification
Going to leave clusters running overnight on t3.medium AWS instances: RHEL with the request change and Fedora with the defaults
This appears to only be an issue with RHEL. I'm going to try to get a RHEL 8.5 beta and see if the issue continues
Seems to be resolved in 8.5
@oglok @fzdarsky should we just identify this as a known issue? If a user sees the following events, they should modify the daemonset to increase the resources.
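For anyone hitting this, a quick way to confirm the symptom (a sketch; the exact events were not captured in this issue) is to watch for restarts and probe-failure events on the DNS pod:

kubectl -n openshift-dns get pods -w
kubectl -n openshift-dns get events --sort-by=.lastTimestamp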
Resource requests are 1:1 from OpenShift and RHCOS should be derived from RHEL8.4, so I'm wondering whether this is really the root cause.
You're testing on a t2.medium (resp. the newer t3 generation) burstable performance instance. The t2.medium baseline is 24 CPU credits per hour, which means it can only sustain 24/60 = 40% of a single vCPU (roughly 20% of each of its 2 vCPUs), unless it has saved up enough credits to burst higher (which gives unpredictable performance). Can you maybe repeat your test on a fixed performance instance, e.g. the equivalent c5.large?
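One rough check for credit exhaustion (a suggestion, not something verified in this thread): steal time on a burstable instance climbs once the baseline is exceeded.

# Watch the "st" (steal) column; sustained non-zero steal on a t2/t3
# instance suggests the CPU credit balance has run out.
vmstat 5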
@fzdarsky I was able to reproduce the results within PSI on a system with 2 vCPUs and 3 GB of memory
I.e. the problem exists on RHEL8.4, but not Fedora? Are you bumping up the CPU or memory resource requests and by how much?
RHEL is the only one showing the issue
Change on the smaller system:
resources:
  requests:
    cpu: 80m
    memory: 90Mi
vs. the base install:
resources:
  requests:
    cpu: 50m
    memory: 70Mi
I would probably want to tune the requests to find the bare minimum increase needed to fix this, but it depends on how we want to handle it.
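For illustration, a sketch of how the bump could be applied with a strategic merge patch; the container name dns is an assumption and should be verified (e.g. with kubectl -n openshift-dns get ds dns-default -o yaml) before patching:

kubectl -n openshift-dns patch daemonset dns-default --type=strategic -p '
spec:
  template:
    spec:
      containers:
      - name: dns          # assumed container name; verify before patching
        resources:
          requests:
            cpu: 80m
            memory: 90Mi
'

Alternatively, kubectl -n openshift-dns edit daemonset dns-default and change the requests by hand.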
@fzdarsky to confirm, it is only RHEL 8.4. This goes away in 8.5, as does the other known 8.4 issue.
kubectl get po -A
NAMESPACE                       NAME                                  READY   STATUS    RESTARTS   AGE
kube-system                     kube-flannel-ds-kmm69                 1/1     Running   0          12m
kubevirt-hostpath-provisioner   kubevirt-hostpath-provisioner-tbc2d   1/1     Running   0          12m
openshift-dns                   dns-default-g7l6s                     2/2     Running   0          12m
openshift-dns                   node-resolver-mjb42                   1/1     Running   0          12m
openshift-ingress               router-default-6c96f6bc66-nznkr       1/1     Running   0          12m
openshift-service-ca            service-ca-84b44986cb-tfq9v           1/1     Running   0          12m
[ec2-user@ip-172-31-17-235 ~]$ cat /proc/meminfo
MemTotal: 3823500 kB
MemFree: 306772 kB
Resolved in RHEL 8.5
What happened:
On smaller systems (2 vCPU / 4 GB RAM) the DNS container within the dns-default daemonset fails its health check and then becomes ready again after a short amount of time.
What you expected to happen:
No containers restart
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
I need to report back on whether similar issues occur on different operating systems. This could potentially be a RHEL-only bug.
Environment:
Microshift version (use microshift version):
  MicroShift Version: 4.8.0-0.microshift-unknown
  Base OKD Version: 4.8.0-0.okd-2021-10-10-030117
Hardware configuration: amd64
OS (e.g. cat /etc/os-release): RHEL 8.4
Kernel (e.g. uname -a): 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Others:
Relevant Logs
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 20s