I have noticed another interesting thing in this version. Even though I am running 3 masters, if I lose any one of them, the master API service becomes unavailable, so running "oc get pods" returns the following:
[root@origin-311-master1 ~]# oc get pods
Unable to connect to the server: EOF
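As a quick sanity check while a master is down, the API health endpoint on each master can be probed directly (hostname and port as in my environment; /healthz is the standard apiserver health path):
curl -k https://master.local:8443/healthz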
It's likely something is driving a significant amount of traffic to the masters. Please check Prometheus / the metrics endpoints for the apiserver and find the apiserver_request_count numbers so we can eliminate a rogue client.
I created a gist of the metrics here: https://gist.github.com/stratus-ss/ffc630d76d62d7808631184ad259cf51
TL;DR: these are the biggest offenders:
curl -s -k --cert /etc/origin/master/admin.crt --key /etc/origin/master/admin.key https://master.local:8443/metrics |grep apiserver_request_count |sort -k 4 -n
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="services",scope="namespace",subresource="",verb="GET"} 7699
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="namespaces",scope="cluster",subresource="",verb="GET"} 10288
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="nodes",scope="cluster",subresource="status",verb="PATCH"} 13272
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="nodes",scope="cluster",subresource="",verb="GET"} 13274
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0/leader-election",code="200",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="PUT"} 15222
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="GET"} 17435
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="namespace",subresource="",verb="GET"} 26117
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="serviceaccounts",scope="namespace",subresource="",verb="GET"} 26117
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="namespace",subresource="",verb="GET"} 31946
apiserver_request_count{client="service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election",code="200",contentType="application/json",resource="configmaps",scope="namespace",subresource="",verb="PUT"} 37935
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0/leader-election",code="200",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="GET"} 45810
apiserver_request_count{client="service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election",code="200",contentType="application/json",resource="configmaps",scope="namespace",subresource="",verb="GET"} 62067
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="apiservices",scope="cluster",subresource="status",verb="PUT"} 103986
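For a per-client view rather than grepping raw metrics, the same data can be pulled from the bundled Prometheus (a sketch, assuming the monitoring stack is running; substitute your actual Prometheus route for the placeholder):
curl -s -k -G -H "Authorization: Bearer $(oc whoami -t)" --data-urlencode 'query=topk(10, sum by (client, resource, verb) (rate(apiserver_request_count[5m])))' https://<prometheus-route>/api/v1/query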
Is this sufficient or do you have other curl commands that you would like me to run?
Hello, I might be affected as well. I have a similar setup to @stratus-ss. When I restart a master node, neither etcd nor the api-server (both in the kube-system project) will start successfully on the restarted node; right now they are both in CrashLoopBackOff state. This happens to me periodically after each cluster reinstallation when I restart a master node.
etcd crashes because it does not pass its health check in time, and its logs are full of messages containing text like "unexpected EOF" and "i/o timeout". I measured the network speed between the masters and there is 1 Gbit of bandwidth, so I do not understand the timeouts. I am attaching logs from etcd and the api-server.
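For what it's worth, member health can also be checked from a master with etcdctl; the certificate paths below assume a default 3.11 etcd install:
etcdctl --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key --ca-file=/etc/etcd/ca.crt --endpoints=https://127.0.0.1:2379 cluster-health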
Further troubleshooting:
I have taken the following actions without much effect:
I set the following daemonsets to a "dummy" nodeSelector (see the patch sketch below):
apiserver, controller-manager, node-exporter
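A patch along these lines achieves that (a sketch only; the "dummy-selector" label is an arbitrary placeholder and the namespace differs per daemonset):
oc -n openshift-monitoring patch daemonset node-exporter --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"dummy-selector":"true"}}}}}'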
In the openshift-monitoring project I have also scaled the following down to 0 replicas:
oc scale deployments.apps/prometheus-operator --replicas=0
oc scale deployments.apps/kube-state-metrics --replicas=0
oc scale deployments.apps/grafana --replicas=0
oc scale deployments.apps/cluster-monitoring-operator --replicas=0
oc scale statefulset.apps/prometheus-k8s --replicas=0
oc scale statefulset.apps/alertmanager-main --replicas=0
I also tried scaling down the Ansible service broker:
oc project openshift-ansible-service-broker
oc scale deploymentconfig.apps.openshift.io/asb --replicas=0
None of these actions has made any significant impact on the cluster.
Related to #19240?
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
I had a similar issue: every night my OKD cluster consumed all CPU resources. After reviewing etcd's logs, I found the following line:
2020-07-25 23:46:17.226335 W | etcdserver: read-only range request "key:\"/kubernetes.io/statefulsets\" range_end:\"/kubernetes.io/statefulsett\" count_only:true " with result "error:etcdserver: request timed out" took too long (16.712383983s) to execute
Searching a bit, I found this link: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean
so I simply increased the CPU resources from 8 cores to 14 cores and that worked for me.
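Before adding CPU it may also be worth ruling out slow disks, which the same FAQ entry mentions; etcd exposes its own latency histograms on its metrics endpoint (cert paths and endpoint assumed from a default install):
curl -s --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt https://127.0.0.1:2379/metrics | grep -E 'wal_fsync_duration|backend_commit_duration'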
I have been running OpenShift Origin for a long time. I did a blue/green upgrade from 3.7 to 3.11 two weeks ago and have noticed a significant increase in load and CPU usage on the masters since. It is not uncommon to see a load of 6 or more while the cluster is relatively idle. The cluster consists of the following virtual hosts running KVM:
Node 1: CentOS 7, 16 cores, 96G of RAM, 4x128G SSD in RAID 10
Node 2: Ubuntu 16.04, 4c/8t CPU, 16G RAM, 256G SSD
Node 3: Ubuntu 16.04, 4c/8t CPU, 16G RAM, 256G SSD
Node 4: Ubuntu 16.04, 4c CPU, 16G RAM, 256G SSD
The cluster is currently made up of the following:
3 masters, 1 origin lb, 2 infra nodes, 4 app nodes
All have 50G of Docker storage allocated to /var/lib/origin/openshift.local.volumes.
The VMs are placed as follows:
Node 1: master1: 7 vcpu, 8G RAM; infra2: 3 vcpu, 4G RAM; app4: 5 vcpu, 16G RAM
Node 2: infra1: 1 vcpu, 2G RAM; app1: 3 vcpu, 8G RAM
Node 3: master2: 2 vcpu, 8G RAM; app2: 3 vcpu, 10G RAM
Node 4: master3: 3 vcpu, 8G RAM; app3: 2 vcpu, 8G RAM
Now, the load problem seems to affect only a single master at a time, and I have noticed that by rebooting the masters I can "control" which master picks up the load. When I check tools like 'top', the per-process CPU percentages do not add up to the reported 1/5/15-minute load averages. Using the built-in Grafana I am unable to find a significant load source either. My suspicion is that this is related to moving all of the control-plane services inside Docker.
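One thing worth checking when the load average is not explained by top's CPU columns: on Linux, tasks in uninterruptible (D-state) sleep, typically blocked on I/O, count toward the load average without consuming CPU. A quick way to list them:
ps -eo state,pid,wchan:32,cmd | awk '$1 ~ /^D/'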
Additionally, there does not appear to be a space issue, as none of the VMs exceeds 50% usage in / or /var, and /var/lib/origin/openshift.local.volumes on the masters is hovering around 10% or less.
journalctl and /var/log/messages show no errors at log level 2, so the load does not appear to stem from an outright failure. I checked the minimum recommended requirements and they have not been raised substantially between releases. I see this problem even if I increase the masters' RAM; as shown below, while the problem is occurring the masters rarely use more than 2G of the allocated 8G.
These are the current pods that are deployed:
Here is the Ansible hosts file used for the install:
Version
Steps To Reproduce
Current Result
Example from master1
The current CPU and memory usage according to Grafana is as follows (screenshots omitted) for each of: Webconsole, Service broker, OpenShift SDN, OpenShift Node, OpenShift Monitoring, Metrics server, OpenShift Infra, OpenShift console, Kube-system, Kube service catalog, and Default.
Expected Result
I expect the load average on the masters to be similar to Origin 3.7, or at the very least that the output of tools like 'top' would identify where the extra load is coming from.