I have noticed another interesting thing in this version. Even though I am running 3 masters, if I lose any one of them, the master API service becomes unavailable, so running "oc get pods" returns the following:
[root@origin-311-master1 ~]# oc get pods
Unable to connect to the server: EOF
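As a quick sanity check while a master is down, the API health endpoint on each master can be probed directly (hostname and port as in my environment; /healthz is the standard apiserver health path):
curl -k https://master.local:8443/healthz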
It's likely something is driving a significant amount of traffic to the masters. Please check Prometheus / the metrics endpoints for the apiserver and find the apiserver_request_count numbers so we can eliminate a rogue client.
I created a gist of the metrics here: https://gist.github.com/stratus-ss/ffc630d76d62d7808631184ad259cf51
TL;DR: these are the biggest offenders:
curl -s -k --cert /etc/origin/master/admin.crt --key /etc/origin/master/admin.key https://master.local:8443/metrics |grep apiserver_request_count |sort -k 4 -n
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="services",scope="namespace",subresource="",verb="GET"} 7699
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="namespaces",scope="cluster",subresource="",verb="GET"} 10288
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="nodes",scope="cluster",subresource="status",verb="PATCH"} 13272
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="nodes",scope="cluster",subresource="",verb="GET"} 13274
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0/leader-election",code="200",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="PUT"} 15222
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="GET"} 17435
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="namespace",subresource="",verb="GET"} 26117
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="serviceaccounts",scope="namespace",subresource="",verb="GET"} 26117
apiserver_request_count{client="hyperkube/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="namespace",subresource="",verb="GET"} 31946
apiserver_request_count{client="service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election",code="200",contentType="application/json",resource="configmaps",scope="namespace",subresource="",verb="PUT"} 37935
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0/leader-election",code="200",contentType="application/vnd.kubernetes.protobuf",resource="configmaps",scope="namespace",subresource="",verb="GET"} 45810
apiserver_request_count{client="service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election",code="200",contentType="application/json",resource="configmaps",scope="namespace",subresource="",verb="GET"} 62067
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0",code="200",contentType="application/vnd.kubernetes.protobuf",resource="apiservices",scope="cluster",subresource="status",verb="PUT"} 103986
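For a per-client view rather than grepping raw metrics, the same data can be pulled from the bundled Prometheus (a sketch, assuming the monitoring stack is running; substitute your actual Prometheus route for the placeholder):
curl -s -k -G -H "Authorization: Bearer $(oc whoami -t)" --data-urlencode 'query=topk(10, sum by (client, resource, verb) (rate(apiserver_request_count[5m])))' https://<prometheus-route>/api/v1/query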
Is this sufficient or do you have other curl commands that you would like me to run?
Hello, I might be affected as well. I have a similar setup to @stratus-ss. When I restart a master node, neither etcd nor the api-server (both in the kube-system project) will start successfully on the restarted node; right now they are both in CrashLoopBackOff state. This happens to me periodically after each cluster reinstallation when I restart a master node.
etcd crashes because it does not pass its health check in time, and its logs are full of messages containing text like "unexpected EOF" and "i/o timeout". I measured the network speed between the masters and there is 1 Gbit of bandwidth, so I do not understand the timeouts. I am attaching logs from etcd and the api-server.
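For what it's worth, member health can also be checked from a master with etcdctl; the certificate paths below assume a default 3.11 etcd install:
etcdctl --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key --ca-file=/etc/etcd/ca.crt --endpoints=https://127.0.0.1:2379 cluster-health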
Further troubleshooting:
I have taken the following actions without much effect:
I set the following daemonsets to a "dummy" nodeSelector (see the patch sketch below):
apiserver, controller-manager, node-exporter
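A patch along these lines achieves that (a sketch only; the "dummy-selector" label is an arbitrary placeholder and the namespace differs per daemonset):
oc -n openshift-monitoring patch daemonset node-exporter --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"dummy-selector":"true"}}}}}'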
In the openshift-monitoring project I have also scaled the following down to 0 replicas:
oc scale deployments.apps/prometheus-operator --replicas=0
oc scale deployments.apps/kube-state-metrics --replicas=0
oc scale deployments.apps/grafana --replicas=0
oc scale deployments.apps/cluster-monitoring-operator --replicas=0
oc scale statefulset.apps/prometheus-k8s --replicas=0
oc scale statefulset.apps/alertmanager-main --replicas=0
I also tried scaling down the Ansible service broker:
oc project openshift-ansible-service-broker
oc scale deploymentconfig.apps.openshift.io/asb --replicas=0
None of these actions has made any significant impact on the cluster.
Related to #19240?
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
I had a similar issue: every night my OKD cluster consumed all CPU resources. After reviewing etcd's logs, I found the following line:
2020-07-25 23:46:17.226335 W | etcdserver: read-only range request "key:\"/kubernetes.io/statefulsets\" range_end:\"/kubernetes.io/statefulsett\" count_only:true " with result "error:etcdserver: request timed out" took too long (16.712383983s) to execute
Searching a bit, I found this link: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean
so I simply increased the CPU resources from 8 cores to 14 cores and that worked for me.
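Before adding CPU it may also be worth ruling out slow disks, which the same FAQ entry mentions; etcd exposes its own latency histograms on its metrics endpoint (cert paths and endpoint assumed from a default install):
curl -s --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt https://127.0.0.1:2379/metrics | grep -E 'wal_fsync_duration|backend_commit_duration'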
I have been running OpenShift Origin for a long time. I did a blue/green upgrade from 3.7 to 3.11 two weeks ago and have noticed a significant increase in load and CPU usage on the masters since. It is not uncommon to see a load of 6 or more while the cluster is relatively idle. The cluster consists of the following virtual hosts running KVM:
Node 1: CentOS 7, 16 cores, 96G of RAM, 4x128G SSD in RAID 10
Node 2: Ubuntu 16.04, 4c/8t CPU, 16G RAM, 256G SSD
Node 3: Ubuntu 16.04, 4c/8t CPU, 16G RAM, 256G SSD
Node 4: Ubuntu 16.04, 4c CPU, 16G RAM, 256G SSD
The cluster is currently made up of the following:
3 masters, 1 origin lb, 2 infra nodes, 4 app nodes
All have 50G of Docker storage allocated to /var/lib/origin/openshift.local.volumes.
The VMs are placed as follows:
Node 1: master1: 7 vcpu, 8G RAM; infra2: 3 vcpu, 4G RAM; app4: 5 vcpu, 16G RAM
Node 2: infra1: 1 vcpu, 2G RAM; app1: 3 vcpu, 8G RAM
Node 3: master2: 2 vcpu, 8G RAM; app2: 3 vcpu, 10G RAM
Node 4: master3: 3 vcpu, 8G RAM; app3: 2 vcpu, 8G RAM
Now, the load problem seems to affect only a single master at a time, and I have noticed that by rebooting the masters I can "control" which master picks up the load. When I check tools like 'top', the per-process CPU percentages do not add up to the reported 1/5/15-minute load averages. Using the built-in Grafana I am unable to find a significant load source either. My suspicion is that this is related to moving all of the control-plane services inside Docker.
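One thing worth checking when the load average is not explained by top's CPU columns: on Linux, tasks in uninterruptible (D-state) sleep, typically blocked on I/O, count toward the load average without consuming CPU. A quick way to list them:
ps -eo state,pid,wchan:32,cmd | awk '$1 ~ /^D/'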
Additionally, there does not appear to be a space issue, as none of the VMs exceeds 50% usage in / or /var, and /var/lib/origin/openshift.local.volumes on the masters is hovering around 10% or less.
journalctl and /var/log/messages show no errors at log level 2, so the load does not appear to stem from an outright failure. I checked the minimum recommended requirements and they have not been raised substantially between releases. I see this problem even if I increase the masters' RAM; as shown below, while the problem is occurring the masters rarely use more than 2G of the allocated 8G.
These are the current pods that are deployed:
Here is the Ansible hosts file used for the install:
Version
Steps To Reproduce
Current Result
Example from master1
The current CPU and memory usage according to Grafana is as follows (screenshots omitted) for each of: Webconsole, Service broker, OpenShift SDN, OpenShift Node, OpenShift Monitoring, Metrics server, OpenShift Infra, OpenShift console, Kube-system, Kube service catalog, and Default.
Expected Result
I expect the load average on the masters to be similar to Origin 3.7, or at the very least that the output of tools like 'top' would identify where the extra load is coming from.