Metrics: Bandwidth (Throughput), Packet Forwarding Rate, Network Latency
Environment: Host: CentOS 7, Kubernetes v1.2.0 with flannel.
iperf3 results: 465 Mbits/sec pod-to-pod across hosts, 934 Mbits/sec host-to-host.
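For reference, a minimal sketch of how these numbers can be reproduced (the exact flags used are not recorded; iperf3 must be installed at both endpoints):

```bash
# Server side (inside a pod, or directly on the host being measured):
iperf3 -s

# Client side: 10-second TCP throughput test against the server's pod IP / host IP:
iperf3 -c <server-ip> -t 10

# Latency is measured with plain ping against the same address:
ping <server-ip>
```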
Ping latency, pod-to-pod (across hosts):
64 bytes from 10.1.16.2: icmp_seq=1897 ttl=62 time=0.520 ms
64 bytes from 10.1.16.2: icmp_seq=1898 ttl=62 time=0.547 ms
64 bytes from 10.1.16.2: icmp_seq=1899 ttl=62 time=0.508 ms
64 bytes from 10.1.16.2: icmp_seq=1900 ttl=62 time=0.501 ms
64 bytes from 10.1.16.2: icmp_seq=1901 ttl=62 time=0.452 ms
64 bytes from 10.1.16.2: icmp_seq=1902 ttl=62 time=0.468 ms
64 bytes from 10.1.16.2: icmp_seq=1903 ttl=62 time=0.498 ms
64 bytes from 10.1.16.2: icmp_seq=1904 ttl=62 time=0.534 ms
64 bytes from 10.1.16.2: icmp_seq=1905 ttl=62 time=0.650 ms
64 bytes from 10.1.16.2: icmp_seq=1906 ttl=62 time=0.611 ms
64 bytes from 10.1.16.2: icmp_seq=1907 ttl=62 time=0.509 ms
64 bytes from 10.1.16.2: icmp_seq=1908 ttl=62 time=0.571 ms
Ping latency, host-to-host:
64 bytes from 172.24.3.165: icmp_seq=23 ttl=64 time=0.369 ms
64 bytes from 172.24.3.165: icmp_seq=24 ttl=64 time=0.269 ms
64 bytes from 172.24.3.165: icmp_seq=25 ttl=64 time=0.414 ms
64 bytes from 172.24.3.165: icmp_seq=26 ttl=64 time=0.408 ms
64 bytes from 172.24.3.165: icmp_seq=27 ttl=64 time=0.370 ms
64 bytes from 172.24.3.165: icmp_seq=28 ttl=64 time=0.540 ms
64 bytes from 172.24.3.165: icmp_seq=29 ttl=64 time=0.390 ms
64 bytes from 172.24.3.165: icmp_seq=30 ttl=64 time=0.302 ms
64 bytes from 172.24.3.165: icmp_seq=31 ttl=64 time=0.360 ms
64 bytes from 172.24.3.165: icmp_seq=32 ttl=64 time=0.307 ms
64 bytes from 172.24.3.165: icmp_seq=33 ttl=64 time=0.318 ms
64 bytes from 172.24.3.165: icmp_seq=34 ttl=64 time=0.333 ms
64 bytes from 172.24.3.165: icmp_seq=35 ttl=64 time=0.331 ms
64 bytes from 172.24.3.165: icmp_seq=36 ttl=64 time=0.415 ms
64 bytes from 172.24.3.165: icmp_seq=37 ttl=64 time=0.383 ms
64 bytes from 172.24.3.165: icmp_seq=38 ttl=64 time=0.440 ms
64 bytes from 172.24.3.165: icmp_seq=39 ttl=64 time=0.362 ms
64 bytes from 172.24.3.165: icmp_seq=40 ttl=64 time=0.298 ms
64 bytes from 172.24.3.165: icmp_seq=41 ttl=64 time=0.329 ms
64 bytes from 172.24.3.165: icmp_seq=42 ttl=64 time=0.318 ms
64 bytes from 172.24.3.165: icmp_seq=43 ttl=64 time=0.357 ms
Looking into using VLAN or L2 mode.
Some useful links:
Try using the GCE mode; this requires a lot of testing. Set the `--configure-cbr0=true` flag to see whether the cbr0 bridge works. Deploy the kube master following https://coreos.com/kubernetes/docs/latest/getting-started.html, then deploy the worker and configure cbr0 by starting the kubelet (worker) with this systemd unit:
[Service]
ExecStartPre=/usr/bin/mkdir -p /etc/kubernetes/manifests
Environment=KUBELET_VERSION=v1.2.4_coreos.1
ExecStart=/usr/lib/coreos/kubelet-wrapper \
--pod_infra_container_image=typhoon1986/pause:2.0 \
--api-servers=http://172.24.3.150:8080 \
--network-plugin-dir=/etc/kubernetes/cni/net.d \
--network-plugin=${NETWORK_PLUGIN} \
--register-schedulable=false \
--allow-privileged=true \
--config=/etc/kubernetes/manifests \
--hostname-override=172.24.3.150 \
--cluster-dns=10.0.0.10 \
--cluster-domain=cluster.local
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
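After writing the unit, it can be (re)loaded and started the usual way (a sketch; the unit name `kubelet.service` is an assumption):

```bash
sudo systemctl daemon-reload
sudo systemctl enable kubelet.service
sudo systemctl restart kubelet.service
```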
Using `journalctl -xef -n 1000 -t kubelet-wrapper` to show the logs, it seems the kubelet is not configuring cbr0:
Jun 13 05:38:26 hlg-wuyi-coreos-02 kubelet-wrapper[4922]: I0613 05:38:26.585822 4922 kubelet.go:2365] skipping pod synchronization - [ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now]
Jun 13 05:38:31 hlg-wuyi-coreos-02 kubelet-wrapper[4922]: I0613 05:38:31.586851 4922 kubelet.go:2365] skipping pod synchronization - [ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now]
Jun 13 05:38:31 hlg-wuyi-coreos-02 kubelet-wrapper[4922]: W0613 05:38:31.727614 4922 kubelet.go:2780] ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now
Jun 13 05:38:36 hlg-wuyi-coreos-02 kubelet-wrapper[4922]: I0613 05:38:36.587136 4922 kubelet.go:2365] skipping pod synchronization - [ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now]
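The "PodCIDR not set" message means the node object has not been assigned a pod CIDR yet. A quick way to check, plus the controller-manager flags that enable CIDR allocation (the node name is taken from `--hostname-override` above; the cluster CIDR is an assumed example value):

```bash
# Does the node object already carry a podCIDR?
kubectl get node 172.24.3.150 -o yaml | grep podCIDR

# For --configure-cbr0 to work, kube-controller-manager must allocate node CIDRs, e.g.:
#   --allocate-node-cidrs=true
#   --cluster-cidr=10.6.0.0/16    # example value, not taken from this cluster
```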
I changed the kubelet-wrapper for configure-cbr0, adding `--net=host` and a brctl volume; after that the cbr0 bridge was created successfully:
set -e
if [ -z "${KUBELET_VERSION}" ]; then
echo "ERROR: must set KUBELET_VERSION"
exit 1
fi
KUBELET_ACI="${KUBELET_ACI:-quay.io/coreos/hyperkube}"
mkdir --parents /etc/kubernetes
mkdir --parents /var/lib/docker
mkdir --parents /var/lib/kubelet
mkdir --parents /run/kubelet
exec /usr/bin/rkt run \
--volume etc-kubernetes,kind=host,source=/etc/kubernetes \
--volume etc-ssl-certs,kind=host,source=/usr/share/ca-certificates \
--volume var-lib-docker,kind=host,source=/var/lib/docker \
--volume var-lib-kubelet,kind=host,source=/var/lib/kubelet \
--volume os-release,kind=host,source=/usr/lib/os-release \
--volume run,kind=host,source=/run \
--volume brctl,kind=host,source=/sbin/brctl \
--mount volume=etc-kubernetes,target=/etc/kubernetes \
--mount volume=etc-ssl-certs,target=/etc/ssl/certs \
--mount volume=var-lib-docker,target=/var/lib/docker \
--mount volume=var-lib-kubelet,target=/var/lib/kubelet \
--mount volume=os-release,target=/etc/os-release \
--mount volume=run,target=/run \
--mount volume=brctl,target=/sbin/brctl \
--trust-keys-from-https \
$RKT_OPTS --net=host \
--stage1-from-dir=stage1-fly.aci \
${KUBELET_ACI}:${KUBELET_VERSION} --exec=/kubelet -- "$@"
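Once the node has a PodCIDR, the bridge can be verified from the host (a quick sanity check, not part of the original notes):

```bash
# The bridge should exist and hold the first address of the node's PodCIDR:
brctl show cbr0
ip addr show cbr0
ip route | grep cbr0
```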
Networking performance comparison using GCE mode. Ping latency between hosts:
64 bytes from 10.6.2.2: icmp_seq=8 ttl=63 time=0.393 ms
64 bytes from 10.6.2.2: icmp_seq=9 ttl=63 time=0.412 ms
64 bytes from 10.6.2.2: icmp_seq=10 ttl=63 time=0.433 ms
64 bytes from 10.6.2.2: icmp_seq=11 ttl=63 time=0.368 ms
64 bytes from 10.6.2.2: icmp_seq=12 ttl=63 time=0.530 ms
64 bytes from 10.6.2.2: icmp_seq=13 ttl=63 time=0.392 ms
64 bytes from 10.6.2.2: icmp_seq=14 ttl=63 time=0.382 ms
64 bytes from 10.6.2.2: icmp_seq=15 ttl=63 time=0.397 ms
64 bytes from 10.6.2.2: icmp_seq=16 ttl=63 time=0.362 ms
64 bytes from 10.6.2.2: icmp_seq=17 ttl=63 time=0.421 ms
64 bytes from 10.6.2.2: icmp_seq=18 ttl=63 time=0.389 ms
64 bytes from 10.6.2.2: icmp_seq=19 ttl=63 time=0.417 ms
Ping latency between pods (across hosts):
64 bytes from 10.6.2.2: seq=0 ttl=62 time=0.578 ms
64 bytes from 10.6.2.2: seq=1 ttl=62 time=0.537 ms
64 bytes from 10.6.2.2: seq=2 ttl=62 time=0.436 ms
64 bytes from 10.6.2.2: seq=3 ttl=62 time=0.404 ms
64 bytes from 10.6.2.2: seq=4 ttl=62 time=0.584 ms
64 bytes from 10.6.2.2: seq=5 ttl=62 time=0.565 ms
64 bytes from 10.6.2.2: seq=6 ttl=62 time=0.504 ms
64 bytes from 10.6.2.2: seq=7 ttl=62 time=0.439 ms
64 bytes from 10.6.2.2: seq=8 ttl=62 time=0.476 ms
64 bytes from 10.6.2.2: seq=9 ttl=62 time=0.430 ms
between hosts, bandwidth:
936 Mbits/sec
between pods, bandwidth:
933 Mbits/sec
912 Mbits/sec
Conclusion: in GCE mode, the bandwidth between pods is roughly the same as the bandwidth between hosts, but latency is still higher.
Will test L2 mode later.
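Switching flannel to host-gw is done through its network config in etcd; a hedged sketch (the subnet matches the 10.8.x.x pod addresses below, but the exact key/value used here was not recorded; host-gw also requires the hosts to be directly reachable on the same L2 segment):

```bash
etcdctl set /coreos.com/network/config \
  '{"Network": "10.8.0.0/16", "Backend": {"Type": "host-gw"}}'
```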
Using flannel "host-gw" mode. Ping latency between pods (across hosts):
64 bytes from 10.8.19.2: seq=14 ttl=62 time=0.509 ms
64 bytes from 10.8.19.2: seq=15 ttl=62 time=0.568 ms
64 bytes from 10.8.19.2: seq=16 ttl=62 time=0.590 ms
64 bytes from 10.8.19.2: seq=17 ttl=62 time=0.460 ms
64 bytes from 10.8.19.2: seq=18 ttl=62 time=0.377 ms
64 bytes from 10.8.19.2: seq=19 ttl=62 time=0.389 ms
64 bytes from 10.8.19.2: seq=20 ttl=62 time=0.392 ms
64 bytes from 10.8.19.2: seq=21 ttl=62 time=0.562 ms
64 bytes from 10.8.19.2: seq=22 ttl=62 time=0.487 ms
64 bytes from 10.8.19.2: seq=23 ttl=62 time=0.441 ms
64 bytes from 10.8.19.2: seq=24 ttl=62 time=0.532 ms
64 bytes from 10.8.19.2: seq=25 ttl=62 time=0.560 ms
64 bytes from 10.8.19.2: seq=26 ttl=62 time=0.563 ms
64 bytes from 10.8.19.2: seq=27 ttl=62 time=0.398 ms
ping latency between hosts:
64 bytes from 172.24.3.220: icmp_seq=11 ttl=64 time=0.405 ms
64 bytes from 172.24.3.220: icmp_seq=12 ttl=64 time=0.381 ms
64 bytes from 172.24.3.220: icmp_seq=13 ttl=64 time=0.355 ms
64 bytes from 172.24.3.220: icmp_seq=14 ttl=64 time=0.351 ms
64 bytes from 172.24.3.220: icmp_seq=15 ttl=64 time=0.367 ms
64 bytes from 172.24.3.220: icmp_seq=16 ttl=64 time=0.414 ms
64 bytes from 172.24.3.220: icmp_seq=17 ttl=64 time=0.372 ms
64 bytes from 172.24.3.220: icmp_seq=18 ttl=64 time=0.328 ms
64 bytes from 172.24.3.220: icmp_seq=19 ttl=64 time=0.355 ms
64 bytes from 172.24.3.220: icmp_seq=20 ttl=64 time=0.334 ms
64 bytes from 172.24.3.220: icmp_seq=21 ttl=64 time=0.295 ms
bandwidth between pods (across hosts):
[ 4] local 10.8.92.2 port 34686 connected to 10.8.19.2 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 110 MBytes 920 Mbits/sec 2 406 KBytes
[ 4] 1.00-2.00 sec 112 MBytes 937 Mbits/sec 0 584 KBytes
[ 4] 2.00-3.00 sec 112 MBytes 942 Mbits/sec 0 717 KBytes
[ 4] 3.00-4.00 sec 112 MBytes 942 Mbits/sec 0 827 KBytes
[ 4] 4.00-5.00 sec 112 MBytes 940 Mbits/sec 0 925 KBytes
[ 4] 5.00-6.00 sec 112 MBytes 938 Mbits/sec 0 1015 KBytes
[ 4] 6.00-7.00 sec 112 MBytes 937 Mbits/sec 0 1.07 MBytes
[ 4] 7.00-8.00 sec 112 MBytes 941 Mbits/sec 0 1.15 MBytes
[ 4] 8.00-9.00 sec 112 MBytes 944 Mbits/sec 0 1.22 MBytes
[ 4] 9.00-10.00 sec 111 MBytes 933 Mbits/sec 0 1.28 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1.09 GBytes 937 Mbits/sec 2 sender
[ 4] 0.00-10.00 sec 1.09 GBytes 935 Mbits/sec receiver
bandwidth between hosts:
[ 4] local 172.24.3.221 port 59212 connected to 172.24.3.220 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 112 MBytes 943 Mbits/sec 2 419 KBytes
[ 4] 1.00-2.00 sec 112 MBytes 939 Mbits/sec 0 591 KBytes
[ 4] 2.00-3.00 sec 113 MBytes 946 Mbits/sec 0 721 KBytes
[ 4] 3.00-4.00 sec 112 MBytes 940 Mbits/sec 1 829 KBytes
[ 4] 4.00-5.00 sec 113 MBytes 945 Mbits/sec 0 929 KBytes
[ 4] 5.00-6.00 sec 112 MBytes 938 Mbits/sec 0 1018 KBytes
[ 4] 6.00-7.00 sec 99.7 MBytes 836 Mbits/sec 0 1.06 MBytes
[ 4] 7.00-8.00 sec 113 MBytes 946 Mbits/sec 0 1.14 MBytes
[ 4] 8.00-9.00 sec 112 MBytes 943 Mbits/sec 0 1.21 MBytes
[ 4] 9.00-10.00 sec 112 MBytes 944 Mbits/sec 0 1.28 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1.08 GBytes 932 Mbits/sec 3 sender
[ 4] 0.00-10.00 sec 1.08 GBytes 929 Mbits/sec receiver
I just thought of another approach:
_Question:_ how can this approach be combined with cloud-config to start the worker automatically?
This approach has been verified to work! nginx can now reach more than 90% of bare-metal performance, but softirq handling becomes the bottleneck; I'm still looking for a solution.
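One way to watch where the softirq load lands while the benchmark runs (a sketch; `mpstat` requires the sysstat package):

```bash
# Per-CPU utilization including %soft, refreshed every second:
mpstat -P ALL 1

# Raw NET_RX/NET_TX softirq counters per CPU:
watch -d -n 1 cat /proc/softirqs
```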
I tried pinning the NIC queues to different CPUs, using `echo [affinity] > /proc/irq/[irq_id]/smp_affinity` to set the affinity of each queue. Testing showed that when the queues are spread across different CPUs, performance also drops sharply; performance is highest only when everything is pinned to a single CPU. This is presumably because nginx was started with only one worker process.
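A sketch of the pinning loop described above, assuming the NIC is eth0 and everything is pinned to CPU0 (mask 0x1); adjust the interface name and mask as needed:

```bash
# Pin every eth0 queue interrupt to CPU0 (run as root):
for irq in $(awk -F: '/eth0/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
  echo 1 > /proc/irq/${irq}/smp_affinity
done
```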
As the `perf top` output in the figure above shows, the calls consuming the most CPU are all in the kernel's netfilter/iptables code. When the Docker daemon starts, it adds some default iptables rules, as follows:
Chain FORWARD (policy ACCEPT)
target prot opt source destination
DOCKER-ISOLATION all -- anywhere anywhere
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
Chain DOCKER (1 references)
target prot opt source destination
Chain DOCKER-ISOLATION (1 references)
target prot opt source destination
RETURN all -- anywhere anywhere
@kingyueyang Is it possible to make Docker not install these rules? What impact would that have?
I couldn't find the Docker daemon's startup configuration file, but I exported the iptables rules as follows:
core@kubernetes-master ~ $ sudo iptables-save
# Generated by iptables-save v1.4.21 on Mon Jul 25 21:12:55 2016
*nat
:PREROUTING ACCEPT [510:20400]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [4:240]
:POSTROUTING ACCEPT [4:240]
:DOCKER - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODEPORTS - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SEP-ZCDIBLDR7Z235RP3 - [0:0]
:KUBE-SERVICES - [0:0]
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-SEP-ZCDIBLDR7Z235RP3 -s 172.24.101.1/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZCDIBLDR7Z235RP3 -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 172.24.101.1:443
-A KUBE-SERVICES -d 10.100.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-ZCDIBLDR7Z235RP3
COMMIT
# Completed on Mon Jul 25 21:12:55 2016
Using `--iptables=false` prevents Docker from modifying iptables, but in theory that also loses all forwarding functionality.
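If we did want to try it, one possible way to pass the flag is a systemd drop-in (a sketch, assuming the host's docker.service reads a `DOCKER_OPTS` environment variable as CoreOS's stock unit does; otherwise the ExecStart line would have to be overridden instead):

```bash
# Write a drop-in that adds --iptables=false to the daemon options:
sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/docker.service.d/no-iptables.conf
[Service]
Environment="DOCKER_OPTS=--iptables=false"
EOF
sudo systemctl daemon-reload && sudo systemctl restart docker
```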
Use the following script to unload the iptables kernel modules:
#!/bin/bash
#/sbin/modprobe --version 2>&1 | grep -q module-init-tools \
# && NEW_MODUTILS=1 \
# || NEW_MODUTILS=0
NEW_MODUTILS=1
IPTABLES=iptables
IPV=ip
PROC_IPTABLES_NAMES=/proc/net/${IPV}_tables_names
NF_TABLES=$(cat "$PROC_IPTABLES_NAMES" 2>/dev/null)
echo $PROC_IPTABLES_NAMES
echo $NF_TABLES
rmmod_r() {
# Unload module with all referring modules.
# At first all referring modules will be unloaded, then the module itself.
local mod=$1
local ret=0
local ref=
# Get referring modules.
# New modutils have another output format.
[ $NEW_MODUTILS = 1 ] \
&& ref=$(lsmod | awk "/^${mod}/ { print \$4; }" | tr ',' ' ') \
|| ref=$(lsmod | grep ^${mod} | cut -d "[" -s -f 2 | cut -d "]" -s -f 1)
# recursive call for all referring modules
for i in $ref; do
rmmod_r $i
let ret+=$?;
done
# Unload module.
# The extra test is for 2.6: The module might have autocleaned,
# after all referring modules are unloaded.
if grep -q "^${mod}" /proc/modules ; then
modprobe -r $mod > /dev/null 2>&1
res=$?
[ $res -eq 0 ] || echo -n " $mod"
let ret+=$res;
fi
return $ret
}
flush_n_delete() {
# Flush firewall rules and delete chains.
[ ! -e "$PROC_IPTABLES_NAMES" ] && return 0
# Check if firewall is configured (has tables)
[ -z "$NF_TABLES" ] && return 1
echo -n $"${IPTABLES}: Flushing firewall rules: "
ret=0
# For all tables
for i in $NF_TABLES; do
# Flush firewall rules.
$IPTABLES -t $i -F;
let ret+=$?;
# Delete firewall chains.
$IPTABLES -t $i -X;
let ret+=$?;
# Set counter to zero.
$IPTABLES -t $i -Z;
let ret+=$?;
done
#[ $ret -eq 0 ] && success || failure
echo
return $ret
}
flush_n_delete
rmmod_r iptable_nat
rmmod_r nf_nat_ipv4
rmmod_r iptable_filter
rmmod_r ip_tables
rmmod_r x_tables
rmmod_r nf_nat
rmmod_r nf_conntrack
Adding the drop-in configuration below to docker.service unloads the netfilter and conntrack kernel modules after the service starts, which completely prevents Docker from using iptables. Reference: https://github.com/docker/docker/issues/16964#event-630903918
# vim /etc/systemd/system/docker.service.d/modprobe.conf
[Service]
ExecStartPost=-/bin/systemctl restart iptables
ExecStartPost=-/sbin/modprobe -r nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_conntrack
Then restart the Docker service:
# systemctl daemon-reload
# systemctl restart docker
# lsmod | grep conntrack
The related kernel modules have indeed been unloaded. Running the ab test again afterwards, the result reaches 100% of bare-metal performance!
[root@k8s-node-1 ~]# ab -c 50 -n 90000 http://172.24.103.2/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 172.24.103.2 (be patient)
Completed 9000 requests
Completed 18000 requests
Completed 27000 requests
Completed 36000 requests
Completed 45000 requests
Completed 54000 requests
Completed 63000 requests
Completed 72000 requests
Completed 81000 requests
Completed 90000 requests
Finished 90000 requests
Server Software: nginx/1.11.1
Server Hostname: 172.24.103.2
Server Port: 80
Document Path: /
Document Length: 612 bytes
Concurrency Level: 50
Time taken for tests: 6.055 seconds
Complete requests: 90000
Failed requests: 0
Total transferred: 76050000 bytes
HTML transferred: 55080000 bytes
Requests per second: 14864.62 [#/sec] (mean)
Time per request: 3.364 [ms] (mean)
Time per request: 0.067 [ms] (mean, across all concurrent requests)
Transfer rate: 12266.21 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.7 0 6
Processing: 0 3 1.4 3 14
Waiting: 0 3 1.4 3 14
Total: 2 3 1.5 3 15
Percentage of the requests served within a certain time (ms)
50% 3
66% 3
75% 3
80% 3
90% 5
95% 5
98% 6
99% 14
100% 15 (longest request)
But this introduces other problems:
All of kube-proxy's commands for setting up Services will fail. We would then need to change how kube-proxy works, or expose every Service as type LoadBalancer?
How should Kubernetes choose among flannel, Calico, and the other underlying network plugins, and how do they compare in performance?