opsnull / follow-me-install-kubernetes-cluster

Deploy a Kubernetes cluster with me, step by step

grpc: the connection is closing; please retry. #570

Open BlackSunday001 opened 4 years ago

BlackSunday001 commented 4 years ago

Component versions

k8s version: v1.14.2, etcd version: 3.3.11

Cluster:

Hostname Role IP OS version Kernel version
node01.tracy.com node01 10.0.20.31 CentOS 7.7 5.4.0-1.el7.elrepo.x86_64
node02.tracy.com node02 10.0.20.32 CentOS 7.7 5.4.0-1.el7.elrepo.x86_64
node03.tracy.com node03 10.0.20.33 CentOS 7.7 5.4.0-1.el7.elrepo.x86_64

Component configuration files

etcd configuration file

[root@node02 ~]# cat /etc/systemd/system/etcd.service 
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/data/k8s/etcd/data
ExecStart=/opt/k8s/bin/etcd \
  --data-dir=/data/k8s/etcd/data \
  --wal-dir=/data/k8s/etcd/wal \
  --name=node02 \
  --cert-file=/etc/etcd/cert/etcd.pem \
  --key-file=/etc/etcd/cert/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/cert/ca.pem \
  --peer-cert-file=/etc/etcd/cert/etcd.pem \
  --peer-key-file=/etc/etcd/cert/etcd-key.pem \
  --peer-trusted-ca-file=/etc/kubernetes/cert/ca.pem \
  --peer-client-cert-auth \
  --client-cert-auth \
  --listen-peer-urls=https://10.0.20.32:2380 \
  --initial-advertise-peer-urls=https://10.0.20.32:2380 \
  --listen-client-urls=https://10.0.20.32:2379,http://127.0.0.1:2379 \
  --advertise-client-urls=https://10.0.20.32:2379 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=node01=https://10.0.20.31:2380,node02=https://10.0.20.32:2380,node03=https://10.0.20.33:2380 \
  --initial-cluster-state=new \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=1 \
  --max-request-bytes=33554432 \
  --quota-backend-bytes=6442450944 \
  --heartbeat-interval=250 \
  --election-timeout=2000
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

etcd cluster status

[root@node01 ~]# etcdctl \
> --endpoint=https://10.0.20.31:2379 \
> --ca-file=/etc/kubernetes/cert/ca.pem \
> --cert-file=/etc/etcd/cert/etcd.pem \
> --key-file=/etc/etcd/cert/etcd-key.pem cluster-health
member 20efe5e6128d9e63 is healthy: got healthy result from https://10.0.20.31:2379
member 7cf960cbc106b63f is healthy: got healthy result from https://10.0.20.33:2379
member e6886445a833720c is healthy: got healthy result from https://10.0.20.32:2379
cluster is healthy
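The cluster-health check above goes through the etcd v2 API, while kube-apiserver 1.14 talks to etcd over the v3 gRPC API. It can be worth checking health over the v3 API with the same certificates; a diagnostic sketch reusing the endpoints and cert paths from the unit file above:

```shell
# Check etcd health over the v3 gRPC API (the API kube-apiserver actually uses).
# Note the v3 flag names differ from v2: --cacert/--cert/--key.
ETCDCTL_API=3 /opt/k8s/bin/etcdctl \
  --endpoints=https://10.0.20.31:2379,https://10.0.20.32:2379,https://10.0.20.33:2379 \
  --cacert=/etc/kubernetes/cert/ca.pem \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  endpoint health
```

If v2 reports healthy but the v3 check fails, the problem is specific to the gRPC path the apiserver depends on.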

flanneld configuration file

[root@node02 ~]# cat /etc/systemd/system/flanneld.service 
[Unit]
Description=Flanneld overlay address etcd agent
After=network.target
After=network-online.target
Wants=network-online.target
After=etcd.service
Before=docker.service

[Service]
Type=notify
ExecStart=/opt/k8s/bin/flanneld \
  -etcd-cafile=/etc/kubernetes/cert/ca.pem \
  -etcd-certfile=/etc/flanneld/cert/flanneld.pem \
  -etcd-keyfile=/etc/flanneld/cert/flanneld-key.pem \
  -etcd-endpoints=https://10.0.20.31:2379,https://10.0.20.32:2379,https://10.0.20.33:2379 \
  -etcd-prefix=/kubernetes/network \
  -iface=bond0 \
  -ip-masq
ExecStartPost=/opt/k8s/bin/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/docker
Restart=always
RestartSec=5
StartLimitInterval=0

[Install]
WantedBy=multi-user.target
RequiredBy=docker.service

apiserver configuration file

[root@node02 ~]# cat /etc/systemd/system/kube-apiserver.service 
[Unit]
Description=Kubernetes API Server
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=network.target

[Service]
WorkingDirectory=/data/k8s/k8s/kube-apiserver
ExecStart=/opt/k8s/bin/kube-apiserver \
  --advertise-address=10.0.20.32 \
  --default-not-ready-toleration-seconds=360 \
  --default-unreachable-toleration-seconds=360 \
  --feature-gates=DynamicAuditing=true \
  --max-mutating-requests-inflight=2000 \
  --max-requests-inflight=4000 \
  --default-watch-cache-size=200 \
  --delete-collection-workers=2 \
  --encryption-provider-config=/etc/kubernetes/encryption-config.yaml \
  --etcd-cafile=/etc/kubernetes/cert/ca.pem \
  --etcd-certfile=/etc/kubernetes/cert/kubernetes.pem \
  --etcd-keyfile=/etc/kubernetes/cert/kubernetes-key.pem \
  --etcd-servers=https://10.0.20.31:2379,https://10.0.20.32:2379,https://10.0.20.33:2379 \
  --bind-address=10.0.20.32 \
  --secure-port=6443 \
  --tls-cert-file=/etc/kubernetes/cert/kubernetes.pem \
  --tls-private-key-file=/etc/kubernetes/cert/kubernetes-key.pem \
  --insecure-port=0 \
  --audit-dynamic-configuration \
  --audit-log-maxage=15 \
  --audit-log-maxbackup=3 \
  --audit-log-maxsize=100 \
  --audit-log-truncate-enabled \
  --audit-log-path=/data/k8s/k8s/kube-apiserver/audit.log \
  --audit-policy-file=/etc/kubernetes/audit-policy.yaml \
  --profiling \
  --anonymous-auth=false \
  --client-ca-file=/etc/kubernetes/cert/ca.pem \
  --enable-bootstrap-token-auth \
  --requestheader-allowed-names="aggregator" \
  --requestheader-client-ca-file=/etc/kubernetes/cert/ca.pem \
  --requestheader-extra-headers-prefix="X-Remote-Extra-" \
  --requestheader-group-headers=X-Remote-Group \
  --requestheader-username-headers=X-Remote-User \
  --service-account-key-file=/etc/kubernetes/cert/ca.pem \
  --authorization-mode=Node,RBAC \
  --runtime-config=api/all=true \
  --enable-admission-plugins=NodeRestriction \
  --allow-privileged=true \
  --apiserver-count=3 \
  --event-ttl=168h \
  --kubelet-certificate-authority=/etc/kubernetes/cert/ca.pem \
  --kubelet-client-certificate=/etc/kubernetes/cert/kubernetes.pem \
  --kubelet-client-key=/etc/kubernetes/cert/kubernetes-key.pem \
  --kubelet-https=true \
  --kubelet-timeout=10s \
  --proxy-client-cert-file=/etc/kubernetes/cert/proxy-client.pem \
  --proxy-client-key-file=/etc/kubernetes/cert/proxy-client-key.pem \
  --service-cluster-ip-range=10.254.0.0/16 \
  --service-node-port-range=1024-32767 \
  --logtostderr=true \
  --v=2
Restart=on-failure
RestartSec=10
Type=notify
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

nginx configuration file

[root@node02 ~]# cat /opt/k8s/kube-nginx/conf/kube-nginx.conf 
worker_processes 1;
events {
    worker_connections  1024;
}
stream {
    log_format proxy '$remote_addr [$time_local]'
                '$protocol $status $bytes_sent $bytes_received'
                '$session_time "$upstream_addr" '
                '"$upstream_bytes_sent" "$upstream_bytes_received" "$upstream_connect_time"';
    access_log /opt/k8s/kube-nginx/logs/access.log proxy ;
    upstream backend {
        hash  consistent;
        server 10.0.20.31:6443        max_fails=3 fail_timeout=30s;
        server 10.0.20.32:6443        max_fails=3 fail_timeout=30s;
        server 10.0.20.33:6443        max_fails=3 fail_timeout=30s;
    }
    server {
        listen *:8443;
        proxy_connect_timeout 1s;
        proxy_pass backend;
    }
}

I have also tested with the nginx listen address set to 127.0.0.1.
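To rule out the nginx layer, one can confirm the stream proxy is listening and that TLS from a backend apiserver passes through untouched. A quick sketch; since a `stream` block does not terminate TLS, the certificate presented on 8443 should be the backend apiserver's kubernetes.pem:

```shell
# Confirm kube-nginx is listening on 8443
ss -lntp | grep 8443

# The stream proxy passes TLS through, so the subject shown here
# should match the apiserver's serving certificate
echo | openssl s_client -connect 127.0.0.1:8443 2>/dev/null | openssl x509 -noout -subject
```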

Symptoms

1. Before starting the apiserver

Everything is normal before the apiserver is started, but once the apiserver is configured and running,

etcd status becomes abnormal and it starts logging errors:

[root@node01 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-11-29 12:18:52 CST; 25min ago
     Docs: https://github.com/coreos
 Main PID: 1342 (etcd)
   CGroup: /system.slice/etcd.service
           └─1342 /opt/k8s/bin/etcd --data-dir=/data/k8s/etcd/data --wal-dir=/data/k8s/etcd/wal --name=node01 --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem -...

Nov 29 12:18:54 node01.tracy.com etcd[1342]: rejected connection from "10.0.20.31:39976" (error "EOF", ServerName "")
Nov 29 12:18:54 node01.tracy.com etcd[1342]: rejected connection from "10.0.20.31:39986" (error "EOF", ServerName "")

2. Errors from the apiserver itself

With the etcd cluster confirmed healthy, the system log (messages) shows errors like:

Nov 29 12:18:55 node01 kube-apiserver: W1129 12:18:55.730235    1339 asm_amd64.s:1337] Failed to dial 10.0.20.32:2379: grpc: the connection is closing; please retry.
Nov 29 12:18:55 node01 kube-apiserver: W1129 12:18:55.730298    1339 asm_amd64.s:1337] Failed to dial 10.0.20.33:2379: grpc: the connection is closing; please retry.

These errors keep appearing; after a while the log looks like this:

Nov 29 10:50:47 node01 kube-apiserver: I1129 10:50:47.741463   13902 storage_rbac.go:284] created rolebinding.rbac.authorization.k8s.io/system::leader-locking-kube-scheduler in kube-system
Nov 29 10:50:47 node01 kube-apiserver: I1129 10:50:47.781732   13902 storage_rbac.go:284] created rolebinding.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-system
Nov 29 10:50:47 node01 kube-apiserver: I1129 10:50:47.821448   13902 storage_rbac.go:284] created rolebinding.rbac.authorization.k8s.io/system:controller:cloud-provider in kube-system
Nov 29 10:50:47 node01 kube-apiserver: I1129 10:50:47.861600   13902 storage_rbac.go:284] created rolebinding.rbac.authorization.k8s.io/system:controller:token-cleaner in kube-system
Nov 29 10:50:47 node01 kube-apiserver: I1129 10:50:47.901997   13902 storage_rbac.go:284] created rolebinding.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-public
Nov 29 10:50:47 node01 kube-apiserver: W1129 10:50:47.985238   13902 lease.go:222] Resetting endpoints for master service "kubernetes" to [10.0.20.31]
Nov 29 10:50:47 node01 kube-apiserver: I1129 10:50:47.985971   13902 controller.go:606] quota admission added evaluator for: endpoints

Judging from these last log lines, the apiserver seems to be back to normal.

It looks healthy, but in practice it is not:

[root@node01 ~]# kubectl get cs
Error from server (BadRequest): the server rejected our request for an unknown reason
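When kubectl misbehaves but the apiserver log looks clean, probing the health endpoints directly can help separate client-side from server-side problems. A sketch using the existing kubeconfig (no extra cert paths assumed):

```shell
# Overall apiserver health through the current kubeconfig
kubectl get --raw='/healthz'

# The etcd-specific health check exposed by the apiserver; if this fails
# while /healthz components pass, the apiserver-to-etcd path is the problem
kubectl get --raw='/healthz/etcd'
```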

My tests

1. Downgraded etcd to 3.2.x: the problem persists

2. Downgraded the kernel to 4.18: the problem persists

3. Changed the kubeconfig to connect directly to the local apiserver: the problem persists

Could you please help me take a look at where the problem might be? Thank you.

It is possible that one of my seemingly correct steps is actually wrong, but so far I have not been able to find the problem.

webnginx886 commented 4 years ago

I have also hit this error: asm_amd64.s:1337] Failed to dial 192.168.56.13:2379: grpc: the connection is closing; please retry.