Open divfor opened 3 years ago
NTP_SERVER is reported as unbound; the only workaround is to run NTP_SERVER=xxx ./install.sh
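Known issue. It was fixed earlier, but I forgot to rebuild the installation package. I have just rebuilt it; please download it again and retry: https://github.com/k8sli/kubeplay/releases/tag/v0.1.0-alpha.3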
TASK [cluster/bootstrap-os : Configure offline resources repository on apt package manager] ************************
changed: [node1]
changed: [node2]
Sunday 05 September 2021 15:10:28 +0000 (0:00:00.591) 0:00:05.231 ******
Sunday 05 September 2021 15:10:28 +0000 (0:00:00.046) 0:00:05.278 ******
TASK [cluster/bootstrap-os : Update apt repository cache] **********************************************************
fatal: [node2]: FAILED! => changed=false
msg: 'Failed to update apt cache: E:The method driver /usr/lib/apt/methods/192.168.100.25 could not be found., W:Is the package apt-transport-192.168.100.25 installed?, E:Failed to fetch 192.168.100.25://8080/ubuntu/amd64/bionic/InRelease , E:Some index files failed to download. They have been ignored, or old ones used instead.'
fatal: [node1]: FAILED! => changed=false
msg: 'Failed to update apt cache: E:The method driver /usr/lib/apt/methods/192.168.100.25 could not be found., W:Is the package apt-transport-192.168.100.25 installed?, E:Failed to fetch 192.168.100.25://8080/ubuntu/amd64/bionic/InRelease , E:Some index files failed to download. They have been ignored, or old ones used instead.'
NO MORE HOSTS LEFT *************************************************************************************************
PLAY RECAP *********************************************************************************************************
node1 : ok=9 changed=3 unreachable=0 failed=1 skipped=17 rescued=0 ignored=0
node2 : ok=9 changed=3 unreachable=0 failed=1 skipped=23 rescued=0 ignored=0
Sunday 05 September 2021 15:11:00 +0000 (0:00:31.961) 0:00:37.240 ******
===============================================================================
cluster/bootstrap-os : Update apt repository cache --------------------------------------------------------- 31.96s
Gather minimal facts ---------------------------------------------------------------------------------------- 1.09s
download : download | Download files / images --------------------------------------------------------------- 0.86s
cluster/bootstrap-os : Configure offline resources repository on apt package manager ------------------------ 0.59s
Gather necessary facts (hardware) --------------------------------------------------------------------------- 0.54s
Gather necessary facts (network) ---------------------------------------------------------------------------- 0.40s
cluster/bootstrap-os : Backup system default package manager repo file -------------------------------------- 0.32s
cluster/bootstrap-os : Create remote_tmp for it is used by another module ----------------------------------- 0.28s
cluster/bootstrap-os : gather os specific variables --------------------------------------------------------- 0.13s
cluster/bootstrap-os : include_tasks ------------------------------------------------------------------------ 0.06s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts --------------------------------------------- 0.05s
container-engine/nerdctl : nerdctl | Copy nerdctl binary from download dir ---------------------------------- 0.05s
download : download | Get kubeadm binary and list of required images ---------------------------------------- 0.05s
download : prep_download | Set image pull/info command for containerd and crio on localhost ----------------- 0.05s
cluster/bootstrap-os : Configure offline resources repository on yum package manager ------------------------ 0.05s
kubespray-defaults : Configure defaults --------------------------------------------------------------------- 0.05s
download : prep_download | Create staging directory on remote node ------------------------------------------ 0.05s
download : prep_download | Set image pull/info command for containerd and crio ------------------------------ 0.05s
container-engine/crictl : install crictĺ -------------------------------------------------------------------- 0.05s
container-engine/nerdctl : nerdctl | Download nerdctl ------------------------------------------------------- 0.04s
###### 01-cluster-bootstrap-os installation failed ######
192.168.100.25://8080/ubuntu/amd64/bionic/InRelease
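This URL looks wrong; the configuration file may have been filled in incorrectly.
In the root directory of the installation package, run grep 'offline_resources_url' config/kubespray/env.yml and check whether the setting is correct.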
root@fredvb:~/kubeplay# grep 'offline_resources_url' config/kubespray/env.yml
offline_resources_url: 192.168.100.25:8080
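The value printed above has no http:// prefix, which matches the apt error: with no scheme, apt takes the text before the first colon (192.168.100.25) as the transport method name. A minimal sketch of a check for this, assuming the value is expected to be a full http(s) URL; the helper name has_http_scheme is hypothetical and not part of kubeplay:

```shell
#!/bin/bash
# Hypothetical helper (not part of kubeplay): succeeds only if the value
# starts with http:// or https://.
has_http_scheme() {
  case "$1" in
    http://*|https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Value copied from the grep output above.
url="192.168.100.25:8080"

if has_http_scheme "$url"; then
  echo "ok: $url"
else
  # ${url%%:*} keeps everything before the first colon, which is the part
  # apt treats as the method name when no scheme is present.
  echo "missing scheme: $url (apt would read '${url%%:*}' as the transport method)"
fi
```

The sketch only diagnoses the value; it does not edit config.yaml or env.yml.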
When run repeatedly, it randomly aborts with the error on the last line:
INFO[0000] Creating container nginx
INFO[0000] Creating container registry
✔ The registry container is running.
✔ The nginx container is running.
✖ Error: the http://192.168.100.25:8080/certs/rootCA.crt website is not running, and the status code is 000!
Please post your config.yaml configuration file.
This one occurs every time:
✔ Updated the apt list file
E: Failed to fetch file:/root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages File not found - /root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages (2: No such file or directory)
E: Some index files failed to download. They have been ignored, or old ones used instead.
root@fredvb:~/kubeplay# cat config.yaml
compose:
  # Compose bootstrap node ip, default is local internal ip
  internal_ip: 192.168.100.25
  # Nginx http server bind port for download files and packages
  nginx_http_port: 8080
  # Registry domain for CRI runtime download images
  registry_domain: kube.registry.local
kubespray:
  # Kubernetes version by default, only support v1.20.6
  kube_version: v1.21.4
  # For deploy HA cluster you must configure a external apiserver access ip
  external_apiserver_access_ip: 192.168.100.5
  # Set network plugin to calico with vxlan mode by default
  kube_network_plugin: calico
  #Container runtime, only support containerd if offline deploy
  container_manager: containerd
  # Now only support host if use containerd as CRI runtime
  etcd_deployment_type: host
  # Settings for etcd event server
  etcd_events_cluster_setup: true
  etcd_events_cluster_enabled: true
# Cluster nodes inventory info
inventory:
  all:
    vars:
      ansible_port: 22
      ansible_user: root
      ansible_ssh_pass: q1w2e3r4
      # ansible_ssh_private_key_file: /kubespray/config/id_rsa
    hosts:
      node1:
        ansible_host: 192.168.100.4
      node2:
        ansible_host: 192.168.100.5
    children:
      kube_control_plane:
        hosts:
          node2:
      kube_node:
        hosts:
          node1:
      etcd:
        hosts:
          node2:
      k8s_cluster:
        children:
          kube_control_plane:
          kube_node:
      gpu:
        hosts: {}
      calico_rr:
        hosts: {}
### Default parameters ###
## This filed not need config, will auto update,
## if no special requirement, do not modify these parameters.
default:
  # NTP server ip address or domain, default is internal_ip
  ntp_server:
    - 192.168.100.25
  # Registry ip address, default is internal_ip
  registry_ip: 192.168.100.25
  # Offline resource url for download files, default is internal_ip:nginx_http_port
  offline_resources_url: 192.168.100.25:8080
  # Use nginx and registry provide all offline resources
  offline_resources_enabled: true
  # Image repo in registry
  image_repository: library
  # Kubespray container image for deploy user cluster or scale
  kubespray_image: "kube.registry.local/library/kubespray:v2.16.0-154-geb42915a"
  # Auto generate self-signed certificate for registry domain
  generate_domain_crt: true
  # For nodes pull image, use 443 as default
  registry_https_port: 443
  # For push image to this registry, use 5000 as default, and only bind at 127.0.0.1
  registry_push_port: 5000
  # Set false to disable download all container images on all nodes
  download_container: false
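Unless you have special requirements, leave the parameters in the default field as they were; they do not need to be modified. The documentation may be unclear here; I will revise it shortly.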
I reverted the default section, and now I am back to the following error:
TASK [cluster/bootstrap-os : Configure offline resources repository on apt package manager] ************************
changed: [node1]
changed: [node2]
Sunday 05 September 2021 16:25:26 +0000 (0:00:00.613) 0:00:05.384 ******
Sunday 05 September 2021 16:25:26 +0000 (0:00:00.046) 0:00:05.431 ******
TASK [cluster/bootstrap-os : Update apt repository cache] **********************************************************
fatal: [node2]: FAILED! => changed=false
msg: 'Failed to update apt cache: unknown reason'
fatal: [node1]: FAILED! => changed=false
msg: 'Failed to update apt cache: unknown reason'
NO MORE HOSTS LEFT *************************************************************************************************
PLAY RECAP *********************************************************************************************************
node1 : ok=9 changed=2 unreachable=0 failed=1 skipped=17 rescued=0 ignored=0
node2 : ok=9 changed=2 unreachable=0 failed=1 skipped=23 rescued=0 ignored=0
Sunday 05 September 2021 16:28:29 +0000 (0:03:03.812) 0:03:09.243 ******
===============================================================================
cluster/bootstrap-os : Update apt repository cache -------------------------------------------------------- 183.81s
Gather minimal facts ---------------------------------------------------------------------------------------- 1.11s
download : download | Download files / images --------------------------------------------------------------- 0.87s
cluster/bootstrap-os : Configure offline resources repository on apt package manager ------------------------ 0.61s
Gather necessary facts (hardware) --------------------------------------------------------------------------- 0.54s
Gather necessary facts (network) ---------------------------------------------------------------------------- 0.41s
cluster/bootstrap-os : Backup system default package manager repo file -------------------------------------- 0.27s
cluster/bootstrap-os : Create remote_tmp for it is used by another module ----------------------------------- 0.26s
download : prep_download | Create local cache for files and images on control node -------------------------- 0.13s
kubespray-defaults : Populates no_proxy to all hosts -------------------------------------------------------- 0.10s
cluster/bootstrap-os : gather os specific variables --------------------------------------------------------- 0.08s
cluster/bootstrap-os : include_tasks ------------------------------------------------------------------------ 0.06s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts --------------------------------------------- 0.06s
download : prep_download | Set image pull/info command for containerd and crio on localhost ----------------- 0.05s
container-engine/crictl : install crictĺ -------------------------------------------------------------------- 0.05s
download : prep_download | Set image pull/info command for docker on localhost ------------------------------ 0.05s
download : prep_download | Check that local user is in group or can become root ----------------------------- 0.05s
download : prep_download | Set a few facts ------------------------------------------------------------------ 0.05s
kubespray-defaults : Configure defaults --------------------------------------------------------------------- 0.05s
download : prep_download | Set image pull/info command for docker ------------------------------------------- 0.05s
✖ ###### 01-cluster-bootstrap-os installation failed ######
root@fredvb:~/kubeplay#
You may have downloaded the wrong installation package. The system is Ubuntu 18.04; is the package you downloaded also for 18.04?
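Both are 18.04. It looks like iptables is not set up correctly: after nerdctl starts the containers, iptables does not allow ports 8080 and 443.
After I manually added iptables -A FORWARD -p tcp --dport 8080 -j ACCEPT, the 'Failed to update apt cache: unknown reason' error went away.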
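The manual rule can be made safe to rerun. A minimal sketch, assuming the same FORWARD rule is wanted for both 8080 and 443; the function name and the DRY_RUN switch are hypothetical and not part of kubeplay. It uses iptables -C to check for an existing rule before appending, so repeated runs do not stack duplicates:

```shell
#!/bin/bash
# Sketch: make sure FORWARD accepts TCP traffic to the nginx and registry ports.
# DRY_RUN defaults to 1, so the command is only printed.
# Set DRY_RUN=0 and run as root to actually apply the rules.
DRY_RUN="${DRY_RUN:-1}"

ensure_forward_accept() {
  local port="$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "iptables -A FORWARD -p tcp -m tcp --dport $port -j ACCEPT"
    return 0
  fi
  # -C returns success if an identical rule already exists; append only if not.
  iptables -C FORWARD -p tcp -m tcp --dport "$port" -j ACCEPT 2>/dev/null ||
    iptables -A FORWARD -p tcp -m tcp --dport "$port" -j ACCEPT
}

for port in 8080 443; do
  ensure_forward_accept "$port"
done
```

Rules added this way are not persistent across reboots, and the sketch does not explain why the rules were missing in the first place.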
E: Failed to fetch file:/root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages File not found - /root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages (2: No such file or directory)
E: Some index files failed to download. They have been ignored, or old ones used instead.
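Run ls to check whether this path exists. This error occurs because the downloaded installation package version does not match the OS 🤔.
It does not exist; there is only one gz file and two directories: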
root@fredvb:~/kubeplay/resources/nginx/ubuntu/amd64/bionic# ls
archive.ubuntu.com download.docker.com Packages.gz
My installation package is kubeplay-v0.1.0-alpha.3-ubuntu-bionic-amd64.tar.gz, and the nodes are all Ubuntu Server 18.04.5.
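About this local repo: I remember one of your documents mentioning that, when building directly FROM nginx:1.9.1, the two COPY --from [bionic|focal] /ubuntu /usr/share/nginx/html lines are wrong. Changing them to COPY --from [bionic|focal] /ubuntu /usr/share/nginx/html/ubuntu worked for me. For the case above, the path seems to be different again.
That document also mentions that type=tar can produce a tar archive for import, but the entrypoint is lost on import, so the built-in nginx does not start. To fix that, add the --change 'CMD /usr/sbin/nginx -g "daemon off;"' option when importing.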
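I found two more failure points:
If a node already has a newer version of containerd installed, the install reports that the allow-downgrade option was not given, gives up, and exits with an error;
With exactly the same kernel version, 4.15.0-154-generic #161-Ubuntu, some nodes have no bridge-nf-call-iptables entry and exit with an error: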
fatal: [node1]: FAILED! => changed=false
msg: |-
Failed to reload sysctl: net.ipv4.ip_forward = 1
net.ipv4.ip_local_reserved_ports = 30000-32767
sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory
sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-ip6tables: No such file or directory
changed: [node2]
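I test with virtual machines created from each Linux distribution's cloud-init images. Installation cannot be guaranteed to succeed on systems that have been modified or that have conflicting packages installed.
bridge-nf-call-iptables is a kernel parameter that must be enabled; I recommend installing on a fresh machine.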
modprobe br_netfilter solved this problem: https://blog.csdn.net/shida_csdn/article/details/99571884
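The sysctl failure above comes from the missing /proc/sys/net/bridge entries, which appear once br_netfilter is loaded. A minimal sketch of a pre-flight check, assuming a systemd-based node where /etc/modules-load.d is honored at boot; the function name and the .conf file name are hypothetical. It only prints advice and changes nothing:

```shell
#!/bin/bash
# Returns 0 if the bridge netfilter sysctl entry exists under the given proc root.
# The proc root is a parameter only so the check can be exercised against a
# scratch directory; it defaults to /proc.
bridge_nf_available() {
  local proc_root="${1:-/proc}"
  [ -e "$proc_root/sys/net/bridge/bridge-nf-call-iptables" ]
}

if bridge_nf_available; then
  echo "bridge-nf-call-iptables is present"
else
  echo "bridge-nf-call-iptables is missing; as root, run: modprobe br_netfilter"
  # Assumption: systemd-modules-load reads this directory at boot.
  echo "to load it at every boot: echo br_netfilter > /etc/modules-load.d/br_netfilter.conf"
fi
```

Running this on each node before install.sh would flag the nodes that need the module, instead of discovering them one failed play at a time.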
root@node2:~# ll /etc/apt/sources.list.d/offline-resources.list*
-rw-r--r-- 1 root root 66 Sep 6 15:18 /etc/apt/sources.list.d/offline-resources.list
-rw-r--r-- 1 root root 66 Sep 6 14:51 /etc/apt/sources.list.d/offline-resources.list.bak
root@node2:~# apt update
Err:1 http://192.168.100.25:8080/ubuntu/amd64 bionic/ InRelease
Could not connect to 192.168.100.25:8080 (192.168.100.25). - connect (111: Connection refused)
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://192.168.100.25:8080/ubuntu/amd64/bionic/InRelease Could not connect to 192.168.100.25:8080 (192.168.100.25). - connect (111: Connection refused)
W: Some index files failed to download. They have been ignored, or old ones used instead.
root@fredvb:~/kubeplay/resources/nginx/ubuntu/amd64/bionic# tree -L 2
.
├── archive.ubuntu.com
│ └── ubuntu
├── download.docker.com
│ └── linux
└── Packages.gz
4 directories, 1 file
It finally succeeded once, after I removed cgroup v2 and rebooted.
===============================================================================
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates --------------------------------------------------------------------------- 4.58s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources -------------------------------------------------------------------------------------- 4.52s
download : download | Download files / images ---------------------------------------------------------------------------------------------------- 0.81s
Gather minimal facts ----------------------------------------------------------------------------------------------------------------------------- 0.65s
Gather necessary facts (hardware) ---------------------------------------------------------------------------------------------------------------- 0.60s
kubernetes-apps/ansible : Kubernetes Apps | Wait for kube-apiserver ------------------------------------------------------------------------------ 0.53s
Gather necessary facts (network) ----------------------------------------------------------------------------------------------------------------- 0.42s
kubernetes-apps/ansible : Kubernetes Apps | Delete kubeadm CoreDNS ------------------------------------------------------------------------------- 0.35s
kubernetes-apps/ansible : Kubernetes Apps | Register coredns deployment annotation `createdby` --------------------------------------------------- 0.31s
kubernetes-apps/ansible : Kubernetes Apps | Delete kubeadm Kube-DNS service ---------------------------------------------------------------------- 0.24s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down nodelocaldns Template ----------------------------------------------------------------------- 0.19s
kubernetes-apps/metallb : Kubernetes Apps | Install and configure MetalLB ------------------------------------------------------------------------ 0.18s
kubernetes-apps/metallb : Kubernetes Apps | Set apparmor_enabled --------------------------------------------------------------------------------- 0.14s
kubespray-defaults : Set no_proxy to all assigned cluster IPs and hostnames ---------------------------------------------------------------------- 0.14s
kubernetes-apps/external_cloud_controller/openstack : External OpenStack Cloud Controller | Generate Manifests ----------------------------------- 0.13s
kubernetes-apps/container_engine_accelerator/nvidia_gpu : Container Engine Acceleration Nvidia GPU | Create manifests for nvidia accelerators ---- 0.11s
kubernetes-apps/csi_driver/cinder : Cinder CSI Driver | Write cacert file ------------------------------------------------------------------------ 0.10s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts ---------------------------------------------------------------------------------- 0.10s
download : prep_download | On localhost, check if passwordless root is possible ------------------------------------------------------------------ 0.10s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down Secondary CoreDNS Template ------------------------------------------------------------------ 0.09s
✔ ###### 05-cluster-apps successfully installed ######
✔ ###### kubernetes cluster successfully installed ######
These are the things I still have to fix manually for now:
#!/bin/bash
# one shot
# iptables -A FORWARD -p tcp -m tcp --dport 443 -j ACCEPT
# iptables -A FORWARD -p tcp -m tcp --dport 8080 -j ACCEPT
# for i in nodes; do ssh $i modprobe br_netfilter; done
for h in x99u d9020 fredvb; do
  ssh "$h" 'rm -rf /etc/apt/sources.list.d/offline-resources.list*'
done
It is strange that no iptables rules are added to allow ports 8080 and 443 for the two containers started by nerdctl.
These are the things I still have to fix manually for now:
#!/bin/bash
# one shot
# iptables -A FORWARD -p tcp -m tcp --dport 443 -j ACCEPT
# iptables -A FORWARD -p tcp -m tcp --dport 8080 -j ACCEPT
for h in x99u d9020 fredvb; do
  ssh $h 'rm -rf /etc/apt/sources.list.d/offline-resources.list*'
done
This will be fixed in a later release: these leftover files will be cleaned up on removal.
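config/compose/certs/ is supposed to contain two files, but they ended up as directories, so nginx fails to load the certificates on startup. Also, registry:5000 in nginx.conf does not seem to be replaced with the IP automatically. After fixing both by hand, it works.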