k8sli / kubeplay

Deploy Kubernetes offline with kubespray
https://t.me/kubeplay
Apache License 2.0

bug(config): some parameters are incorrect #20

Open divfor opened 3 years ago

divfor commented 3 years ago

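config/compose/certs/ is supposed to contain two files, but they ended up as directories, so nginx fails to load the certificates on startup. Also, registry:5000 in nginx.conf does not seem to be replaced with the IP automatically. After fixing both by hand, it works.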

divfor commented 3 years ago

NTP_SERVER is reported as an unbound variable; the only way around it is to run NTP_SERVER=xxx ./install.sh.

muzi502 commented 3 years ago

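This is a known issue. I fixed it earlier but forgot to rebuild the installation package. I have just rebuilt it; please download it again and retry: https://github.com/k8sli/kubeplay/releases/tag/v0.1.0-alpha.3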

divfor commented 3 years ago
TASK [cluster/bootstrap-os : Configure offline resources repository on apt package manager] ************************
changed: [node1]
changed: [node2]
Sunday 05 September 2021  15:10:28 +0000 (0:00:00.591)       0:00:05.231 ******
Sunday 05 September 2021  15:10:28 +0000 (0:00:00.046)       0:00:05.278 ******

TASK [cluster/bootstrap-os : Update apt repository cache] **********************************************************
fatal: [node2]: FAILED! => changed=false
  msg: 'Failed to update apt cache: E:The method driver /usr/lib/apt/methods/192.168.100.25 could not be found., W:Is the package apt-transport-192.168.100.25 installed?, E:Failed to fetch 192.168.100.25://8080/ubuntu/amd64/bionic/InRelease  , E:Some index files failed to download. They have been ignored, or old ones used instead.'
fatal: [node1]: FAILED! => changed=false
  msg: 'Failed to update apt cache: E:The method driver /usr/lib/apt/methods/192.168.100.25 could not be found., W:Is the package apt-transport-192.168.100.25 installed?, E:Failed to fetch 192.168.100.25://8080/ubuntu/amd64/bionic/InRelease  , E:Some index files failed to download. They have been ignored, or old ones used instead.'

NO MORE HOSTS LEFT *************************************************************************************************

PLAY RECAP *********************************************************************************************************
node1                      : ok=9    changed=3    unreachable=0    failed=1    skipped=17   rescued=0    ignored=0
node2                      : ok=9    changed=3    unreachable=0    failed=1    skipped=23   rescued=0    ignored=0

Sunday 05 September 2021  15:11:00 +0000 (0:00:31.961)       0:00:37.240 ******
===============================================================================
cluster/bootstrap-os : Update apt repository cache --------------------------------------------------------- 31.96s
Gather minimal facts ---------------------------------------------------------------------------------------- 1.09s
download : download | Download files / images --------------------------------------------------------------- 0.86s
cluster/bootstrap-os : Configure offline resources repository on apt package manager ------------------------ 0.59s
Gather necessary facts (hardware) --------------------------------------------------------------------------- 0.54s
Gather necessary facts (network) ---------------------------------------------------------------------------- 0.40s
cluster/bootstrap-os : Backup system default package manager repo file -------------------------------------- 0.32s
cluster/bootstrap-os : Create remote_tmp for it is used by another module ----------------------------------- 0.28s
cluster/bootstrap-os : gather os specific variables --------------------------------------------------------- 0.13s
cluster/bootstrap-os : include_tasks ------------------------------------------------------------------------ 0.06s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts --------------------------------------------- 0.05s
container-engine/nerdctl : nerdctl | Copy nerdctl binary from download dir ---------------------------------- 0.05s
download : download | Get kubeadm binary and list of required images ---------------------------------------- 0.05s
download : prep_download | Set image pull/info command for containerd and crio on localhost ----------------- 0.05s
cluster/bootstrap-os : Configure offline resources repository on yum package manager ------------------------ 0.05s
kubespray-defaults : Configure defaults --------------------------------------------------------------------- 0.05s
download : prep_download | Create staging directory on remote node ------------------------------------------ 0.05s
download : prep_download | Set image pull/info command for containerd and crio ------------------------------ 0.05s
container-engine/crictl : install crictĺ -------------------------------------------------------------------- 0.05s
container-engine/nerdctl : nerdctl | Download nerdctl ------------------------------------------------------- 0.04s
 ######  01-cluster-bootstrap-os installation failed  ######
muzi502 commented 3 years ago

The URL in 192.168.100.25://8080/ubuntu/amd64/bionic/InRelease looks wrong; the config file may have been filled in incorrectly.

Run grep 'offline_resources_url' config/kubespray/env.yml in the root directory of the installation package and check whether the setting is wrong.

divfor commented 3 years ago

root@fredvb:~/kubeplay# grep 'offline_resources_url' config/kubespray/env.yml
offline_resources_url: 192.168.100.25:8080
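
The offline_resources_url value here has no URL scheme. Judging by the earlier error message, apt appears to treat the text before the first colon as the transport method, which would explain why it looks for a method driver named 192.168.100.25. Below is a minimal sketch of a guard that adds a scheme when one is missing. The function name ensure_scheme and the http default are assumptions made for illustration; neither is part of kubeplay.

```shell
#!/bin/bash
# Hypothetical helper (not part of kubeplay): prefix a URL with a scheme
# if it does not already carry one. The "http" default is an assumption.
ensure_scheme() {
  local url="$1"
  local scheme="${2:-http}"
  case "$url" in
    *://*) printf '%s\n' "$url" ;;                 # already has a scheme
    *)     printf '%s://%s\n' "$scheme" "$url" ;;  # add the default scheme
  esac
}

# Example:
#   ensure_scheme 192.168.100.25:8080   # prints http://192.168.100.25:8080
```

A value that already starts with a scheme is passed through unchanged, so the guard is safe to apply to either form.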

divfor commented 3 years ago

Across repeated runs, it randomly aborts with the error on the last line:

INFO[0000] Creating container nginx
INFO[0000] Creating container registry
✔ The registry container is running.
✔ The nginx container is running.
✖ Error: the http://192.168.100.25:8080/certs/rootCA.crt website is not running, and the status code is 000!
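
A status code of 000 is what curl prints for %{http_code} when it received no HTTP response at all. One possible explanation for a failure that only shows up on some runs is that the check fires before nginx accepts connections. That is an assumption and is not confirmed anywhere in this thread. Under that assumption, a retry loop along these lines would tolerate a slow start. The function name wait_for_http and its defaults are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: poll a URL until it answers with HTTP 200,
# giving up after a fixed number of attempts.
wait_for_http() {
  local url="$1"
  local tries="${2:-10}"
  local delay="${3:-1}"
  local code=""
  local i=1
  while [ "$i" -le "$tries" ]; do
    # curl reports 000 when the connection itself failed
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || true
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up on $url after $tries attempts (last status: $code)" >&2
  return 1
}

# Usage:
#   wait_for_http http://192.168.100.25:8080/certs/rootCA.crt 15 2
```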
muzi502 commented 3 years ago

Please post your config.yaml.

divfor commented 3 years ago

This one shows up on every run:

✔ Updated the apt list file
E: Failed to fetch file:/root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages  File not found - /root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages (2: No such file or directory)
E: Some index files failed to download. They have been ignored, or old ones used instead.
divfor commented 3 years ago
root@fredvb:~/kubeplay# cat config.yaml
compose:
  # Compose bootstrap node ip, default is local internal ip
  internal_ip: 192.168.100.25
  # Nginx http server bind port for download files and packages
  nginx_http_port: 8080
  # Registry domain for CRI runtime download images
  registry_domain: kube.registry.local
kubespray:
  # Kubernetes version by default, only support v1.20.6
  kube_version: v1.21.4
  # For deploy HA cluster you must configure a external apiserver access ip
  external_apiserver_access_ip: 192.168.100.5
  # Set network plugin to calico with vxlan mode by default
  kube_network_plugin: calico
  #Container runtime, only support containerd if offline deploy
  container_manager: containerd
  # Now only support host if use containerd as CRI runtime
  etcd_deployment_type: host
  # Settings for etcd event server
  etcd_events_cluster_setup: true
  etcd_events_cluster_enabled: true
# Cluster nodes inventory info
inventory:
  all:
    vars:
      ansible_port: 22
      ansible_user: root
      ansible_ssh_pass: q1w2e3r4
      # ansible_ssh_private_key_file: /kubespray/config/id_rsa
    hosts:
      node1:
        ansible_host: 192.168.100.4
      node2:
        ansible_host: 192.168.100.5
    children:
      kube_control_plane:
        hosts:
          node2:
      kube_node:
        hosts:
          node1:
      etcd:
        hosts:
          node2:
      k8s_cluster:
        children:
          kube_control_plane:
          kube_node:
      gpu:
        hosts: {}
      calico_rr:
        hosts: {}
### Default parameters ###
## This filed not need config, will auto update,
## if no special requirement, do not modify these parameters.
default:
  # NTP server ip address or domain, default is internal_ip
  ntp_server:
    - 192.168.100.25
  # Registry ip address, default is internal_ip
  registry_ip: 192.168.100.25
  # Offline resource url for download files, default is internal_ip:nginx_http_port
  offline_resources_url: 192.168.100.25:8080
  # Use nginx and registry provide all offline resources
  offline_resources_enabled: true
  # Image repo in registry
  image_repository: library
  # Kubespray container image for deploy user cluster or scale
  kubespray_image: "kube.registry.local/library/kubespray:v2.16.0-154-geb42915a"
  # Auto generate self-signed certificate for registry domain
  generate_domain_crt: true
  # For nodes pull image, use 443 as default
  registry_https_port: 443
  # For push image to this registry, use 5000 as default, and only bind at 127.0.0.1
  registry_push_port: 5000
  # Set false to disable download all container images on all nodes
  download_container: false
muzi502 commented 3 years ago

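Unless you have special requirements, leave the parameters under the default field as they are; they do not need to be modified. The documentation may be unclear on this point, and I will revise it shortly.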

divfor commented 3 years ago

I reverted the default section, and now I am back to the following error:

TASK [cluster/bootstrap-os : Configure offline resources repository on apt package manager] ************************
changed: [node1]
changed: [node2]
Sunday 05 September 2021  16:25:26 +0000 (0:00:00.613)       0:00:05.384 ******
Sunday 05 September 2021  16:25:26 +0000 (0:00:00.046)       0:00:05.431 ******

TASK [cluster/bootstrap-os : Update apt repository cache] **********************************************************
fatal: [node2]: FAILED! => changed=false
  msg: 'Failed to update apt cache: unknown reason'
fatal: [node1]: FAILED! => changed=false
  msg: 'Failed to update apt cache: unknown reason'

NO MORE HOSTS LEFT *************************************************************************************************

PLAY RECAP *********************************************************************************************************
node1                      : ok=9    changed=2    unreachable=0    failed=1    skipped=17   rescued=0    ignored=0
node2                      : ok=9    changed=2    unreachable=0    failed=1    skipped=23   rescued=0    ignored=0

Sunday 05 September 2021  16:28:29 +0000 (0:03:03.812)       0:03:09.243 ******
===============================================================================
cluster/bootstrap-os : Update apt repository cache -------------------------------------------------------- 183.81s
Gather minimal facts ---------------------------------------------------------------------------------------- 1.11s
download : download | Download files / images --------------------------------------------------------------- 0.87s
cluster/bootstrap-os : Configure offline resources repository on apt package manager ------------------------ 0.61s
Gather necessary facts (hardware) --------------------------------------------------------------------------- 0.54s
Gather necessary facts (network) ---------------------------------------------------------------------------- 0.41s
cluster/bootstrap-os : Backup system default package manager repo file -------------------------------------- 0.27s
cluster/bootstrap-os : Create remote_tmp for it is used by another module ----------------------------------- 0.26s
download : prep_download | Create local cache for files and images on control node -------------------------- 0.13s
kubespray-defaults : Populates no_proxy to all hosts -------------------------------------------------------- 0.10s
cluster/bootstrap-os : gather os specific variables --------------------------------------------------------- 0.08s
cluster/bootstrap-os : include_tasks ------------------------------------------------------------------------ 0.06s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts --------------------------------------------- 0.06s
download : prep_download | Set image pull/info command for containerd and crio on localhost ----------------- 0.05s
container-engine/crictl : install crictĺ -------------------------------------------------------------------- 0.05s
download : prep_download | Set image pull/info command for docker on localhost ------------------------------ 0.05s
download : prep_download | Check that local user is in group or can become root ----------------------------- 0.05s
download : prep_download | Set a few facts ------------------------------------------------------------------ 0.05s
kubespray-defaults : Configure defaults --------------------------------------------------------------------- 0.05s
download : prep_download | Set image pull/info command for docker ------------------------------------------- 0.05s
✖ ######  01-cluster-bootstrap-os installation failed  ######
root@fredvb:~/kubeplay#
muzi502 commented 3 years ago

You may have downloaded the wrong installation package. Your OS is Ubuntu 18.04; is the package you downloaded also for 18.04?

divfor commented 3 years ago

Both are 18.04. I suspect iptables is not set up correctly: after nerdctl brings the containers up, iptables does not allow traffic to ports 8080/443.

divfor commented 3 years ago

After I manually added iptables -A FORWARD -p tcp --dport 8080 -j ACCEPT, the 'Failed to update apt cache: unknown reason' error went away.
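
Running iptables -A repeatedly appends a duplicate rule each time. Here is a minimal sketch of the same workaround that first checks for the rule with iptables -C and only appends it when it is absent. The function name allow_forward_port and the IPTABLES override variable are hypothetical and exist only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: append a FORWARD accept rule for a TCP port only if
# an identical rule is not already present (-C checks for an existing rule).
# IPTABLES may be overridden, e.g. for a dry run; it defaults to iptables.
allow_forward_port() {
  local port="$1"
  local ipt="${IPTABLES:-iptables}"
  if ! "$ipt" -C FORWARD -p tcp --dport "$port" -j ACCEPT 2>/dev/null; then
    "$ipt" -A FORWARD -p tcp --dport "$port" -j ACCEPT
  fi
}

# Usage (as root), for the two ports mentioned in this thread:
#   allow_forward_port 8080
#   allow_forward_port 443
```

Rules added this way do not survive a reboot unless they are saved separately.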

muzi502 commented 3 years ago

E: Failed to fetch file:/root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages  File not found - /root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages (2: No such file or directory)
E: Some index files failed to download. They have been ignored, or old ones used instead.

Run ls to check whether this directory exists. This error occurs when the downloaded installation package version does not match the OS 🤔.

divfor commented 3 years ago

That directory does not exist; there is only one gz file and two directories:

root@fredvb:~/kubeplay/resources/nginx/ubuntu/amd64/bionic# ls
archive.ubuntu.com  download.docker.com  Packages.gz

My installation package is kubeplay-v0.1.0-alpha.3-ubuntu-bionic-amd64.tar.gz, and all nodes run Ubuntu Server 18.04.5.

divfor commented 3 years ago

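About this local repo: I recall one of your documents mentioning that with FROM nginx:1.9.1 used directly, the two COPY --from [bionic|focal] /ubuntu /usr/share/nginx/html lines are wrong. Changing them to COPY --from [bionic|focal] /ubuntu /usr/share/nginx/html/ubuntu worked for me. For the case above, the path seems to be different again. That document also mentioned that type=tar can produce a tar archive for import, but the entrypoint is lost on import, so the built-in nginx does not start. To fix that, add the --change 'CMD /usr/sbin/nginx -g "daemon off;"' option when importing.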

divfor commented 3 years ago

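Found two more failure points:

  1. A node already had a newer version of containerd installed; the install reports that the allow-downgrade option was not given, gives up, and exits with an error;

  2. With the exact same kernel version, 4.15.0-154-generic #161-Ubuntu, some nodes have no bridge-nf-call-iptables entry, and the run exits with an error: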

    fatal: [node1]: FAILED! => changed=false
    msg: |-
    Failed to reload sysctl: net.ipv4.ip_forward = 1
    net.ipv4.ip_local_reserved_ports = 30000-32767
    sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory
    sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-ip6tables: No such file or directory
    changed: [node2]
muzi502 commented 3 years ago

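I test on virtual machines created from each Linux distribution's cloud-init images. A successful install cannot be guaranteed on machines that have been modified or that have conflicting packages installed.

bridge-nf-call-iptables is a kernel parameter that must be enabled. I recommend installing on fresh machines.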

divfor commented 3 years ago

modprobe br_netfilter solved this problem: https://blog.csdn.net/shida_csdn/article/details/99571884
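
modprobe loads the module only until the next reboot. On systemd-based systems, files under /etc/modules-load.d list modules to load at boot. Here is a minimal sketch of making the fix persistent that way. The function name persist_module and its optional directory argument are hypothetical, and whether kubeplay or kubespray already handles this elsewhere is not established in this thread.

```shell
#!/bin/bash
# Hypothetical helper: record a kernel module in a modules-load.d style
# directory so that it is loaded again at boot.
# The directory argument exists only to make the function easy to try out.
persist_module() {
  local module="$1"
  local dir="${2:-/etc/modules-load.d}"
  mkdir -p "$dir" && printf '%s\n' "$module" > "$dir/$module.conf"
}

# Usage (as root):
#   modprobe br_netfilter && persist_module br_netfilter
#   sysctl net.bridge.bridge-nf-call-iptables
```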

divfor commented 3 years ago
  1. install.sh remove does not clean up the offline source list:
    root@node2:~# ll /etc/apt/sources.list.d/offline-resources.list*
    -rw-r--r-- 1 root root 66 Sep  6 15:18 /etc/apt/sources.list.d/offline-resources.list
    -rw-r--r-- 1 root root 66 Sep  6 14:51 /etc/apt/sources.list.d/offline-resources.list.bak
    root@node2:~# apt update
    Err:1 http://192.168.100.25:8080/ubuntu/amd64 bionic/ InRelease
    Could not connect to 192.168.100.25:8080 (192.168.100.25). - connect (111: Connection refused)
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    All packages are up to date.
    W: Failed to fetch http://192.168.100.25:8080/ubuntu/amd64/bionic/InRelease  Could not connect to 192.168.100.25:8080 (192.168.100.25). - connect (111: Connection refused)
    W: Some index files failed to download. They have been ignored, or old ones used instead.
divfor commented 3 years ago
  1. Regarding the error about the missing Packages path, the actual directory layout is:
    
    root@fredvb:~/kubeplay/resources/nginx/ubuntu/amd64/bionic# tree -L 2
    .
    ├── archive.ubuntu.com
    │   └── ubuntu
    ├── download.docker.com
    │   └── linux
    └── Packages.gz

4 directories, 1 file

divfor commented 3 years ago

It finally succeeded once, after I removed cgroup v2 and rebooted.

===============================================================================
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates --------------------------------------------------------------------------- 4.58s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources -------------------------------------------------------------------------------------- 4.52s
download : download | Download files / images ---------------------------------------------------------------------------------------------------- 0.81s
Gather minimal facts ----------------------------------------------------------------------------------------------------------------------------- 0.65s
Gather necessary facts (hardware) ---------------------------------------------------------------------------------------------------------------- 0.60s
kubernetes-apps/ansible : Kubernetes Apps | Wait for kube-apiserver ------------------------------------------------------------------------------ 0.53s
Gather necessary facts (network) ----------------------------------------------------------------------------------------------------------------- 0.42s
kubernetes-apps/ansible : Kubernetes Apps | Delete kubeadm CoreDNS ------------------------------------------------------------------------------- 0.35s
kubernetes-apps/ansible : Kubernetes Apps | Register coredns deployment annotation `createdby` --------------------------------------------------- 0.31s
kubernetes-apps/ansible : Kubernetes Apps | Delete kubeadm Kube-DNS service ---------------------------------------------------------------------- 0.24s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down nodelocaldns Template ----------------------------------------------------------------------- 0.19s
kubernetes-apps/metallb : Kubernetes Apps | Install and configure MetalLB ------------------------------------------------------------------------ 0.18s
kubernetes-apps/metallb : Kubernetes Apps | Set apparmor_enabled --------------------------------------------------------------------------------- 0.14s
kubespray-defaults : Set no_proxy to all assigned cluster IPs and hostnames ---------------------------------------------------------------------- 0.14s
kubernetes-apps/external_cloud_controller/openstack : External OpenStack Cloud Controller | Generate Manifests ----------------------------------- 0.13s
kubernetes-apps/container_engine_accelerator/nvidia_gpu : Container Engine Acceleration Nvidia GPU | Create manifests for nvidia accelerators ---- 0.11s
kubernetes-apps/csi_driver/cinder : Cinder CSI Driver | Write cacert file ------------------------------------------------------------------------ 0.10s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts ---------------------------------------------------------------------------------- 0.10s
download : prep_download | On localhost, check if passwordless root is possible ------------------------------------------------------------------ 0.10s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down Secondary CoreDNS Template ------------------------------------------------------------------ 0.09s
✔ ######  05-cluster-apps successfully installed  ######
✔ ######  kubernetes cluster successfully installed  ######
divfor commented 3 years ago

This is what I still have to fix manually for now:

#!/bin/bash

# one shot
# iptables -A FORWARD -p tcp -m tcp --dport 443 -j ACCEPT
# iptables -A FORWARD -p tcp -m tcp --dport 8080 -j ACCEPT
# for i in nodes; do ssh $i modprobe br_netfilter; done

for h in x99u d9020 fredvb; do
  ssh $h 'rm -rf /etc/apt/sources.list.d/offline-resources.list*'
done

It is odd that no iptables rules are added to allow ports 8080 and 443 for the two containers that nerdctl starts.

muzi502 commented 3 years ago

This is what I still have to fix manually for now:

#!/bin/bash

# one shot
# iptables -A FORWARD -p tcp -m tcp --dport 443 -j ACCEPT
# iptables -A FORWARD -p tcp -m tcp --dport 8080 -j ACCEPT

for h in x99u d9020 fredvb; do
  ssh $h 'rm -rf /etc/apt/sources.list.d/offline-resources.list*'
done

This will be fixed later; these leftover files will be cleaned up on removal.