kubesphere / kubekey

Install Kubernetes/K3s only, or both Kubernetes/K3s and KubeSphere, together with related cloud-native add-ons. Supports all-in-one, multi-node, and HA installs 🔥 ⎈ 🐳
https://kubesphere.io
Apache License 2.0

Error when installing a new cluster: [ETCDConfigureModule] Health check on exist etcd #2316

Open · nejinn opened 1 month ago

nejinn commented 1 month ago

What version of KubeKey has the issue?

v3.0.13

What is your OS environment?

centos7

KubeKey config file

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: master1, address: 192.168.100.156, internalAddress: 10.75.142.156, user: , password: ""}
  - {name: master2, address: 192.168.100.157, internalAddress: 10.75.142.157, user: , password: ""}
  - {name: master3, address: 192.168.100.158, internalAddress: 10.75.142.158, user: , password: ""}
  - {name: node1, address: 192.168.100.159, internalAddress: 10.75.142.159, user: , password: ""}
  - {name: node2, address: 192.168.100.160, internalAddress: 10.75.142.160, user: , password: ""}
  - {name: node3, address: 192.168.100.161, internalAddress: 10.75.142.161, user: , password: ""}
  roleGroups:
    etcd:
    - master1
    - master2
    - master3
    control-plane: 
    - master1
    - master2
    - master3
    worker:
    - node1
    - node2
    - node3
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers 
    # internalLoadbalancer: haproxy

    domain: lb.kubesphere.local
    address: 192.168.100.173
    port: 6443
  kubernetes:
    version: v1.23.15
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: docker
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
    privateRegistry: ""
    namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: []
  addons: []

---
apiVersion: installer.kubesphere.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: ks-installer
  namespace: kubesphere-system
  labels:
    version: v3.4.1
spec:
  persistence:
    storageClass: ""
  authentication:
    jwtSecret: ""
  local_registry: ""
  # dev_tag: ""
  etcd:
    monitoring: false
    endpointIps: localhost
    port: 2379
    tlsEnable: true
  common:
    core:
      console:
        enableMultiLogin: true
        port: 30880
        type: NodePort
    # apiserver:
    #  resources: {}
    # controllerManager:
    #  resources: {}
    redis:
      enabled: false
      enableHA: false
      volumeSize: 2Gi
    openldap:
      enabled: false
      volumeSize: 2Gi
    minio:
      volumeSize: 20Gi
    monitoring:
      # type: external
      endpoint: http://prometheus-operated.kubesphere-monitoring-system.svc:9090
      GPUMonitoring:
        enabled: false
    gpu:
      kinds:
      - resourceName: "nvidia.com/gpu"
        resourceType: "GPU"
        default: true
    es:
      # master:
      #   volumeSize: 4Gi
      #   replicas: 1
      #   resources: {}
      # data:
      #   volumeSize: 20Gi
      #   replicas: 1
      #   resources: {}
      enabled: false
      logMaxAge: 7
      elkPrefix: logstash
      basicAuth:
        enabled: false
        username: ""
        password: ""
      externalElasticsearchHost: ""
      externalElasticsearchPort: ""
    opensearch:
      # master:
      #   volumeSize: 4Gi
      #   replicas: 1
      #   resources: {}
      # data:
      #   volumeSize: 20Gi
      #   replicas: 1
      #   resources: {}
      enabled: true
      logMaxAge: 7
      opensearchPrefix: whizard
      basicAuth:
        enabled: true
        username: "admin"
        password: "admin"
      externalOpensearchHost: ""
      externalOpensearchPort: ""
      dashboard:
        enabled: false
  alerting:
    enabled: false
    # thanosruler:
    #   replicas: 1
    #   resources: {}
  auditing:
    enabled: false
    # operator:
    #   resources: {}
    # webhook:
    #   resources: {}
  devops:
    enabled: false
    jenkinsCpuReq: 0.5
    jenkinsCpuLim: 1
    jenkinsMemoryReq: 4Gi
    jenkinsMemoryLim: 4Gi
    jenkinsVolumeSize: 16Gi
  events:
    enabled: false
    # operator:
    #   resources: {}
    # exporter:
    #   resources: {}
    ruler:
      enabled: true
      replicas: 2
    #   resources: {}
  logging:
    enabled: false
    logsidecar:
      enabled: true
      replicas: 2
      # resources: {}
  metrics_server:
    enabled: false
  monitoring:
    storageClass: ""
    node_exporter:
      port: 9100
      # resources: {}
    # kube_rbac_proxy:
    #   resources: {}
    # kube_state_metrics:
    #   resources: {}
    # prometheus:
    #   replicas: 1
    #   volumeSize: 20Gi
    #   resources: {}
    #   operator:
    #     resources: {}
    # alertmanager:
    #   replicas: 1
    #   resources: {}
    # notification_manager:
    #   resources: {}
    #   operator:
    #     resources: {}
    #   proxy:
    #     resources: {}
    gpu:
      nvidia_dcgm_exporter:
        enabled: false
        # resources: {}
  multicluster:
    clusterRole: none
  network:
    networkpolicy:
      enabled: false
    ippool:
      type: none
    topology:
      type: none
  openpitrix:
    store:
      enabled: false
  servicemesh:
    enabled: false
    istio:
      components:
        ingressGateways:
        - name: istio-ingressgateway
          enabled: false
        cni:
          enabled: false
  edgeruntime:
    enabled: false
    kubeedge:
      enabled: false
      cloudCore:
        cloudHub:
          advertiseAddress:
            - ""
        service:
          cloudhubNodePort: "30000"
          cloudhubQuicNodePort: "30001"
          cloudhubHttpsNodePort: "30002"
          cloudstreamNodePort: "30003"
          tunnelNodePort: "30004"
        # resources: {}
        # hostNetWork: false
      iptables-manager:
        enabled: true
        mode: "external"
        # resources: {}
      # edgeService:
      #   resources: {}
  gatekeeper:
    enabled: false
    # controller_manager:
    #   resources: {}
    # audit:
    #   resources: {}
  terminal:
    timeout: 600

A clear and concise description of what happened.

[ETCDConfigureModule] Health check on exist etcd

Relevant log output

08:50:56 CST [InstallETCDBinaryModule] Generate etcd service
08:50:56 CST success: [master1]
08:50:56 CST success: [master3]
08:50:56 CST success: [master2]
08:50:56 CST [InstallETCDBinaryModule] Generate access address
08:50:56 CST skipped: [master3]
08:50:56 CST skipped: [master1]
08:50:56 CST success: [master2]
08:50:56 CST [ETCDConfigureModule] Health check on exist etcd
08:50:56 CST message: [master3]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master3.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master3-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://10.75.142.156:2379,https://10.75.142.157:2379,https://10.75.142.158:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.75.142.156:2379: connect: connection refused
; error #1: dial tcp 10.75.142.157:2379: connect: connection refused
; error #2: dial tcp 10.75.142.158:2379: connect: connection refused

error #0: dial tcp 10.75.142.156:2379: connect: connection refused
error #1: dial tcp 10.75.142.157:2379: connect: connection refused
error #2: dial tcp 10.75.142.158:2379: connect: connection refused: Process exited with status 1
08:50:56 CST retry: [master3]
08:50:56 CST message: [master2]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master2.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master2-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://10.75.142.156:2379,https://10.75.142.157:2379,https://10.75.142.158:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.75.142.158:2379: connect: connection refused
; error #1: dial tcp 10.75.142.156:2379: connect: connection refused
; error #2: dial tcp 10.75.142.157:2379: connect: connection refused

error #0: dial tcp 10.75.142.158:2379: connect: connection refused
error #1: dial tcp 10.75.142.156:2379: connect: connection refused
error #2: dial tcp 10.75.142.157:2379: connect: connection refused: Process exited with status 1
08:50:56 CST retry: [master2]
08:51:02 CST message: [master3]
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-master3.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-master3-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://10.75.142.156:2379,https://10.75.142.157:2379,https://10.75.142.158:2379 cluster-health | grep -q 'cluster is healthy'" 
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.75.142.158:2379: connect: connection refused
; error #1: dial tcp 10.75.142.157:2379: connect: connection refused
; error #2: dial tcp 10.75.142.156:2379: connect: connection refused
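
For reference, the failing check can be reproduced by hand on any etcd node. This is a minimal sketch lifted from the exact command in the log above (v2 etcdctl with the per-node admin certs; endpoints and paths are the ones from this issue). "connection refused" on all three endpoints suggests no etcd process is listening at all, i.e. the service never started, rather than a TLS or certificate problem.

# Manual reproduction of KubeKey's etcd health check.
# Assumes the machine hostname matches the node name in the config (master1..3).
export ETCDCTL_API=2
export ETCDCTL_CERT_FILE="/etc/ssl/etcd/ssl/admin-$(hostname).pem"
export ETCDCTL_KEY_FILE="/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem"
export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem'
/usr/local/bin/etcdctl \
  --endpoints=https://10.75.142.156:2379,https://10.75.142.157:2379,https://10.75.142.158:2379 \
  cluster-health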

I took a look at the source code and traced it to ETCDConfigureModule — it should be this part (highlighted in the screenshot attached to the original comment):

func (e *ConfigureModule) Init() {
    e.Name = "ETCDConfigureModule"
    e.Desc = "Configure ETCD cluster"

    if v, ok := e.PipelineCache.Get(common.ETCDCluster); ok {
        cluster := v.(*EtcdCluster)
        if !cluster.clusterExist {
            e.Tasks = handleNewCluster(e)
        } else {
            e.Tasks = handleExistCluster(e)
        }
    }
}

Since this is a brand-new cluster, we can assume the code path goes through handleNewCluster.

The source of handleNewCluster is as follows:

func handleNewCluster(c *ConfigureModule) []task.Interface {

    existETCDHealthCheck := &task.RemoteTask{
        Name:     "ExistETCDHealthCheck",
        Desc:     "Health check on exist etcd",
        Hosts:    c.Runtime.GetHostsByRole(common.ETCD),
        Prepare:  new(NodeETCDExist),
        Action:   new(HealthCheck),
        Parallel: true,
        Retry:    20,
    }

    generateETCDConfig := &task.RemoteTask{
        Name:     "GenerateETCDConfig",
        Desc:     "Generate etcd.env config on new etcd",
        Hosts:    c.Runtime.GetHostsByRole(common.ETCD),
        Prepare:  &NodeETCDExist{Not: true},
        Action:   new(GenerateConfig),
        Parallel: false,
    }

    allRefreshETCDConfig := &task.RemoteTask{
        Name:     "AllRefreshETCDConfig",
        Desc:     "Refresh etcd.env config on all etcd",
        Hosts:    c.Runtime.GetHostsByRole(common.ETCD),
        Action:   new(RefreshConfig),
        Parallel: false,
    }

    restart := &task.RemoteTask{
        Name:     "RestartETCD",
        Desc:     "Restart etcd",
        Hosts:    c.Runtime.GetHostsByRole(common.ETCD),
        Prepare:  &NodeETCDExist{Not: true},
        Action:   new(RestartETCD),
        Parallel: true,
    }

    allETCDNodeHealthCheck := &task.RemoteTask{
        Name:     "AllETCDNodeHealthCheck",
        Desc:     "Health check on all etcd",
        Hosts:    c.Runtime.GetHostsByRole(common.ETCD),
        Action:   new(HealthCheck),
        Parallel: true,
        Retry:    20,
    }

    refreshETCDConfigToExist := &task.RemoteTask{
        Name:     "RefreshETCDConfigToExist",
        Desc:     "Refresh etcd.env config to exist mode on all etcd",
        Hosts:    c.Runtime.GetHostsByRole(common.ETCD),
        Action:   &RefreshConfig{ToExisting: true},
        Parallel: false,
    }

    tasks := []task.Interface{
        existETCDHealthCheck,     // gated by Prepare: NodeETCDExist — runs where etcd already exists
        generateETCDConfig,       // gated by NodeETCDExist{Not: true} — new members only
        allRefreshETCDConfig,
        restart,                  // start etcd on the new members
        allETCDNodeHealthCheck,
        refreshETCDConfigToExist, // RefreshConfig{ToExisting: true}: rewrite etcd.env to "exist" mode
        allETCDNodeHealthCheck,   // check again after the refresh
    }
    return tasks
}

By rights, existETCDHealthCheck should be placed after restart in this handleNewCluster, shouldn't it? When creating a new cluster, InstallETCDBinaryModule runs before ETCDConfigureModule, which means etcd has been installed — but InstallETCDBinaryModule does not generate etcd.env. So the execution order in handleNewCluster (i.e. in ConfigureModule) should instead be:

    tasks := []task.Interface{
        generateETCDConfig,
        allRefreshETCDConfig,
        restart,
        existETCDHealthCheck,
        allETCDNodeHealthCheck,
        refreshETCDConfigToExist,
        allETCDNodeHealthCheck,
    }

In fact, for a brand-new cluster, is existETCDHealthCheck even necessary at all?

Then I took a look at etcd.service:

[Unit]
Description=etcd
After=network.target

[Service]
User=root
Type=notify
EnvironmentFile=/etc/etcd.env
ExecStart=/usr/local/bin/etcd
NotifyAccess=all
RestartSec=10s
LimitNOFILE=40000
Restart=always

[Install]
WantedBy=multi-user.target
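
One aside on the unit file: systemd treats an EnvironmentFile= path without a leading '-' as mandatory, so if /etc/etcd.env is missing the unit fails before ExecStart ever runs — which is exactly the "Failed to load environment files" error shown further down. For comparison:

EnvironmentFile=/etc/etcd.env    # current unit: start fails when the file is missing
EnvironmentFile=-/etc/etcd.env   # '-' prefix: a missing file is silently ignored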

The config file lives at /etc/etcd.env.

I checked, and the file does not exist. With the config file missing, etcd cannot start after InstallETCDBinaryModule installs it, so existETCDHealthCheck in ETCDConfigureModule is bound to fail, because etcd was never running. Here is the etcd service status on the current node:

● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Jul 11 23:06:30 master1 systemd[1]: Unit etcd.service entered failed state.
Jul 11 23:06:30 master1 systemd[1]: etcd.service failed.
Jul 11 23:06:40 master1 systemd[1]: etcd.service holdoff time over, scheduling restart.
Jul 11 23:06:40 master1 systemd[1]: Stopped etcd.
Jul 11 23:06:40 master1 systemd[1]: Failed to load environment files: No such file or directory
Jul 11 23:06:40 master1 systemd[1]: etcd.service failed to run 'start' task: No such file or directory
Jul 11 23:06:40 master1 systemd[1]: Failed to start etcd.
Jul 11 23:06:40 master1 systemd[1]: Unit etcd.service entered failed state.
Jul 11 23:06:40 master1 systemd[1]: etcd.service failed.
Jul 11 23:06:41 master1 systemd[1]: Stopped etcd.

So here is the problem: it occurs because the etcd health check is performed before etcd.env has ever been generated.
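
A quick way to confirm this on any etcd node (a minimal sketch; the paths are the ones used throughout this issue):

test -f /etc/etcd.env && echo "etcd.env present" || echo "etcd.env missing"
systemctl is-active etcd   # reports "inactive" here, matching the status output above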

Could an urgent release be published to fix this issue?

nejinn commented 1 month ago

After thinking it over, I decided to add the etcd config file myself. I took a look at the template KubeKey uses to generate the etcd config file, shown below:

/*
 Copyright 2021 The KubeSphere Authors.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

package templates

import (
    "text/template"

    "github.com/lithammer/dedent"
)

// EtcdEnv defines the template of etcd's env.
var EtcdEnv = template.Must(template.New("etcd.env").Parse(
    dedent.Dedent(`# Environment file for etcd {{ .Tag }}
{{- if .DataDir }}
ETCD_DATA_DIR={{ .DataDir }}
{{- else }}
ETCD_DATA_DIR=/var/lib/etcd
{{- end }}
ETCD_ADVERTISE_CLIENT_URLS=https://{{ .Ip }}:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://{{ .Ip }}:2380
ETCD_INITIAL_CLUSTER_STATE={{ .State }}
ETCD_METRICS=basic
ETCD_LISTEN_CLIENT_URLS=https://{{ .Ip }}:2379,https://127.0.0.1:2379
ETCD_INITIAL_CLUSTER_TOKEN=k8s_etcd
ETCD_LISTEN_PEER_URLS=https://{{ .Ip }}:2380
ETCD_NAME={{ .Name }}
ETCD_PROXY=off
ETCD_ENABLE_V2=true
ETCD_INITIAL_CLUSTER={{ .PeerAddresses }}
{{- if .ElectionTimeout }}
ETCD_ELECTION_TIMEOUT={{ .ElectionTimeout }}
{{- else }}
ETCD_ELECTION_TIMEOUT=5000
{{- end }}
{{- if .HeartbeatInterval }}
ETCD_HEARTBEAT_INTERVAL={{ .HeartbeatInterval }}
{{- else }}
ETCD_HEARTBEAT_INTERVAL=250
{{- end }}
{{- if .CompactionRetention  }}
ETCD_AUTO_COMPACTION_RETENTION={{ .CompactionRetention }}
{{- else }}
ETCD_AUTO_COMPACTION_RETENTION=8
{{- end }}
{{- if .SnapshotCount }}
ETCD_SNAPSHOT_COUNT={{ .SnapshotCount }}
{{- else }}
ETCD_SNAPSHOT_COUNT=10000
{{- end }}
{{- if .Metrics }}
ETCD_METRICS={{ .Metrics }}
{{- end }}
{{- if .QuotaBackendBytes }}
ETCD_QUOTA_BACKEND_BYTES={{ .QuotaBackendBytes }}
{{- end }}
{{- if .MaxRequestBytes }}
ETCD_MAX_REQUEST_BYTES={{ .MaxRequestBytes }}
{{- end }}
{{- if .MaxSnapshots }}
ETCD_MAX_SNAPSHOTS={{ .MaxSnapshots }}
{{- end }}
{{- if .MaxWals }}
ETCD_MAX_WALS={{ .MaxWals }}
{{- end }}
{{- if .LogLevel }}
ETCD_LOG_LEVEL={{ .LogLevel }}
{{- end }}
{{- if .UnsupportedArch }}
ETCD_UNSUPPORTED_ARCH={{ .Arch }}
{{ end }}

# TLS settings
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-{{ .Hostname }}.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-{{ .Hostname }}-key.pem
ETCD_CLIENT_CERT_AUTH=true

ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-{{ .Hostname }}.pem
ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-{{ .Hostname }}-key.pem
ETCD_PEER_CLIENT_CERT_AUTH=true

# CLI settings
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-{{ .Hostname }}-key.pem
ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-{{ .Hostname }}.pem
    `)))

Some of the variable names here still need careful thought — I'm worried about filling them in wrong.

Could anyone help fill in these variables?
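
To make the variables concrete, here is what a rendered /etc/etcd.env for master1 could look like, using the hosts from the config at the top of this issue. This is a hedged sketch, not KubeKey's actual output: .State is assumed to be "new" for a fresh cluster, the member names follow the ETCD_NAME=etcd-master1 convention mentioned in the follow-up comment, the version tag is illustrative, and only defaulted/required fields are shown.

# Environment file for etcd (version tag illustrative)
ETCD_DATA_DIR=/var/lib/etcd
ETCD_ADVERTISE_CLIENT_URLS=https://10.75.142.156:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.75.142.156:2380
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_METRICS=basic
ETCD_LISTEN_CLIENT_URLS=https://10.75.142.156:2379,https://127.0.0.1:2379
ETCD_INITIAL_CLUSTER_TOKEN=k8s_etcd
ETCD_LISTEN_PEER_URLS=https://10.75.142.156:2380
ETCD_NAME=etcd-master1
ETCD_PROXY=off
ETCD_ENABLE_V2=true
ETCD_INITIAL_CLUSTER=etcd-master1=https://10.75.142.156:2380,etcd-master2=https://10.75.142.157:2380,etcd-master3=https://10.75.142.158:2380
ETCD_ELECTION_TIMEOUT=5000
ETCD_HEARTBEAT_INTERVAL=250
ETCD_AUTO_COMPACTION_RETENTION=8
ETCD_SNAPSHOT_COUNT=10000

# TLS settings
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-master1.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-master1-key.pem
ETCD_CLIENT_CERT_AUTH=true

ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-master1.pem
ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-master1-key.pem
ETCD_PEER_CLIENT_CERT_AUTH=true

# CLI settings
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-master1-key.pem
ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-master1.pem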

nejinn commented 1 month ago

I have solved this problem. On each etcd node, add an empty config file containing just the etcd name, typically ETCD_NAME=etcd-master1.

Then run the install command. It will still fail at this point, but if you open the etcd config file on the other nodes, you will see a complete config there. Copy it to master1 and adjust it, start etcd on every node, then run the install command again. Solved.
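
Written out as a rough shell sketch of the steps above — root SSH access, the host names, and the config filename are assumptions taken from this issue, and the kk invocation may differ in your setup:

# 1. Seed a minimal env file on every etcd node so the unit has something to load.
for h in master1 master2 master3; do
  ssh root@$h "echo ETCD_NAME=etcd-$h > /etc/etcd.env"
done
# 2. Re-run the installer; it still fails, but now writes a complete
#    /etc/etcd.env on the other nodes.
./kk create cluster -f config-sample.yaml
# 3. Copy a complete /etc/etcd.env to any node still missing one, fixing
#    ETCD_NAME and the node-specific IPs (see the reply below), then start
#    etcd everywhere and re-run the installer.
for h in master1 master2 master3; do
  ssh root@$h "systemctl daemon-reload && systemctl restart etcd"
done
./kk create cluster -f config-sample.yaml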

ZoroLH commented 1 month ago

When you say "copy it to master1 and adjust it" — what exactly do you adjust?

nejinn commented 1 month ago

Just change the ETCD_NAME and the corresponding IP addresses.
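
Concretely, per the template earlier in the thread, these are the node-specific lines in /etc/etcd.env that must change when copying the file between nodes (master1 values shown as an illustration of the pattern):

ETCD_NAME=etcd-master1                                         # per-node name
ETCD_ADVERTISE_CLIENT_URLS=https://10.75.142.156:2379          # this node's IP
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.75.142.156:2380
ETCD_LISTEN_CLIENT_URLS=https://10.75.142.156:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://10.75.142.156:2380
# ...plus every member-<hostname> and admin-<hostname> certificate path.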

techMT2024 commented 1 month ago

Thanks — installing with the latest pre-release, 3.1.2, worked without any problems.