hashicorp / consul-helm

Helm chart to install Consul and other associated components.
Mozilla Public License 2.0

consul server status keep pending #801

Closed. somethingwentwell closed this issue 3 years ago.

somethingwentwell commented 3 years ago


Overview of the Issue

I tried to deploy Consul with Helm on Azure VMs (one Ubuntu 18.04 master and one Ubuntu 18.04 worker), but the Consul server pod stays in Pending status, which leaves the Consul client agent unable to connect to a server.

Reproduction Steps

  1. When running helm install with the following values.yml (an example install command is sketched after this list):

     global:
       domain: consul
       datacenter: dc1
     ui:
       service:
         type: 'NodePort'
     server:
       replicas: 1
       bootstrapExpect: 1
       affinity: ""
  2. View error
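
For context, the install itself was presumably run along these lines (release name consul, chart from the HashiCorp Helm repository; the exact chart version isn't stated in the issue):

```
# Hypothetical reconstruction of the install command; values.yml is the file shown in step 1.
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install consul hashicorp/consul -f values.yml
```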

Logs

Consul agent logs:

```
==> Starting Consul agent...
    Version: '1.9.2'
    Node ID: 'd0d09346-509c-9ed9-c647-2c7e89ea30ba'
    Node name: 'k8s-node-01'
    Datacenter: 'dc1' (Segment: '')
    Server: false (Bootstrap: false)
    Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
    Cluster Addr: 10.44.0.1 (LAN: 8301, WAN: 8302)
    Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

2021-02-02T09:47:26.453Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: k8s-node-01 10.44.0.1
2021-02-02T09:47:26.454Z [INFO] agent.router: Initializing LAN area manager
2021-02-02T09:47:26.454Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=udp
2021-02-02T09:47:26.454Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=tcp
2021-02-02T09:47:26.454Z [INFO] agent: Starting server: address=[::]:8500 network=tcp protocol=http
2021-02-02T09:47:26.455Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-02-02T09:47:26.455Z [INFO] agent: started state syncer
==> Consul agent running!
2021-02-02T09:47:26.455Z [INFO] agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2021-02-02T09:47:26.455Z [INFO] agent: Joining cluster...: cluster=LAN
2021-02-02T09:47:26.455Z [INFO] agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.default.svc:8301]
2021-02-02T09:47:26.456Z [INFO] agent: Started gRPC server: address=[::]:8502 network=tcp
2021-02-02T09:47:26.456Z [WARN] agent.router.manager: No servers available
2021-02-02T09:47:26.456Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2021-02-02T09:47:26.695Z [WARN] agent.client.memberlist.lan: memberlist: Failed to resolve consul-consul-server-0.consul-consul-server.default.svc:8301: lookup consul-consul-server-0.consul-consul-server.default.svc on 10.96.0.10:53: no such host
2021-02-02T09:47:26.695Z [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="1 error occurred: * Failed to resolve consul-consul-server-0.consul-consul-server.default.svc:8301: lookup consul-consul-server-0.consul-consul-server.default.svc on 10.96.0.10:53: no such host"
2021-02-02T09:47:26.695Z [WARN] agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=
2021-02-02T09:47:30.675Z [INFO] agent: Newer Consul version available: new_version=1.9.3 current_version=1.9.2
2021-02-02T09:47:32.722Z [WARN] agent.router.manager: No servers available
2021-02-02T09:47:32.722Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:33602 error="No known Consul servers"
2021-02-02T09:47:42.718Z [WARN] agent.router.manager: No servers available
2021-02-02T09:47:42.718Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:33632 error="No known Consul servers"
2021-02-02T09:47:47.570Z [WARN] agent.router.manager: No servers available
2021-02-02T09:47:47.570Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2021-02-02T09:47:52.715Z [WARN] agent.router.manager: No servers available
2021-02-02T09:47:52.715Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:33644 error="No known Consul servers"
2021-02-02T09:47:56.695Z [INFO] agent: (LAN) joining: lan_addresses=[consul-consul-server-0.consul-consul-server.default.svc:8301]
2021-02-02T09:47:56.878Z [WARN] agent.client.memberlist.lan: memberlist: Failed to resolve consul-consul-server-0.consul-consul-server.default.svc:8301: lookup consul-consul-server-0.consul-consul-server.default.svc on 10.96.0.10:53: no such host
2021-02-02T09:47:56.878Z [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="1 error occurred: * Failed to resolve consul-consul-server-0.consul-consul-server.default.svc:8301: lookup consul-consul-server-0.consul-consul-server.default.svc on 10.96.0.10:53: no such host"
2021-02-02T09:47:56.878Z [WARN] agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=
2021-02-02T09:48:02.714Z [WARN] agent.router.manager: No servers available
2021-02-02T09:48:02.714Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:33666 error="No known Consul servers"
2021-02-02T09:48:12.726Z [WARN] agent.router.manager: No servers available
2021-02-02T09:48:12.726Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:33688 error="No known Consul servers"
2021-02-02T09:48:17.041Z [WARN] agent.router.manager: No servers available
2021-02-02T09:48:17.041Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
```

Expected behavior

1 consul server and 1 consul agent running

Environment details

Self-hosted Kubernetes v1.20.2 on Azure VMs (one Ubuntu 18.04 master, one Ubuntu 18.04 worker), consul-helm chart with Consul 1.9.2.


ndhanushkodi commented 3 years ago

Hi @somethingwentwell! It's possible that your pods are running into CPU limitations. Do you have at least 4 vCPUs and 4 GB RAM per node?

somethingwentwell commented 3 years ago

> Hi @somethingwentwell! It's possible that your pods are running into CPU limitations. Do you have at least 4 vCPUs and 4 GB RAM per node?

Hi @ndhanushkodi, thanks for your reply. I just changed my master node from D2s_v3 (2 cores, 8 GB memory) to D4s_v3 (4 cores, 16 GB memory), but it still doesn't seem to work.

ndhanushkodi commented 3 years ago

@somethingwentwell can you show the output of kubectl describe pod consul-consul-server-0? The events section in the output may tell us why.

somethingwentwell commented 3 years ago

> @somethingwentwell can you show the output of kubectl describe pod consul-consul-server-0? The events section in the output may tell us why.

Here's the output, many thanks @ndhanushkodi

NAME                     READY   STATUS    RESTARTS   AGE
consul-consul-server-0   0/1     Pending   0          5s
consul-consul-tjbx5      0/1     Running   0          5s
warren@k8s-master:~$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
consul-consul-server-0   0/1     Pending   0          32m
consul-consul-tjbx5      0/1     Running   0          32m
warren@k8s-master:~$ kubectl describe pod consul-consul-server-0
Name:           consul-consul-server-0
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=consul
                chart=consul-helm
                component=server
                controller-revision-hash=consul-consul-server-5687df56c6
                hasDNS=true
                release=consul
                statefulset.kubernetes.io/pod-name=consul-consul-server-0
Annotations:    consul.hashicorp.com/config-checksum: c7221aa8a68874d8bd000fb6efaf58f50108454b578402f0717aa4538c04ea5c
                consul.hashicorp.com/connect-inject: false
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/consul-consul-server
Containers:
  consul:
    Image:       hashicorp/consul:1.9.2
    Ports:       8500/TCP, 8301/TCP, 8301/UDP, 8302/TCP, 8300/TCP, 8600/TCP, 8600/UDP
    Host Ports:  0/TCP, 0/TCP, 0/UDP, 0/TCP, 0/TCP, 0/TCP, 0/UDP
    Command:
      /bin/sh
      -ec
      CONSUL_FULLNAME="consul-consul"

      exec /bin/consul agent \
        -advertise="${ADVERTISE_IP}" \
        -bind=0.0.0.0 \
        -bootstrap-expect=1 \
        -client=0.0.0.0 \
        -config-dir=/consul/config \
        -datacenter=dc1 \
        -data-dir=/consul/data \
        -domain=consul \
        -hcl="connect { enabled = true }" \
        -ui \
        -retry-join="${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc:8301" \
        -serf-lan-port=8301 \
        -server

    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:      100m
      memory:   100Mi
    Readiness:  exec [/bin/sh -ec curl http://127.0.0.1:8500/v1/status/leader \
2>/dev/null | grep -E '".+"'
] delay=5s timeout=5s period=3s #success=1 #failure=2
    Environment:
      ADVERTISE_IP:   (v1:status.podIP)
      POD_IP:         (v1:status.podIP)
      NAMESPACE:     default (v1:metadata.namespace)
    Mounts:
      /consul/config from config (rw)
      /consul/data from data-default (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-consul-server-token-qzhk4 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  data-default:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-default-consul-consul-server-0
    ReadOnly:   false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      consul-consul-server-config
    Optional:  false
  consul-consul-server-token-qzhk4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  consul-consul-server-token-qzhk4
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  36m   default-scheduler  0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  36m   default-scheduler  0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims.

ndhanushkodi commented 3 years ago

@somethingwentwell This line: Warning FailedScheduling 36m default-scheduler 0/2 nodes are available tells me there may be some issue with the nodes. If you deploy any other Pod on this cluster, does it succeed? And does kubectl get nodes show all the nodes as Ready?

somethingwentwell commented 3 years ago

> @somethingwentwell This line: Warning FailedScheduling 36m default-scheduler 0/2 nodes are available tells me there may be some issue with the nodes. If you deploy any other Pod on this cluster, does it succeed? And does kubectl get nodes show all the nodes as Ready?

@ndhanushkodi yes, the nodes are ready and the deployment of nginx is successful.

warren@k8s-master:~$ kubectl get node
NAME          STATUS   ROLES                  AGE   VERSION
k8s-master    Ready    control-plane,master   18h   v1.20.2
k8s-node-01   Ready    <none>                 17h   v1.20.2
warren@k8s-master:~$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
consul-consul-server-0   0/1     Pending   0          44m
consul-consul-tjbx5      0/1     Running   0          44m
warren@k8s-master:~$ kubectl apply -f nginx-deployment.yml
deployment.apps/nginx-deployment created
warren@k8s-master:~$ kubectl  get po
NAME                                READY   STATUS    RESTARTS   AGE
consul-consul-server-0              0/1     Pending   0          45m
consul-consul-tjbx5                 0/1     Running   0          45m
nginx-deployment-66b6c48dd5-bkvbz   1/1     Running   0          7s
nginx-deployment-66b6c48dd5-fkqn6   1/1     Running   0          7s
nginx-deployment-66b6c48dd5-srk6x   1/1     Running   0          7s

ndhanushkodi commented 3 years ago

Oh, I didn't catch this earlier, but the server events say "pod has unbound immediate PersistentVolumeClaims." It's possible the issue is with the PVCs. Can you provide the output of kubectl get pvc and kubectl describe pvc?

whiskeysierra commented 3 years ago

I saw this particular error when I tried to run a server deployment on a Kubernetes cluster with just two nodes. The scheduler couldn't satisfy the anti-affinity constraint, but the reported error was an unbound PVC.
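
For reference, the rule that `affinity: ""` removes is the chart's default server anti-affinity, which looks roughly like this in consul-helm's values (an approximation; check the values.yaml of your chart version):

```
# Approximate default for server.affinity in consul-helm:
# require each server pod to land on a different node.
affinity: |
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: {{ template "consul.name" . }}
            release: "{{ .Release.Name }}"
            component: server
        topologyKey: kubernetes.io/hostname
```

On a two-node cluster with the default three server replicas, this rule alone can leave servers Pending, even though the scheduler event that surfaces may be about unbound PVCs.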

somethingwentwell commented 3 years ago

> kubectl describe pvc

Really appreciate your help; here's the output.

warren@k8s-master:~$ kubectl get pvc
NAME                                     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-consul-0                            Pending                                                                                             42h
data-consul-1                            Pending                                                                                             42h
data-consul-2                            Pending                                                                                             42h
data-default-consul-consul-server-0      Pending                                                                                             42h
data-default-consul-consul-server-1      Pending                                                                                             42h
data-default-consul-consul-server-2      Pending                                                                                             42h
data-default-consul-server-0             Pending                                                                                             42h
data-default-hashicorp-consul-server-0   Pending                                                                                             42h
data-default-hashicorp-consul-server-1   Pending                                                                                             42h
data-default-hashicorp-consul-server-2   Pending                                                                                             42h
warren@k8s-master:~$ kubectl describe pvc
Name:          data-consul-0
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app.kubernetes.io/instance=consul
               app.kubernetes.io/name=consul
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2821 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-consul-1
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app.kubernetes.io/instance=consul
               app.kubernetes.io/name=consul
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2821 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-consul-2
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app.kubernetes.io/instance=consul
               app.kubernetes.io/name=consul
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2821 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-default-consul-consul-server-0
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=consul
               chart=consul-helm
               component=server
               hasDNS=true
               release=consul
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       consul-consul-server-0
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2822 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-default-consul-consul-server-1
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=consul
               chart=consul-helm
               component=server
               hasDNS=true
               release=consul
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2821 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-default-consul-consul-server-2
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=consul
               chart=consul-helm
               component=server
               hasDNS=true
               release=consul
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2822 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-default-consul-server-0
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=consul
               chart=consul-helm
               component=server
               hasDNS=true
               release=hashicorp
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2821 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-default-hashicorp-consul-server-0
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=consul
               chart=consul-helm
               component=server
               hasDNS=true
               release=hashicorp
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2822 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-default-hashicorp-consul-server-1
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=consul
               chart=consul-helm
               component=server
               hasDNS=true
               release=hashicorp
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2821 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

Name:          data-default-hashicorp-consul-server-2
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=consul
               chart=consul-helm
               component=server
               hasDNS=true
               release=hashicorp
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  4s (x2821 over 11h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
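
The FailedBinding events above mean the cluster has neither a default StorageClass for dynamic provisioning nor any pre-created PersistentVolume matching these claims. One quick way to confirm:

```
kubectl get storageclass
# If no entry is marked "(default)", PVCs that don't set storageClassName will
# stay Pending until matching PersistentVolumes are created manually.
```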

somethingwentwell commented 3 years ago

New update: I just found that I need to pre-create the PV/PVC on self-hosted Kubernetes (https://www.consul.io/docs/k8s/installation/platforms/self-hosted-kubernetes). I followed that and created a PV and PVC for data-default-consul-consul-server-0, then reinstalled Consul using Helm. This time I get CrashLoopBackOff instead of Pending. Here are the outputs.

PV & PVC:

warren@k8s-master:~$ kubectl get pv
NAME                                  CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                         STORAGECLASS   REASON   AGE
data-default-consul-consul-server-0   10Gi       RWO            Retain           Bound    default/data-default-consul-consul-server-0   manual                  8m55s
warren@k8s-master:~$ kubectl get pvc
NAME                                  STATUS   VOLUME                                CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-default-consul-consul-server-0   Bound    data-default-consul-consul-server-0   10Gi       RWO            manual         8m55s

kubectl get po:

warren@k8s-master:~$ kubectl get po
NAME                                READY   STATUS             RESTARTS   AGE
consul-consul-8lnvv                 0/1     Running            0          8m5s
consul-consul-server-0              0/1     CrashLoopBackOff   6          8m5s
nginx-deployment-66b6c48dd5-bkvbz   1/1     Running            0          25h
nginx-deployment-66b6c48dd5-fkqn6   1/1     Running            0          25h
nginx-deployment-66b6c48dd5-srk6x   1/1     Running            0          25h

kubectl describe po:

warren@k8s-master:~$ kubectl describe po consul-consul-server-0
Name:         consul-consul-server-0
Namespace:    default
Priority:     0
Node:         k8s-node-01/172.16.1.5
Start Time:   Thu, 04 Feb 2021 04:33:25 +0000
Labels:       app=consul
              chart=consul-helm
              component=server
              controller-revision-hash=consul-consul-server-5687df56c6
              hasDNS=true
              release=consul
              statefulset.kubernetes.io/pod-name=consul-consul-server-0
Annotations:  consul.hashicorp.com/config-checksum: c7221aa8a68874d8bd000fb6efaf58f50108454b578402f0717aa4538c04ea5c
              consul.hashicorp.com/connect-inject: false
Status:       Running
IP:           10.44.0.1
IPs:
  IP:           10.44.0.1
Controlled By:  StatefulSet/consul-consul-server
Containers:
  consul:
    Container ID:  docker://152862e43bb226c3321b9af557aa6b5bb4cded7ef4d3112e9885cc817cb48997
    Image:         hashicorp/consul:1.9.2
    Image ID:      docker-pullable://hashicorp/consul@sha256:bf2ade1fb1766aac082bf48c573d2ffca71327a7973203191da858e1d4f1e404
    Ports:         8500/TCP, 8301/TCP, 8301/UDP, 8302/TCP, 8300/TCP, 8600/TCP, 8600/UDP
    Host Ports:    0/TCP, 0/TCP, 0/UDP, 0/TCP, 0/TCP, 0/TCP, 0/UDP
    Command:
      /bin/sh
      -ec
      CONSUL_FULLNAME="consul-consul"

      exec /bin/consul agent \
        -advertise="${ADVERTISE_IP}" \
        -bind=0.0.0.0 \
        -bootstrap-expect=1 \
        -client=0.0.0.0 \
        -config-dir=/consul/config \
        -datacenter=dc1 \
        -data-dir=/consul/data \
        -domain=consul \
        -hcl="connect { enabled = true }" \
        -ui \
        -retry-join="${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc:8301" \
        -serf-lan-port=8301 \
        -server

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 04 Feb 2021 04:39:15 +0000
      Finished:     Thu, 04 Feb 2021 04:39:15 +0000
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:      100m
      memory:   100Mi
    Readiness:  exec [/bin/sh -ec curl http://127.0.0.1:8500/v1/status/leader \
2>/dev/null | grep -E '".+"'
] delay=5s timeout=5s period=3s #success=1 #failure=2
    Environment:
      ADVERTISE_IP:   (v1:status.podIP)
      POD_IP:         (v1:status.podIP)
      NAMESPACE:     default (v1:metadata.namespace)
    Mounts:
      /consul/config from config (rw)
      /consul/data from data-default (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-consul-server-token-kpqbm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data-default:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-default-consul-consul-server-0
    ReadOnly:   false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      consul-consul-server-config
    Optional:  false
  consul-consul-server-token-kpqbm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  consul-consul-server-token-kpqbm
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  9m43s                   default-scheduler  Successfully assigned default/consul-consul-server-0 to k8s-node-01
  Normal   Pulled     8m7s (x5 over 9m43s)    kubelet            Container image "hashicorp/consul:1.9.2" already present on machine
  Normal   Created    8m7s (x5 over 9m42s)    kubelet            Created container consul
  Normal   Started    8m7s (x5 over 9m42s)    kubelet            Started container consul
  Warning  BackOff    4m41s (x26 over 9m41s)  kubelet            Back-off restarting failed container

ndhanushkodi commented 3 years ago

Can you provide the Consul server logs as well, since its events say Back-off restarting failed container?

somethingwentwell commented 3 years ago

Hi @ndhanushkodi, here it is:

warren@k8s-master:~$ kubectl logs consul-consul-server-0
==> failed to setup node ID: failed to write NodeID to disk: open /consul/data/node-id: permission denied

lkysow commented 3 years ago

I think the persistent volume's permissions are wrong. Here's what they are on GKE:

 $ ls -la /consul/data/
total 40
drwxrwsr-x    5 root     consul        4096 Feb  3 00:14 .
drwxr-xr-x    4 consul   consul        4096 Jan 20 23:59 ..
-rw-r--r--    1 consul   consul         394 Feb  3 00:14 checkpoint-signature
drwxrws---    2 root     consul       16384 Feb  3 00:14 lost+found
-rw-------    1 consul   consul          36 Feb  3 00:14 node-id
drwxr-sr-x    3 consul   consul        4096 Feb  3 00:14 raft
drwxr-sr-x    2 consul   consul        4096 Feb  3 00:14 serf

somethingwentwell commented 3 years ago

@lkysow @ndhanushkodi I recreated the PV with host path /consul/data; the YAML file is below:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-default-consul-consul-server-0
  labels:
    type: local
spec:
  storageClassName: ""
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/consul/data"

Then I tried chmod 777 on /consul/data, but it still shows the same error message.

warren@k8s-master:~$ ls -la /consul/data
total 12
drwxrwxrwx 3 root   root   4096 Feb  8 02:14 .
drwxr-xr-x 3 root   root   4096 Feb  8 02:11 ..
drwxrwxrwx 2 warren warren 4096 Feb  8 02:14 node-id
warren@k8s-master:~$ kubectl get po
NAME                                READY   STATUS             RESTARTS   AGE
consul-consul-9kfb5                 0/1     Running            0          4m59s
consul-consul-server-0              0/1     CrashLoopBackOff   5          4m59s
nginx-deployment-66b6c48dd5-bkvbz   1/1     Running            0          4d23h
nginx-deployment-66b6c48dd5-fkqn6   1/1     Running            0          4d23h
nginx-deployment-66b6c48dd5-srk6x   1/1     Running            0          4d23h
warren@k8s-master:~$ kubectl logs consul-consul-server-0
==> failed to setup node ID: failed to write NodeID to disk: open /consul/data/node-id: permission denied
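
One thing worth double-checking here: the ls and chmod above were run on k8s-master, but the earlier describe output shows the server pod scheduled on k8s-node-01, and a hostPath PV points at a directory on whichever node the pod runs on. Something along these lines would verify the directory the pod actually mounts (the ssh step assumes shell access to the worker):

```
kubectl get pod consul-consul-server-0 -o wide    # shows the node the pod runs on
ssh k8s-node-01 'ls -la /consul/data'             # inspect the hostPath on that node
```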

somethingwentwell commented 3 years ago

I edited the Helm values to set the server securityContext to run as root, as below, and it's working.

helm values

global:
  domain: consul
  datacenter: dc1
ui:
  service:
    type: 'NodePort'
server:
  replicas: 1
  bootstrapExpect: 1
  affinity: ""
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: false
    runAsUser: 0

output

warren@k8s-master:~$ kubectl get po
NAME                                READY   STATUS             RESTARTS   AGE
consul-consul-server-0              1/1     Running            0          67s
consul-consul-v77nd                 1/1     Running            0          67s
nginx-deployment-66b6c48dd5-bkvbz   1/1     Running            0          5d2h
nginx-deployment-66b6c48dd5-fkqn6   1/1     Running            0          5d2h
nginx-deployment-66b6c48dd5-srk6x   1/1     Running            0          5d2h

But this doesn't seem like a good solution security-wise. Does anyone have a better one?

lkysow commented 3 years ago

I notice yours is owned by root:root but my example has root:consul.

Mine:

$ ls -la /consul/data/
total 40
drwxrwsr-x    5 root     consul        4096 Feb  3 00:14 .
drwxr-xr-x    4 consul   consul        4096 Jan 20 23:59 ..

Yours:

ls -la /consul/data
total 12
drwxrwxrwx 3 root   root   4096 Feb  8 02:14 .
drwxr-xr-x 3 root   root   4096 Feb  8 02:11 ..
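
If the hostPath PV is kept, one alternative to running the server as root might be to make the data directory on the node backing the PV owned by the user/group the chart runs as by default; the 100:1000 IDs below are an assumption based on consul-helm's default server securityContext (runAsUser: 100, fsGroup: 1000), so verify against the rendered StatefulSet first:

```
# Run on the node that backs the hostPath PV (k8s-node-01 in this thread).
# UID 100 / GID 1000 are assumed from the chart's default securityContext.
sudo chown -R 100:1000 /consul/data
sudo chmod -R u+rwX,g+rwX,o-rwx /consul/data
```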

lkysow commented 3 years ago

I'm going to close this for now since we haven't heard back but if you get back to us we can re-open.