openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that are provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

ERROR ==> Headless service domain does not have an IP per initial member in the cluster #1421

Closed Hr46ph closed 3 weeks ago

Hr46ph commented 1 year ago

Describe the bug
I hope this isn't something silly wrong with my cluster ... ;-)

I deployed Mayastor 2.2.0 on a Talos 1.4.5 cluster.

When installing Mayastor, the mayastor-etcd pods crash and restart with the error: ERROR ==> Headless service domain does not have an IP per initial member in the cluster

To Reproduce
Steps to reproduce the behavior: I installed 3 control-plane nodes and 3 worker nodes on KVM running on Arch Linux.

I followed the basic steps to configure Talos (without Vagrant; I deployed the VMs via script): https://www.talos.dev/v1.4/talos-guides/install/virtualized-platforms/vagrant-libvirt/

Once configured and verified running, I applied the huge pages patch and rebooted the worker nodes.

- op: add
  path: /machine/sysctls
  value:
    vm.nr_hugepages: "1024"
- op: add
  path: /machine/nodeLabels
  value:
    openebs.io/engine: mayastor
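
(For reference, a worker patch like the above can typically be applied with something along these lines; the node address and patch filename are placeholders, and the exact talosctl flags may differ between versions.)

# apply the machine-config patch to each worker node, then reboot it
talosctl patch machineconfig --nodes <worker-node-ip> --patch @worker-patch.yaml
talosctl reboot --nodes <worker-node-ip>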

I created the mayastor namespace and applied the privileged pod-security labels:

apiVersion: v1
kind: Namespace
metadata:
  labels:
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
  name: mayastor

Applied with: kubectl apply -f mayastor-namespace-privileged.yaml

Next I added the helm repo and installed as follows: helm install mayastor mayastor/mayastor -n mayastor --version 2.2.0 --set='etcd.persistence.storageClass=manual,loki-stack.loki.persistence.storageClassName=manual'

And observed the etcd nodes crashing and restarting.

NAME                                          READY   STATUS             RESTARTS        AGE
mayastor-agent-core-6846c47db9-7rwgb          0/2     Init:0/1           0               31m
mayastor-agent-ha-node-9qtjj                  0/1     Init:0/1           0               31m
mayastor-agent-ha-node-dmjcp                  0/1     Init:0/1           0               31m
mayastor-agent-ha-node-td759                  0/1     Init:0/1           0               31m
mayastor-api-rest-6f6648d548-ncx8v            0/1     Init:0/2           0               31m
mayastor-csi-controller-866cd589f4-dw4g4      0/3     Init:0/1           0               31m
mayastor-csi-node-2cv48                       2/2     Running            0               31m
mayastor-csi-node-dcwpc                       2/2     Running            0               31m
mayastor-csi-node-s6bdw                       2/2     Running            0               31m
mayastor-etcd-0                               0/1     CrashLoopBackOff   8 (4m49s ago)   31m
mayastor-etcd-1                               0/1     CrashLoopBackOff   8 (4m41s ago)   31m
mayastor-etcd-2                               0/1     Running            9 (5m12s ago)   31m
mayastor-io-engine-2fgrq                      0/2     Pending            0               31m
mayastor-io-engine-4jrdm                      0/2     Pending            0               31m
mayastor-io-engine-t8jzr                      0/2     Pending            0               31m
mayastor-loki-0                               1/1     Running            0               31m
mayastor-obs-callhome-6b7dc5c58c-psvz9        1/1     Running            0               31m
mayastor-operator-diskpool-64ccd7c7cc-kc9k6   0/1     Init:0/2           0               31m
mayastor-promtail-846sr                       1/1     Running            0               31m
mayastor-promtail-9js4q                       1/1     Running            0               31m
mayastor-promtail-zvjnf                       1/1     Running            0               31m

Expected behavior
Running mayastor pods, ready to configure.

OS info (please complete the following information):

Additional context
https://github.com/openebs/mayastor/issues/1368

Logs and other output: kubectl logs -n mayastor mayastor-etcd-0 -f

Defaulted container "etcd" out of: etcd, volume-permissions (init)
etcd 15:34:14.18 
etcd 15:34:14.18 Welcome to the Bitnami etcd container
etcd 15:34:14.19 Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 15:34:14.20 Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 15:34:14.20 
etcd 15:34:14.21 INFO  ==> ** Starting etcd setup **
etcd 15:34:14.26 INFO  ==> Validating settings in ETCD_* env vars..
etcd 15:34:14.27 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 15:34:14.28 INFO  ==> Initializing etcd
etcd 15:34:14.29 INFO  ==> Generating etcd config file using env variables
etcd 15:34:14.34 INFO  ==> There is no data from previous deployments
etcd 15:34:14.34 INFO  ==> Bootstrapping a new cluster
etcd 15:35:14.73 ERROR ==> Headless service domain does not have an IP per initial member in the cluster

kubectl -n mayastor get ep

NAME                             ENDPOINTS                                                     AGE
mayastor-agent-core                                                                            6m29s
mayastor-api-rest                                                                              6m30s
mayastor-etcd                                                                                  6m29s
mayastor-etcd-headless           10.244.3.6:2380,10.244.4.4:2380,10.244.5.4:2380 + 3 more...   6m30s
mayastor-loki                    10.244.3.5:3100                                               6m30s
mayastor-loki-headless           10.244.3.5:3100                                               6m30s
mayastor-metrics-exporter-pool   <none>                                                        6m30s

If you find info missing, please ask. I am rather new to kubernetes and not quite fluent with all the kubectl commands, parameters and flags. Please be clear about what you need from me. Thanks for understanding!

tiagolobocastro commented 1 year ago

Since this is on Talos maybe @datacore-tilangovan has some clues here?

MerNat commented 1 year ago

The same thing is happening on my side. Any solution for that?

cswaas commented 1 year ago

I'm having the same issue with my physical cluster, Talos OS v1.4.8 and Mayastor v2.3.0. Any solution? I've been stuck on this for days.

pl4nty commented 1 year ago

I resolved this on Mayastor 2.4.0 (Talos 1.5.1) by disabling etcd persistence, but I'm not sure whether that'll break mayastor
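
The override was something like the following (field names per the Bitnami etcd sub-chart, so verify against your chart version; as noted in the next comment, this is not safe for real deployments):

etcd:
  persistence:
    enabled: false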

tiagolobocastro commented 1 year ago

Yes, that will break mayastor. How did you install 2.4? The docs seem out of date; we will fix that. By default it now comes with openebs's localpv, so you don't need to change the storage class for etcd or loki.
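
For example, on 2.4+ a plain install without the storage-class overrides should be enough (a sketch; adjust the version as needed):

helm install mayastor mayastor/mayastor -n mayastor --version 2.4.0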

pl4nty commented 1 year ago

@tiagolobocastro thanks, I saw someone else's cluster using persistence: false but it seemed dangerous. The manifests I used are here: https://github.com/pl4nty/lab-infra/blob/main/kubernetes/cluster-1/system/mayastor/mayastor.yaml

I used this Talos config to fix localpv write issues, but had the etcd issue afterwards:

machine:
  kubelet:
    extraMounts:
    - destination: /var/local/localpv-hostpath
      type: bind
      source: /var/local/localpv-hostpath
      options:
      - bind
      - rshared
      - rw

tiagolobocastro commented 1 year ago

That does seem dangerous as atm there's no way of rebuilding the configuration if etcd data is lost. Maybe @datacore-tilangovan can help here with those issues on Talos.

aep commented 1 year ago

maybe it helps someone else:

I hit this when I accidentally left the default crio-bridge IPs in /etc/cni/net.d/100-crio-bridge.conf. Something in mayastor persists them until you reinstall the whole helm chart.

sigi-tw commented 12 months ago

We face the same issue on an rke2-based cluster using mayastor through the openebs helm chart.

Haven't found a clue or direction to analyse this further. Due to the issue with etcd, a lot of other things are not coming up.

tiagolobocastro commented 12 months ago

We face the same issue on an rke2-based cluster using mayastor through the openebs helm chart.

Haven't found a clue or direction to analyse this further. Due to the issue with etcd, a lot of other things are not coming up.

You mean your etcd was set up with persistence: false?

sigi-tw commented 12 months ago

@tiagolobocastro nope; I just upgraded from an old openebs helm chart to the newest one, activated mayastor and got the same initial error message "Headless service domain...".

I haven't changed or done anything else yet besides the 'Prepare Cluster' step of updating the Hugepages.

The only other mention of this etcd error message I could find is from Bitnami, and that was some IPv4/IPv6 issue a year ago.

typokign commented 11 months ago

Edit: solved. Not entirely sure what happened, but k8s was having trouble scheduling all three pods; uninstalling and reinstalling the chart unborked it.

I am also encountering this in a fresh Talos v1.5.5 cluster, with Mayastor 2.4.0.

The "Headless service domain does not have an IP per initial member in the cluster" seems very strange considering the headless service does appear to be defined and resolving to the IP of the running etcd pod:

$ kubectl get service mayastor-etcd-headless
NAME                     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
mayastor-etcd-headless   ClusterIP   None         <none>        2379/TCP,2380/TCP   29m
$ kubectl -n default exec -it test-pod -- /bin/sh
/ # nslookup mayastor-etcd-headless.mayastor.svc.cluster.local
Server:     10.96.0.10
Address:    10.96.0.10:53

Name:   mayastor-etcd-headless.mayastor.svc.cluster.local
Address: 10.244.69.208

/ # ^D
$ kubectl get pods -o wide | grep mayastor-etcd
mayastor-etcd-0                                 0/1     Pending      0               17m   <none>          <none>         <none>           <none>
mayastor-etcd-1                                 0/1     Pending      0               27m   <none>          <none>         <none>           <none>
mayastor-etcd-2                                 0/1     Running      7 (3m10s ago)   27m   10.244.69.208   k8s-worker-3   <none>           <none>

sigi-tw commented 11 months ago

My issue got solved when I reduced the number of mayastor-etcd pods from 3 to 2, as we only run 2 nodes.

The error message did not indicate an issue like this, but it now works.

tiagolobocastro commented 11 months ago

My issue got solved when I reduced the number of mayastor-etcd pods from 3 to 2, as we only run 2 nodes.

The error message did not indicate an issue like this, but it now works.

Great, but please be aware that with 2 nodes etcd will not tolerate any node failure: https://etcd.io/docs/v3.3/faq/#why-an-odd-number-of-cluster-members. Perhaps you could set it to 1 only; then at least it'd tolerate the other node failing.
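
If you do go down to a single member, a minimal helm user-values sketch would be something like this (verify the etcd.replicaCount key against your chart version):

etcd:
  replicaCount: 1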

sigi-tw commented 11 months ago

@tiagolobocastro I was thinking about spinning up a third node for that etcd and also for some extra replication.

But is that etcd really used that intensively? I have restarted one of my two nodes a few times for testing and nothing happened. I couldn't find anything yet that would explain it.

tiagolobocastro commented 11 months ago

It's not used intensively, no; only when configuration changes happen. The reason we like to have 3 is high availability. If you only have 1 etcd instance and that node goes down, we cannot make any changes to a volume, including handling a data-plane pod failure, for example.

sigi-tw commented 11 months ago

If it's not used intensively at all, why not use the control-plane etcd?

aep commented 11 months ago

If it's not used intensively at all, why not use the control-plane etcd?

likely because mayastor works outside of k8s

sigi-tw commented 11 months ago

If it's not used intensively at all, why not use the control-plane etcd?

likely because mayastor works outside of k8s

Do you mean the mayastor driver etc., or in general?

Because the etcd is running inside k8s as a pod, and the OpenEBS mission statement also sounds to me like k8s is the main focus:

"Mayastor is a performance optimised "Container Attached Storage" (CAS) solution of the CNCF project OpenEBS. The goal of OpenEBS is to extend Kubernetes with a declarative data plane, providing flexible persistent storage for stateful applications."

It could be a good approach to save resources and increase stability to have the option to use the etcd from the k8s control plane.

tiagolobocastro commented 11 months ago

K8s is the main focus indeed, but I'd say we gain more by having a flexible approach and not locking ourselves into k8s. For example, we have a user who deployed the data-plane as systemd services when they were too far behind on the k8s versions, to avoid restarting mayastor too many times. This was probably more useful when mayastor was a bit less stable, but it's still pretty cool that it could be done.

There are of course k8s-specific things atm: helm chart, kubectl-plugin, auto-upgrade, etc. Most core components are mostly k8s-agnostic though. This also makes it very easy to develop for and test locally, just by running binaries or deploying in docker containers. Example: https://github.com/openebs/mayastor-control-plane/tree/develop/deployer

Having the proxy implement different pstor flavours (etcd, nats or k8s) seems like a good way forward, allowing the user to choose how to deploy it, maybe configurable via helm for example.

yee379 commented 11 months ago

I have the same issue; same as the original procedure, but not overriding the default storage class. On Talos 1.5.5 and Mayastor 2.4.0:

❯ kgpwide -A
mayastor         mayastor-etcd-0                                 0/1     Running     40 (5m39s ago)   3h31m   10.244.0.25   talos-f3h-jfc   <none>           <none>

❯ kg pvc -A
mayastor    data-mayastor-etcd-0      Bound     pvc-ca8c9180-435e-4227-b4d2-d4e07b4adc15   2Gi        RWO            mayastor-etcd-localpv   3h31m
❯ kdpvc -n mayastor    data-mayastor-etcd-0
Name:          data-mayastor-etcd-0
Namespace:     mayastor
StorageClass:  mayastor-etcd-localpv
Status:        Bound
Volume:        pvc-ca8c9180-435e-4227-b4d2-d4e07b4adc15
Labels:        app.kubernetes.io/instance=mayastor
               app.kubernetes.io/name=etcd
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: openebs.io/local
               volume.kubernetes.io/selected-node: talos-f3h-jfc
               volume.kubernetes.io/storage-provisioner: openebs.io/local
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      2Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       mayastor-etcd-0
Events:        <none>

❯ talosctl ls var/local/localpv-hostpath/mayastor/etcd/pvc-ca8c9180-435e-4227-b4d2-d4e07b4adc15 -H
NODE         NAME
172.16.0.5   .

❯ klf -n mayastor         mayastor-etcd-0
Defaulted container "etcd" out of: etcd, volume-permissions (init)
etcd 04:22:32.14
etcd 04:22:32.14 Welcome to the Bitnami etcd container
etcd 04:22:32.15 Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 04:22:32.15 Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 04:22:32.15
etcd 04:22:32.15 INFO  ==> ** Starting etcd setup **
etcd 04:22:32.18 INFO  ==> Validating settings in ETCD_* env vars..
etcd 04:22:32.18 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 04:22:32.19 INFO  ==> Initializing etcd
etcd 04:22:32.19 INFO  ==> Generating etcd config file using env variables
etcd 04:22:32.22 INFO  ==> There is no data from previous deployments
etcd 04:22:32.22 INFO  ==> Bootstrapping a new cluster
etcd 04:23:32.38 ERROR ==> Headless service domain does not have an IP per initial member in the cluster

marcolongol commented 8 months ago

I encountered the same issue with Talos v1.6.4 and Kubernetes v1.29.2. It appears that the culprit was my custom dnsDomain setting in the Talos configuration file.

For others experiencing this problem, it's advisable to verify if a custom dnsDomain is specified in your configuration file. You can find more information about configuring dnsDomain in the Talos documentation here.
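
In the Talos machine config that setting lives under the cluster network section, roughly like this (cluster.local is the default; anything else changes the service DNS suffix):

cluster:
  network:
    dnsDomain: cluster.local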

To troubleshoot DNS records, I spawned an ephemeral container using the following command in the same namespace:

kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot

Then, I used dig to check whether any records were found for mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.cluster.local

It turns out that the custom dnsDomain provided replaces cluster.local, thereby affecting DNS resolution.
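
Something like the following inside the netshoot pod makes the mismatch obvious (the <your-dns-domain> placeholder stands for whatever dnsDomain is set in your Talos config):

# likely no answer with the default suffix when a custom dnsDomain is configured
dig +short mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.cluster.local
# should return the pod IP once the correct suffix is used
dig +short mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.<your-dns-domain>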

tarelda commented 5 months ago

I set up a single-node cluster with microk8s and couldn't get openebs to start up successfully. After ironing out basic path issues, the etcd cluster was still not coming up, so mayastor was unusable.

My issue got solved when I reduced the number of mayastor-etcd pods from 3 to 2, as we only run 2 nodes.

The error message did not indicate an issue like this, but it now works.

As suggested, I tried to reduce replicaCount to 1. This worked flawlessly and the etcd cluster came up, but any other number caused one etcd replica to be in a boot loop while the others stayed Pending. After a bit of poking around in the Bitnami chart and reading https://github.com/bitnami/charts/issues/13880, I realised that the function hostname_has_N_ips produces the error "Headless service domain does not have an IP per initial member in the cluster" when it fails. Its real purpose is to check whether all of the cluster's initial members (in our case, the configured replicas) are online, so in my situation this condition was impossible to satisfy. Then it occurred to me that this might be caused by the scheduler itself.

https://github.com/openebs/mayastor-extensions/blob/4c8ad151c94f48a6cc6c5259083165b41609237d/chart/values.yaml#L510

From my understanding, this setting in the mayastor-extensions chart makes the scheduler refuse to place all of the replicas on one node. Setting it to the default or "soft" allowed the pods to be scheduled on one node.

To get mayastor to work on microk8s, the values.yaml overrides for the openebs/openebs chart should look something along these lines:

mayastor:
  csi:
    node:
      kubeletDir: "/var/snap/microk8s/common/var/lib/kubelet/"
  etcd:
    replicaCount: 3
    podAntiAffinityPreset: ""
    localpvScConfig:
      basePath: "/var/snap/microk8s/common/var/openebs/local/{{ .Release.Name }}/etcd" 
  loki-stack:
    localpvScConfig:
      basePath: "/var/snap/microk8s/common/var/openebs/local/{{ .Release.Name }}/loki"

I fully acknowledge that my setup is different from the multi-node clusters of other participants in this discussion, but this issue and the Bitnami chart one are the only ones that came up when googling the error message. I hope this might help.

tiagolobocastro commented 4 months ago

A bunch of different issues seem to have landed here: the number of domains/IPs per initial member, clusterDomain, and the microk8s kubelet path. For a non-default clusterDomain you can set etcd.clusterDomain. I think we should document this, @avishnu?
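
A minimal user-values sketch for that, with a hypothetical custom domain (it has to match your cluster's dnsDomain):

etcd:
  clusterDomain: my.custom.domain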

linonetwo commented 4 months ago

Same error here with k8s on a local machine with 1 master and 1 worker. This happened on both the master node and the worker node.

Maybe I should use "Installation with Replicated Storage Disabled".

I have to disable Replicated Storage, since I currently only have 1 worker, and will only add another next month.

arnoldas500 commented 2 months ago

Is there a solution to this issue yet? I am facing the same problem with 3 control nodes and replicated storage enabled.

tiagolobocastro commented 3 weeks ago

I think a few things can address some of the issues reported here:

  1. Non-local cluster domain name: etcd needs to be installed with the domain name set: https://openebs.io/docs/main/faqs/faqs#how-do-i-install-replicated-pv-mayastor-on-a-kubernetes-cluster-with-a-custom-domain
  2. etcd cluster state fix: https://github.com/openebs/mayastor-extensions/pull/536. This is not released yet, so you could work around it by setting etcd.initialClusterState to "existing" in your helm user values (see the sketch below).
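
A minimal helm user-values sketch of that workaround (only meant as a stopgap until the fix is released):

etcd:
  initialClusterState: "existing"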

Since there have been a lot of issues reported here, it's a bit confusing to figure out which is which; let's close this one, and please create a new ticket for your issue so we can track it anew.

maximemoreillon commented 1 week ago

I'm having the same issue with a single control-plane and 3 worker Talos install, following the instructions at https://www.talos.dev/v1.8/kubernetes-guides/configuration/replicated-local-storage-with-openebs/

liasica commented 1 week ago

I have the same issue: ERROR ==> Headless service domain does not have an IP per initial member in the cluster

tiagolobocastro commented 1 week ago

Can you attach a support bundle here? Here are the docs: https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/advanced-operations/supportability

maximemoreillon commented 1 week ago

After reinstalling everything, I managed to get it working. No idea what went wrong the first time.