Since this is on Talos maybe @datacore-tilangovan has some clues here?
The same thing is happening on my side. Any solution for that?
I'm having the same issue with my physical cluster, Talos OS v1.4.8 and Mayastor v2.3.0. Any solution? I've been stuck on this for days.
I resolved this on Mayastor 2.4.0 (Talos 1.5.1) by disabling etcd persistence, but I'm not sure whether that'll break Mayastor.
Yes, that will break Mayastor. How did you install 2.4? The docs seem out of date; we'll fix this. By default it now comes with OpenEBS's localpv, so you don't need to change the storage class for etcd or Loki.
@tiagolobocastro thanks, I saw someone else's cluster using persistence: false, but it seemed dangerous. The manifests I used are here: https://github.com/pl4nty/lab-infra/blob/main/kubernetes/cluster-1/system/mayastor/mayastor.yaml
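For context, "disabling etcd persistence" here amounts to a values override along these lines (a sketch only, assuming the Bitnami etcd subchart's persistence.enabled key; as warned above, running Mayastor without persistent etcd data is risky):

# Sketch only: disables the PersistentVolumeClaim for the etcd subchart.
# Without persistence, Mayastor's configuration store is lost whenever the
# etcd pods are rescheduled, so this is not recommended.
etcd:
  persistence:
    enabled: false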
I used this Talos config to fix localpv write issues, but had the etcd issue afterwards:
machine:
  kubelet:
    extraMounts:
      - destination: /var/local/localpv-hostpath
        type: bind
        source: /var/local/localpv-hostpath
        options:
          - bind
          - rshared
          - rw
That does seem dangerous, as at the moment there's no way of rebuilding the configuration if etcd data is lost. Maybe @datacore-tilangovan can help here with those issues on Talos.
Maybe it helps someone else: I hit this after accidentally leaving the default crio-bridge IPs in /etc/cni/net.d/100-crio-bridge.conf. Something in Mayastor persists them until you reinstall the whole Helm chart.
We face the same issue on an RKE2-based cluster using Mayastor through the OpenEBS Helm chart. We haven't found a clue or direction for analysing this further; due to the issue with etcd, a lot of other things are not coming up.
You mean your etcd was set up with persistence: false?
@tiagolobocastro nope; I just upgraded from an old OpenEBS Helm chart to the newest one, activated Mayastor, and got the same initial error message "Headless service domain...".
I haven't changed anything yet or done anything besides the 'Prepare Cluster' step of updating the hugepages.
The only other mention of this etcd error message I found is from Bitnami, and it was some IPv4/IPv6 issue a year ago.
Edit: solved. Not entirely sure what happened, but k8s was having trouble scheduling all three pods; uninstalling and reinstalling the chart unborked it.
I am also encountering this in a fresh Talos v1.5.5 cluster, with Mayastor 2.4.0.
The "Headless service domain does not have an IP per initial member in the cluster" seems very strange considering the headless service does appear to be defined and resolving to the IP of the running etcd pod:
$ kubectl get service mayastor-etcd-headless
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mayastor-etcd-headless ClusterIP None <none> 2379/TCP,2380/TCP 29m
$ kubectl -n default exec -it test-pod -- /bin/sh
/ # nslookup mayastor-etcd-headless.mayastor.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10:53
Name: mayastor-etcd-headless.mayastor.svc.cluster.local
Address: 10.244.69.208
/ # ^D
$ kubectl get pods -o wide | grep mayastor-etcd
mayastor-etcd-0 0/1 Pending 0 17m <none> <none> <none> <none>
mayastor-etcd-1 0/1 Pending 0 27m <none> <none> <none> <none>
mayastor-etcd-2 0/1 Running 7 (3m10s ago) 27m 10.244.69.208 k8s-worker-3 <none> <none>
My issue got solved when I reduced the number of mayastor-etcd pods from 3 to 2, as we only run 2 nodes.
The error message did not indicate an issue like this, but it now works.
Great, but please be aware that with 2 nodes etcd will not tolerate any node failure: https://etcd.io/docs/v3.3/faq/#why-an-odd-number-of-cluster-members Perhaps you could set it to 1 only; then at least it'd tolerate the other node failing.
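For reference, a minimal values sketch for running a single etcd member (assuming the chart's etcd.replicaCount key, the same Bitnami etcd subchart setting shown further down this thread):

# Sketch only: one etcd member fits on a single node, but tolerates no node failure.
etcd:
  replicaCount: 1

This can go in a values file or an equivalent --set flag when installing the chart.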
@tiagolobocastro I was thinking about spinning up a third node for that etcd, and also for some extra replication.
But is that etcd really used that intensively? I have restarted one of my two nodes a few times for testing and nothing happened. I couldn't find anything yet which would explain it.
It's not used intensively, no; only when configuration changes happen. The reason we like to have 3 is high availability: if you only have 1 etcd instance and that node goes down, we cannot make any changes to a volume, including handling a data-plane pod failure, for example.
If it's not used intensively at all, why not use the control-plane etcd?
Likely because Mayastor works outside of k8s.
Do you mean the Mayastor driver etc., or in general?
Because the etcd is running inside k8s as a pod, and the mission statement for OpenEBS sounds to me like k8s is the main focus:
"Mayastor is a performance optimised "Container Attached Storage" (CAS) solution of the CNCF project OpenEBS. The goal of OpenEBS is to extend Kubernetes with a declarative data plane, providing flexible persistent storage for stateful applications."
It could be a good approach, to save resources and increase stability, to have the option of using the etcd from the k8s control plane.
K8s is the main focus indeed, but I'd say we gain more by having a flexible approach and not locking ourselves into k8s. For example, we have a user who deployed the data-plane as systemd services when they were too far behind on k8s versions, to avoid restarting Mayastor too many times. This was probably more useful when Mayastor was a bit less stable, but it's still pretty cool that it could be done.
There are of course k8s-specific things at the moment: the Helm chart, kubectl-plugin, auto-upgrade, etc. Most core components are largely k8s-agnostic though. This also makes it very easy to develop for and test locally, just by running binaries or deploying in Docker containers. Example: https://github.com/openebs/mayastor-control-plane/tree/develop/deployer
Having the proxy implement different pstor flavours (etcd, NATS or k8s) seems like a good way forward, allowing the user to choose how to deploy it, perhaps configurable via Helm for example.
I have the same issue; same as the original procedure, but without overriding the default storage class. On Talos 1.5.5 and Mayastor 2.4.0:
❯ kgpwide -A
mayastor mayastor-etcd-0 0/1 Running 40 (5m39s ago) 3h31m 10.244.0.25 talos-f3h-jfc <none> <none>
❯ kg pvc -A
mayastor data-mayastor-etcd-0 Bound pvc-ca8c9180-435e-4227-b4d2-d4e07b4adc15 2Gi RWO mayastor-etcd-localpv 3h31m
❯ kdpvc -n mayastor data-mayastor-etcd-0
Name: data-mayastor-etcd-0
Namespace: mayastor
StorageClass: mayastor-etcd-localpv
Status: Bound
Volume: pvc-ca8c9180-435e-4227-b4d2-d4e07b4adc15
Labels: app.kubernetes.io/instance=mayastor
app.kubernetes.io/name=etcd
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: openebs.io/local
volume.kubernetes.io/selected-node: talos-f3h-jfc
volume.kubernetes.io/storage-provisioner: openebs.io/local
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 2Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: mayastor-etcd-0
Events: <none>
❯ talosctl ls var/local/localpv-hostpath/mayastor/etcd/pvc-ca8c9180-435e-4227-b4d2-d4e07b4adc15 -H
NODE NAME
172.16.0.5 .
❯ klf -n mayastor mayastor-etcd-0
Defaulted container "etcd" out of: etcd, volume-permissions (init)
etcd 04:22:32.14
etcd 04:22:32.14 Welcome to the Bitnami etcd container
etcd 04:22:32.15 Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 04:22:32.15 Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 04:22:32.15
etcd 04:22:32.15 INFO ==> ** Starting etcd setup **
etcd 04:22:32.18 INFO ==> Validating settings in ETCD_* env vars..
etcd 04:22:32.18 WARN ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 04:22:32.19 INFO ==> Initializing etcd
etcd 04:22:32.19 INFO ==> Generating etcd config file using env variables
etcd 04:22:32.22 INFO ==> There is no data from previous deployments
etcd 04:22:32.22 INFO ==> Bootstrapping a new cluster
etcd 04:23:32.38 ERROR ==> Headless service domain does not have an IP per initial member in the cluster
I encountered the same issue with Talos v1.6.4 and Kubernetes v1.29.2. It appears that the culprit was my custom dnsDomain setting in the Talos configuration file.
For others experiencing this problem, it's advisable to check whether a custom dnsDomain is specified in your configuration file; see the Talos documentation on configuring dnsDomain.
To troubleshoot DNS records, I spawned an ephemeral container in the same namespace using the following command:
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
Then I used dig to check whether any records were found for mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.cluster.local. It turns out that the custom dnsDomain replaces cluster.local, thereby affecting DNS resolution.
I have set up a single-node cluster with microk8s and couldn't get OpenEBS to start up successfully. After ironing out basic path issues, the etcd cluster was still not coming up, so Mayastor was unusable.
As suggested, I tried to reduce replicaCount to 1. This worked flawlessly and the etcd cluster came up, but every other number caused one etcd replica to be in a boot loop while the others stayed Pending. After a bit of poking around in the Bitnami chart and reading https://github.com/bitnami/charts/issues/13880, I realised that the function hostname_has_N_ips, when it fails, causes the error "Headless service domain does not have an IP per initial member in the cluster". Its real purpose is to check whether all of the cluster's initial members (in our case the configured replicas) are online. So in my situation this condition was impossible to satisfy. Then I came up with the idea that this might be caused by the scheduler itself.
From my understanding, the podAntiAffinityPreset setting in the mayastor-extensions chart causes the scheduler not to schedule all of the replicas on one node. Setting it to "" or "soft" allowed the pods to be scheduled on one node.
To get Mayastor to work on microk8s, the overrides in values.yaml for the openebs/openebs chart should look something along these lines:
mayastor:
  csi:
    node:
      kubeletDir: "/var/snap/microk8s/common/var/lib/kubelet/"
  etcd:
    replicaCount: 3
    podAntiAffinityPreset: ""
    localpvScConfig:
      basePath: "/var/snap/microk8s/common/var/openebs/local/{{ .Release.Name }}/etcd"
  loki-stack:
    localpvScConfig:
      basePath: "/var/snap/microk8s/common/var/openebs/local/{{ .Release.Name }}/loki"
I fully acknowledge that my setup is different from the multi-node clusters of other participants in this discussion, but this issue and the Bitnami chart one were the only ones that came up when googling the error message. I hope this might help.
A bunch of different issues seem to have landed here: replica count vs. node count, a custom clusterDomain, and the microk8s kubelet path.
For a non-default clusterDomain you can set etcd.clusterDomain. I think we should document this, @avishnu?
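For example, if the cluster's DNS domain (the Talos dnsDomain) is set to something other than cluster.local, a values override along these lines should let the etcd subchart build its member domains correctly (my.cluster.example is a placeholder, not a value from this thread):

# Placeholder: replace my.cluster.example with the dnsDomain from your machine config.
etcd:
  clusterDomain: my.cluster.example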
Same error here with k8s on a local machine with 1 master and 1 worker; this happened on both the master node and the worker node.
Maybe I should use "Installation with Replicated Storage Disabled". I have to disable Replicated Storage, since I currently only have 1 worker and will only add another next month.
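For reference, the "Installation with Replicated Storage Disabled" option from the OpenEBS docs corresponds roughly to this values override for the openebs/openebs umbrella chart (a sketch, assuming the engines.replicated.mayastor.enabled key those docs use):

# Sketch: install OpenEBS without the Mayastor (replicated storage) engine.
engines:
  replicated:
    mayastor:
      enabled: false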
Is there a solution to this issue yet? I am facing the same problem with 3 control nodes and replicated storage enabled.
I think a few things can address some of the issues reported here.
Since there have been a lot of issues reported in this thread, it's a bit confusing to figure out which is which; let's close this one, and please create a new ticket for your issue so we can track it anew.
I'm having the same issue with a single control-plane and 3 worker Talos install, following the instructions at https://www.talos.dev/v1.8/kubernetes-guides/configuration/replicated-local-storage-with-openebs/
I have the same question:
ERROR ==> Headless service domain does not have an IP per initial member in the cluster
Can you attach a support bundle here? Here are the docs: https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/advanced-operations/supportability
After reinstalling everything, I managed to get it working. No idea what went wrong the first time.
Describe the bug
I hope this isn't something silly wrong with my cluster ... ;-)
I deployed Mayastor 2.2.0 on a Talos 1.4.5 cluster.
When installing Mayastor, the result is that the mayastor-etcd pods crash with an error and restart.
ERROR ==> Headless service domain does not have an IP per initial member in the cluster
To Reproduce
Steps to reproduce the behavior: I installed 3 control planes and 3 worker nodes on KVM running on Arch Linux.
I followed the basic steps to configure Talos (sans Vagrant; I deployed the VMs via script): https://www.talos.dev/v1.4/talos-guides/install/virtualized-platforms/vagrant-libvirt/
Once configured and verified running, I applied the hugepages patch and rebooted the worker nodes.
I created the mayastor namespace and applied privileges:
Applied with:
kubectl apply -f mayastor-namespace-privileged.yaml
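The manifest content isn't included above; a typical reconstruction (hypothetical, not the author's exact file) is a namespace labelled for the privileged Pod Security level:

# Hypothetical reconstruction of mayastor-namespace-privileged.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mayastor
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged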
Next, I added the Helm repo and installed as follows:
helm install mayastor mayastor/mayastor -n mayastor --version 2.2.0 --set='etcd.persistence.storageClass=manual,loki-stack.loki.persistence.storageClassName=manual'
And observed the etcd nodes crashing and restarting.
Expected behavior
Running Mayastor pods, ready to configure.
OS info (please complete the following information):
Additional context
https://github.com/openebs/mayastor/issues/1368
Logs and other output:
kubectl logs -n mayastor mayastor-etcd-0 -f
kubectl -n mayastor get ep
If you find info missing, please ask. I am rather new to Kubernetes and not quite fluent with all the kubectl commands, parameters and flags. Please be clear about what you need from me. Thanks for understanding!