openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that are provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

mayastor replication does not create a new copy on available Pool in workers #1749

Open Rammurthy5 opened 2 weeks ago

Rammurthy5 commented 2 weeks ago

Describe the bug: While using OpenEBS replicated storage (Mayastor) in my Kubernetes cluster, I created a Mayastor storage class with a replication factor of 2. If the worker nodes that hold the replicas go down, Mayastor does not create a new replica on the available pools and attach it to the volume.
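The StorageClass itself is not shown in the report. As a rough sketch of what a two-replica Mayastor class might look like, the Python snippet below builds the manifest as JSON, which `kubectl apply -f -` accepts. The class name `mayastor-2` is hypothetical, and the provisioner and parameter names (`io.openebs.csi-mayastor`, `protocol`, `repl`) are taken from the OpenEBS documentation as I understand it, so they should be checked against the installed release.

```python
import json


def mayastor_storage_class(name: str, replicas: int) -> dict:
    """Build a StorageClass manifest for a replicated Mayastor volume.

    Assumption: provisioner and parameter names follow the OpenEBS docs;
    verify them against the version you actually run.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},  # hypothetical name, pick your own
        "provisioner": "io.openebs.csi-mayastor",
        "parameters": {
            "protocol": "nvmf",
            # StorageClass parameters are strings, so the count is stringified.
            "repl": str(replicas),
        },
    }


manifest = mayastor_storage_class("mayastor-2", 2)
print(json.dumps(manifest, indent=2))
```

Saved as a script, the output could be piped straight into `kubectl apply -f -`.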

To Reproduce: Steps to reproduce the behavior: install OpenEBS with Mayastor on Talos (Kubernetes OS). The command I used:

helm install openebs --namespace openebs openebs/openebs \
  --set zfs-localpv.zfsNode.encrKeysDir="/var/openebs/keys" \
  --set mayastor.etcd.localpvScConfig.basePath="/var/openebs/local/{{ .Release.Name }}/localpv-hostpath/etcd" \
  --set mayastor.loki-stack.localpvScConfig.basePath="/var/openebs/local/{{ .Release.Name }}/localpv-hostpath/loki" \
  --set mayastor.loki-stack.loki.persistence.size=1Gi \
  --set mayastor.csi.node.initContainers.enabled=false \
  --create-namespace

Expected behavior: When a worker node goes down and a pool is available on another node, replication should create a new copy on that pool.
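To make the expectation concrete, here is a toy Python model of it. This is not Mayastor's scheduler; the `Pool` type, the pool and node names, and the selection rule (an online pool on a node that does not already hold a healthy replica) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str
    node: str
    online: bool


def replacement_candidates(pools, replica_pools):
    """Return online pools that could host a replacement replica.

    A pool qualifies if it holds no replica of the volume yet and its node
    does not already host a healthy replica of the volume.
    """
    by_name = {p.name: p for p in pools}
    healthy_nodes = {
        by_name[name].node for name in replica_pools if by_name[name].online
    }
    return [
        p
        for p in pools
        if p.online and p.name not in replica_pools and p.node not in healthy_nodes
    ]


# Hypothetical cluster: the volume has replicas on pool-a and pool-b,
# and the node hosting pool-a has gone down.
pools = [
    Pool("pool-a", "node-1", online=False),
    Pool("pool-b", "node-2", online=True),
    Pool("pool-c", "node-3", online=True),
]
candidates = replacement_candidates(pools, replica_pools={"pool-a", "pool-b"})
print([p.name for p in candidates])
```

In this model the only candidate is `pool-c`, which is the pool the report expects the new copy to land on.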

Additional context: I couldn't attach the OpenEBS logs because .tar attachments aren't supported here. I can share the logs privately, in the community Slack, etc.

tiagolobocastro commented 2 weeks ago

Adding the tar here: cluster2.tar.gz

Here were the issues:

 2024-10-06T05:03:00.175720961Z stdout F 2024-10-06T05:03:00.175316Z ERROR core::volume::operations_helper: Failed to attach replica to nexus, replica.uuid: ef78c63c-7cec-4f54-9911-1507467a01e6, replica.pool: gcp-225, replica.node: gcp-225, error: "gRPC request 'share_replica' for 'Replica' failed with 'status: Internal, message: \"failed to share lvol ef78c63c-7cec-4f54-9911-1507467a01e6: NVMe persistence through power-loss failure: File exists (os error 17)\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Sun, 06 Oct 2024 05:03:00 GMT\", \"content-length\": \"0\"} }': status: Internal, message: \"failed to share lvol ef78c63c-7cec-4f54-9911-1507467a01e6: NVMe persistence through power-loss failure: File exists (os error 17)\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Sun, 06 Oct 2024 05:03:00 GMT\", \"content-length\": \"0\"} }"

The same error occurs for the nexus.
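When going through a log bundle like this, it can help to pull the key fields out of each failure line. Below is a minimal Python sketch, not part of any OpenEBS tooling. It drops terminal color codes (with or without the ESC byte, which is often lost when logs are copied), then extracts the replica fields and the OS error number. The helper names are hypothetical, and the sample line is an abbreviated copy of the error above.

```python
import errno
import re

# SGR color sequences such as "\x1b[31m"; the ESC byte is treated as optional.
ANSI_SGR = re.compile(r"\x1b?\[[0-9;]+m")


def strip_ansi(line: str) -> str:
    """Remove terminal color codes from a log line."""
    return ANSI_SGR.sub("", line)


def parse_attach_failure(line: str) -> dict:
    """Extract replica.uuid/pool/node and the OS error from a failure line."""
    clean = strip_ansi(line)
    fields = dict(re.findall(r"replica\.(uuid|pool|node): ([^,\s]+)", clean))
    match = re.search(r"\(os error (\d+)\)", clean)
    code = int(match.group(1)) if match else None
    fields["os_error"] = code
    # Map the number to its symbolic errno name, e.g. 17 -> "EEXIST".
    fields["errno_name"] = errno.errorcode.get(code) if code is not None else None
    return fields


# Abbreviated sample based on the log line above (most of the gRPC payload omitted).
sample = (
    "[31mERROR[0m [1;31mcore::volume::operations_helper[0m[31m: "
    "[31mFailed to attach replica to nexus, "
    "[1;31mreplica.uuid[0m[31m: ef78c63c-7cec-4f54-9911-1507467a01e6, "
    "[1;31mreplica.pool[0m[31m: gcp-225, "
    "[1;31mreplica.node[0m[31m: gcp-225, "
    'error: "failed to share lvol ef78c63c-7cec-4f54-9911-1507467a01e6: '
    'NVMe persistence through power-loss failure: File exists (os error 17)"'
)
result = parse_attach_failure(sample)
print(result)
```

OS error 17 is `EEXIST` ("File exists"), which matches the message text in the log.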