[BUG] Detached volumes not replicating - data loss potential - easy reproduction

kallisti5 commented 2 years ago

Describe the bug

If a volume enters a detached state, it no longer replicates.

To Reproduce

Steps to reproduce the behavior:

1. Create two PVCs on a three-node cluster, each used by a different deployment.
2. Scale one deployment down to zero.
3. The volume of the scaled-down deployment becomes "detached".
   - Replica counts can no longer be updated in the UI.
   - Cause: https://github.com/longhorn/longhorn-ui/blob/master/src/routes/volume/VolumeActions.js#L163
4. Taint and recycle each k8s node in the node pool.
5. Longhorn reports 2 healthy and 2 detached volumes after every node in the pool has been recycled, but in truth there are 2 healthy and 2 failed volumes (the two detached ones).
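
The failure mode in these steps can be illustrated with a toy model. This is only a sketch: it assumes three replicas per volume (the report does not state the replica count) and that a lost replica is rebuilt only while the volume is attached. `Volume`, `recycle_node`, and `simulate` are hypothetical names for illustration, not Longhorn APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    attached: bool
    # Names of the nodes that currently hold a usable replica.
    replica_nodes: set = field(default_factory=set)


def recycle_node(volumes, old_node, new_node):
    """Replace old_node with an empty new_node, as a rolling recycle would."""
    for vol in volumes:
        lost = old_node in vol.replica_nodes
        # The replica on the recycled node's local disk is gone.
        vol.replica_nodes.discard(old_node)
        # Assumption taken from the report: only attached volumes rebuild,
        # and a rebuild needs at least one surviving replica as its source.
        if lost and vol.attached and vol.replica_nodes:
            vol.replica_nodes.add(new_node)


def simulate():
    nodes = ["node-1", "node-2", "node-3"]
    volumes = [
        Volume("attached-a", True, set(nodes)),
        Volume("attached-b", True, set(nodes)),
        Volume("detached-a", False, set(nodes)),
        Volume("detached-b", False, set(nodes)),
    ]
    # Recycle every node in the pool, one at a time.
    for i, old in enumerate(nodes):
        recycle_node(volumes, old, f"node-{i + 4}")
    return {v.name: len(v.replica_nodes) for v in volumes}


if __name__ == "__main__":
    for name, count in simulate().items():
        print(name, "replicas left:", count)
```

In this model the attached volumes end the recycle with all three replicas on the new nodes, while the detached volumes end with none, which matches the "2 healthy and 2 failed" outcome reported above.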

Expected behavior

Longhorn should maintain replicas even when volume is in a detached state, otherwise data-loss is quietly guaranteed.

Side enhancement: Longhorn should prompt users about the status of various taint-related migrations.

Log or Support bundle

As shown in the screenshots. Volume replicas were not maintained during a rolling recycle of nodes.

[Screenshots: example_a, example b]

Environment

Longhorn version: 1.2.3
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Digital Ocean
- Number of management node in the cluster: 2
- Number of worker node in the cluster: 3
Node config
- OS type and version:
- CPU per node: 4
- Memory per node: 8 GiB
- Disk type(e.g. SSD/NVMe): SSD
- Network bandwidth between the nodes: 1Gbps
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Digital Ocean
Number of Longhorn volumes in the cluster: 4

kallisti5 commented 2 years ago

Scaling back to 1 replica... errors seen, as expected:

AttachVolume.Attach failed for volume "pvc-8d8a12e9-a0f0-40fb-aff8-1c5121be403a" : rpc error: code = Aborted desc = volume pvc-8d8a12e9-a0f0-40fb-aff8-1c5121be403a is not ready for workloads

Longhorn wasn't even aware of the fault until trying to use the detached volume

kallisti5 commented 2 years ago

After attempting to use the data on the "unreplicated detached volume", longhorn finally realizes that the volume is faulted.

[Screenshot: faulted]

Notice the other detached volume is also in a fault condition, but longhorn won't realize it until I attempt to use it.

derekbit commented 2 years ago

Longhorn assumes the user does not touch the replicas. In the detached state, one possible solution is to check for the existence of the replicas periodically, but that still cannot detect changes made by users to the data in the replicas. On the other hand, computing checksums periodically is not a good idea in either the running or the detached state: the volume-head and snapshots are modified by users or applications in the running state, and the computation consumes compute and storage resources.
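
A minimal sketch of the periodic existence check mentioned here, assuming each replica is a directory on a node's disk and that the caller already knows which paths to expect. `missing_replicas` and the `expected` mapping are hypothetical and not part of Longhorn. As noted above, a check like this only catches replicas that have disappeared; it says nothing about whether the data inside them has changed.

```python
from pathlib import Path


def missing_replicas(expected):
    """Return {volume: [replica paths that no longer exist]}.

    `expected` maps a volume name to the replica directories that should
    exist. Volumes with all replicas present are left out of the result.
    """
    report = {}
    for volume, paths in expected.items():
        gone = [str(p) for p in paths if not Path(p).is_dir()]
        if gone:
            report[volume] = gone
    return report
```

A controller could run such a check on a timer for detached volumes and mark a volume as degraded or faulted as soon as the result is non-empty, rather than waiting for the next attach.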

kallisti5 commented 2 years ago

@derekbit even ignoring the replica setting aspect: if you have a deployment scaled to zero (and the pods in that deployment are the only consumers), the still fully valid PVCs will slowly break as the Kubernetes nodes are recycled, since they no longer replicate in a detached state.

Honestly, this was the final nail in us not using longhorn. Data loss is too easy, backup restoration is too difficult.

innobead commented 2 years ago

This is a fair concern because right now Longhorn does not replicate when the volume is in the detached state. This seems rather sensitive when running Longhorn on a managed K8s cluster.

jdbaudean commented 2 years ago

Hi I was wondering if there was an ETA on when this might be addressed? We'd like to use longhorn with rancher autoscaling but autoscaling down will eventually result in data loss if detached volumes aren't replicated.

innobead commented 2 years ago

> Hi I was wondering if there was an ETA on when this might be addressed? We'd like to use longhorn with rancher autoscaling but autoscaling down will eventually result in data loss if detached volumes aren't replicated.

We will see if we can do something for 1.5, but for now it has just been added to the backlog.

joshimoo commented 2 years ago

@jdbaudean consider using 2 scaling groups, i.e. a fixed storage set (can be scaled up but not down) and a dynamic worker set (can be arbitrarily scaled).

If your provider does automatic node recycling after a time, you need to ensure that the longhorn data disk is not located on the default node disk.

Instead, a dedicated disk needs to be attached to the recycling nodes, otherwise every time the node gets recycled your data will be gone.
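
The two conditions above can be written as a simple check over a node inventory. This is a sketch under assumptions: the `Node` fields and `at_risk_nodes` are hypothetical, and the inventory would have to be filled in by hand or from your provider's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    pool_scales_down: bool   # True if the autoscaler may remove this node
    data_on_root_disk: bool  # True if Longhorn data lives on the default node disk
    holds_replicas: bool     # True if the node stores Longhorn replicas


def at_risk_nodes(nodes):
    """Return {node name: [reasons]} for storage nodes that can lose replicas."""
    risky = {}
    for node in nodes:
        if not node.holds_replicas:
            continue  # pure worker nodes may scale freely
        reasons = []
        if node.pool_scales_down:
            reasons.append("pool can scale down")
        if node.data_on_root_disk:
            reasons.append("data on default node disk")
        if reasons:
            risky[node.name] = reasons
    return risky
```

An empty result means every node that holds replicas sits in the fixed storage set and keeps its data on a dedicated disk.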

rlipscombe commented 2 years ago

I've run into this because I've got a couple of degraded longhorn volumes. The rebuild keeps failing.

I assume it's because I've got too much load in the cluster (it's k3s on 5xRPi4 nodes, for experimentation). In order to reduce the load (particularly on the degraded volumes), I scaled the affected deployments down to zero replicas. But now the replication/rebuild doesn't run.

ADN182 commented 1 year ago

Hi, I have a blocking issue with this (a volume is not replicated while it is detached)!

Indeed, if you try to drain a node for maintenance (cluster upgrade) while a pod with a nodeSelector is pinned to that node, the volume is detached and will not be replicated, and the node will never be drained, in order to respect the PDB, because it holds the last replica.

What is the way to drain a node in that case? This is a real issue: if a volume is in the detached state, Longhorn should make sure that all replicas are healthy before stopping the replication process!
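
The situation described here, a drain blocked because the node holds the last replica, can be detected before draining. The sketch below assumes you have already collected, per volume, which nodes hold a replica and whether that replica is healthy. `volumes_blocking_drain` and the shape of `replicas` are hypothetical, not a Longhorn API.

```python
def volumes_blocking_drain(replicas, node):
    """Return the volumes whose only healthy replica lives on `node`.

    `replicas` maps volume name -> {node name: replica is healthy (bool)}.
    """
    blocking = []
    for volume, placements in replicas.items():
        healthy_nodes = {n for n, healthy in placements.items() if healthy}
        if healthy_nodes == {node}:
            blocking.append(volume)
    return sorted(blocking)
```

For any volume in the returned list, one workaround is to attach it (for example by scaling its workload back up elsewhere) so that a replica can rebuild on another node before the drain is retried.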

lflfm commented 11 months ago

Any ETA on this? This is actually a very serious bug and makes Longhorn not suitable for production!

I was just starting to set up a production cluster with Longhorn and now I can't use it :disappointed:

I just lost a bunch of storage here because of this while upgrading my k8s version. Luckily this was just some production-like test environments and, of course, I have off-site backups, so no data was actually lost. But it can't be that we have to resort to DR measures because some service wasn't online during an infrastructure upgrade. This will also happen without any upgrade if the service is down long enough for each node holding its existing replicas to be refreshed.

notsrch commented 6 months ago

Is there any update or ETA on this? As the above posters state it is a pretty serious bug.

innobead commented 6 months ago

This is about the feature "offline replica rebuilding".

We are tentatively planning for 1.8.

cc @derekbit

kallisti5 commented 6 months ago

Thanks for keeping the work going!

Longhorn is a cool CSI, and has massive potential for solving a very real-world k8s problem of RWX at smaller scales... however this one bug was enough to make us go with other options.

derekbit commented 6 months ago

We can extend the v2 offline replica rebuilding to v1 in v1.8. Please see the ticket https://github.com/longhorn/longhorn/issues/8443 and https://github.com/longhorn/longhorn/blob/master/enhancements/20230616-automatic-offline-replica-rebuild.md
