
[BUG] Delete error backup could cause v2 volume stuck in detaching/faulted state #7575

Closed · yangchiu closed this issue 10 months ago

yangchiu commented 10 months ago

Describe the bug

Deleting replicas of a v2 volume while a backup is in progress can cause the backup to enter the Error state:

  error: 'proxyServer=10.42.2.36:8501 destination=10.42.1.31:20013: failed to get
    backup-ae5264e951594c20 backup status: rpc error: code = Internal desc = failed
    to get backup status: rpc error: code = NotFound desc = replica address 10.42.3.34:20007
    is not found in engine test-4-e-0 for getting backup backup-ae5264e951594c20 status'

And removing the errored backup can leave the volume stuck in the detaching/faulted state (screenshot attached as "stuck").
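
For reference, one way to watch for the stuck state from the CLI (a sketch; the volume name test-4 comes from the support bundle below, and .status.state / .status.robustness are the standard fields on the Longhorn Volume CR):

  # Watch the Longhorn volume CR: a stuck volume stays in state "detaching"
  # while its robustness turns "faulted"
  kubectl -n longhorn-system get volumes.longhorn.io test-4 -w \
    -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness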

To Reproduce

  1. Create a v2 volume environment (a command sketch for this step follows the list)
  2. Create a v2 volume test-1 with 3 replicas from UI and also create PV/PVC for it from UI
  3. Create a pod for it:
    cat << EOF > pod.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-pod
    spec:
      containers:
      - name: sleep
        image: busybox
        imagePullPolicy: IfNotPresent
        args: ["/bin/sh", "-c", "while true; do date; sleep 5; done"]
        volumeMounts:
        - name: pod-data
          mountPath: /data
      volumes:
      - name: pod-data
        persistentVolumeClaim:
          claimName: test-1
    EOF
    kubectl apply -f pod.yaml
  4. Write some data to the volume from inside the pod:
    dd if=/dev/urandom of=/data/test-1 bs=3M count=1024
  5. Create a backup from the UI, and while the backup is in progress, delete some replicas so that the backup enters the Error state (a command sketch for this step follows the list):
    $ kubectl get backups -n longhorn-system backup-ae5264e951594c20 -oyaml
    apiVersion: longhorn.io/v1beta2
    kind: Backup
    metadata:
      creationTimestamp: "2024-01-08T03:32:35Z"
      finalizers:
      - longhorn.io
      generation: 1
      labels:
        backup-volume: test-4
      name: backup-ae5264e951594c20
      namespace: longhorn-system
      resourceVersion: "26161"
      uid: f81b9dd2-7d32-46ec-b982-a3905599ecb7
    spec:
      labels:
        KubernetesStatus: '{"pvName":"test-4","pvStatus":"Bound","namespace":"default","pvcName":"test-4","lastPVCRefAt":"","workloadsStatus":[{"podName":"test-pod-4","podStatus":"Running","workloadName":"","workloadType":""}],"lastPodRefAt":""}'
        longhorn.io/volume-access-mode: rwo
      snapshotName: fdc06fc9-89b7-4bb5-a958-032f51d75a2c
      syncRequestedAt: null
    status:
      backupCreatedAt: ""
      compressionMethod: ""
      error: 'proxyServer=10.42.2.36:8501 destination=10.42.1.31:20013: failed to get
        backup-ae5264e951594c20 backup status: rpc error: code = Internal desc = failed
        to get backup status: rpc error: code = NotFound desc = replica address 10.42.3.34:20007
        is not found in engine test-4-e-0 for getting backup backup-ae5264e951594c20 status'
      labels: null
      lastSyncedAt: "2024-01-08T03:32:54Z"
      messages: null
      ownerID: ip-10-0-1-238
      progress: 10
      replicaAddress: ""
      size: ""
      snapshotCreatedAt: "2024-01-08T03:32:35Z"
      snapshotName: fdc06fc9-89b7-4bb5-a958-032f51d75a2c
      state: Error
      url: ""
      volumeBackingImageName: ""
      volumeCreated: ""
      volumeName: ""
      volumeSize: "21474836480"
  6. Detach the volume to trigger offline rebuilding:
    kubectl delete -f pod.yaml
  7. After offline rebuilding completed, re-attach the volume:
    kubectl apply -f pod.yaml
  8. The volume is attached and healthy, but once the errored backup is deleted from the UI, the volume gets stuck in detaching/faulted (screenshot attached as "stuck").
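
A minimal command sketch for the two steps above that are described but not spelled out (steps 1 and 5). The v2-data-engine setting name and the longhornvolume replica label are assumptions based on current Longhorn releases, and <replica-name> is a placeholder:

  # Step 1 (sketch): enable the v2 data engine on an existing Longhorn install
  kubectl -n longhorn-system patch settings.longhorn.io v2-data-engine \
    --type merge -p '{"value":"true"}'

  # Step 5 (sketch): while the backup is in progress, list the volume's replica
  # CRs and delete some of them to drive the backup into the Error state
  kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=test-1
  kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>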

Please see volume test-4 related logs in the support bundle for more details.

Expected behavior

Support bundle for troubleshooting

supportbundle_43c90f10-cff2-486a-bf7b-dc91761bf1ea_2024-01-08T04-09-06Z.zip

Environment

Additional context


derekbit commented 10 months ago

The error is triggered by an actual-size mismatch of the replica lvols. Checking the replica verification logic.

[longhorn-manager-dqzsx] time="2024-01-08T07:27:01Z" level=warning msg="Instance test-1-r-3fe7d78f is state error, error message: found mismatching lvol actual size 2750414848 with recorded prev lvol actual size 2097152 when validating lvol test-1-r-3fe7d78f-snap-rebuild-6d8f7e1c" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:206"
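
A quick way to check whether a cluster is hitting the same validation failure (a sketch, assuming the default app=longhorn-manager pod label):

  # Grep all longhorn-manager logs for the actual-size validation error
  kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 \
    | grep "mismatching lvol actual size"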
derekbit commented 10 months ago

This is a transient error that occurs in the code: when a snapshot lvol is deleted, the merge of lvols changes the lvols' actual size.

I'm wondering whether we really need the actual-size validation. Can we remove the check and revisit it later if needed? WDYT? @shuo-wu @innobead

longhorn-io-github-bot commented 10 months ago

Pre Ready-For-Testing Checklist

innobead commented 10 months ago

> This is a transient error that occurs in the code: when a snapshot lvol is deleted, the merge of lvols changes the lvols' actual size.
>
> I'm wondering whether we really need the actual-size validation. Can we remove the check and revisit it later if needed? WDYT? @shuo-wu @innobead

Sounds good to me.

roger-ryao commented 10 months ago

Verified on master-head 20240109

The test steps

  1. https://github.com/longhorn/longhorn/issues/7575#issue-2069616052

Result: passed
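
For reference, the verification itself can be scripted roughly like this (a sketch; <backup-name> is a placeholder for the errored backup). After deleting the backup, the volume should stay attached and healthy instead of flipping to detaching/faulted:

  # Delete the errored backup, then watch the volume: it should remain
  # attached/healthy rather than getting stuck in detaching/faulted
  kubectl -n longhorn-system delete backups.longhorn.io <backup-name>
  kubectl -n longhorn-system get volumes.longhorn.io test-1 -w \
    -o custom-columns=STATE:.status.state,ROBUSTNESS:.status.robustness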