
[BUG] Delete error backup could cause v2 volume stuck in detaching/faulted state #7575

Closed · yangchiu closed this issue 10 months ago

yangchiu commented 10 months ago

Describe the bug

Deleting replicas of a v2 volume while a backup is in progress can cause the backup to enter the Error state:

  error: 'proxyServer=10.42.2.36:8501 destination=10.42.1.31:20013: failed to get
    backup-ae5264e951594c20 backup status: rpc error: code = Internal desc = failed
    to get backup status: rpc error: code = NotFound desc = replica address 10.42.3.34:20007
    is not found in engine test-4-e-0 for getting backup backup-ae5264e951594c20 status'

And removing the errored backup can leave the volume stuck in the detaching/faulted state (screenshot attached as "stuck").
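
For reference, one way to watch for the stuck state from the CLI (a sketch; the volume name test-4 comes from the support bundle below, and .status.state / .status.robustness are the standard fields on the Longhorn Volume CR):

  # Watch the Longhorn volume CR: a stuck volume stays in state "detaching"
  # while its robustness turns "faulted"
  kubectl -n longhorn-system get volumes.longhorn.io test-4 -w \
    -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness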

To Reproduce

  1. Create a v2 volume environment (a command sketch for this step follows the list)
  2. Create a v2 volume test-1 with 3 replicas from UI and also create PV/PVC for it from UI
  3. Create a pod for it:
    cat << EOF > pod.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-pod
    spec:
      containers:
      - name: sleep
        image: busybox
        imagePullPolicy: IfNotPresent
        args: ["/bin/sh", "-c", "while true; do date; sleep 5; done"]
        volumeMounts:
        - name: pod-data
          mountPath: /data
      volumes:
      - name: pod-data
        persistentVolumeClaim:
          claimName: test-1
    EOF
    kubectl apply -f pod.yaml
  4. Write some data to the volume from inside the pod:
    dd if=/dev/urandom of=/data/test-1 bs=3M count=1024
  5. Create a backup from the UI, and while the backup is in progress, delete some replicas so that the backup enters the Error state (a command sketch for this step follows the list):
    $ kubectl get backups -n longhorn-system backup-ae5264e951594c20 -oyaml
    apiVersion: longhorn.io/v1beta2
    kind: Backup
    metadata:
      creationTimestamp: "2024-01-08T03:32:35Z"
      finalizers:
      - longhorn.io
      generation: 1
      labels:
        backup-volume: test-4
      name: backup-ae5264e951594c20
      namespace: longhorn-system
      resourceVersion: "26161"
      uid: f81b9dd2-7d32-46ec-b982-a3905599ecb7
    spec:
      labels:
        KubernetesStatus: '{"pvName":"test-4","pvStatus":"Bound","namespace":"default","pvcName":"test-4","lastPVCRefAt":"","workloadsStatus":[{"podName":"test-pod-4","podStatus":"Running","workloadName":"","workloadType":""}],"lastPodRefAt":""}'
        longhorn.io/volume-access-mode: rwo
      snapshotName: fdc06fc9-89b7-4bb5-a958-032f51d75a2c
      syncRequestedAt: null
    status:
      backupCreatedAt: ""
      compressionMethod: ""
      error: 'proxyServer=10.42.2.36:8501 destination=10.42.1.31:20013: failed to get
        backup-ae5264e951594c20 backup status: rpc error: code = Internal desc = failed
        to get backup status: rpc error: code = NotFound desc = replica address 10.42.3.34:20007
        is not found in engine test-4-e-0 for getting backup backup-ae5264e951594c20 status'
      labels: null
      lastSyncedAt: "2024-01-08T03:32:54Z"
      messages: null
      ownerID: ip-10-0-1-238
      progress: 10
      replicaAddress: ""
      size: ""
      snapshotCreatedAt: "2024-01-08T03:32:35Z"
      snapshotName: fdc06fc9-89b7-4bb5-a958-032f51d75a2c
      state: Error
      url: ""
      volumeBackingImageName: ""
      volumeCreated: ""
      volumeName: ""
      volumeSize: "21474836480"
  6. Detach the volume to trigger offline rebuilding:
    kubectl delete -f pod.yaml
  7. After offline rebuilding completed, re-attach the volume:
    kubectl apply -f pod.yaml
  8. The volume is attached and healthy, but once the errored backup is deleted from the UI, the volume gets stuck in detaching/faulted (screenshot attached as "stuck").
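
A minimal command sketch for the two steps above that are described but not spelled out (steps 1 and 5). The v2-data-engine setting name and the longhornvolume replica label are assumptions based on current Longhorn releases, and <replica-name> is a placeholder:

  # Step 1 (sketch): enable the v2 data engine on an existing Longhorn install
  kubectl -n longhorn-system patch settings.longhorn.io v2-data-engine \
    --type merge -p '{"value":"true"}'

  # Step 5 (sketch): while the backup is in progress, list the volume's replica
  # CRs and delete some of them to drive the backup into the Error state
  kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=test-1
  kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>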

Please see volume test-4 related logs in the support bundle for more details.

Expected behavior

Support bundle for troubleshooting

supportbundle_43c90f10-cff2-486a-bf7b-dc91761bf1ea_2024-01-08T04-09-06Z.zip

Environment

Additional context


derekbit commented 10 months ago

The error is triggered by an actual-size mismatch of the replica lvols. Checking the replica verification logic.

[longhorn-manager-dqzsx] time="2024-01-08T07:27:01Z" level=warning msg="Instance test-1-r-3fe7d78f is state error, error message: found mismatching lvol actual size 2750414848 with recorded prev lvol actual size 2097152 when validating lvol test-1-r-3fe7d78f-snap-rebuild-6d8f7e1c" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:206"
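
A quick way to check whether a cluster is hitting the same validation failure (a sketch, assuming the default app=longhorn-manager pod label):

  # Grep all longhorn-manager logs for the actual-size validation error
  kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 \
    | grep "mismatching lvol actual size"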
derekbit commented 10 months ago

This is a transient error that occurs in the code: when a snapshot lvol is deleted, the merge of lvols changes the lvols' actual size.

I'm wondering whether we really need the actual-size validation. Can we remove the check and revisit it later if needed? WDYT? @shuo-wu @innobead

longhorn-io-github-bot commented 10 months ago

Pre Ready-For-Testing Checklist

innobead commented 10 months ago

> This is a transient error that occurs in the code: when a snapshot lvol is deleted, the merge of lvols changes the lvols' actual size.
>
> I'm wondering whether we really need the actual-size validation. Can we remove the check and revisit it later if needed? WDYT? @shuo-wu @innobead

Sounds good to me.

roger-ryao commented 10 months ago

Verified on master-head 20240109

The test steps

  1. https://github.com/longhorn/longhorn/issues/7575#issue-2069616052

Result: passed
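
For reference, the verification itself can be scripted roughly like this (a sketch; <backup-name> is a placeholder for the errored backup). After deleting the backup, the volume should stay attached and healthy instead of flipping to detaching/faulted:

  # Delete the errored backup, then watch the volume: it should remain
  # attached/healthy rather than getting stuck in detaching/faulted
  kubectl -n longhorn-system delete backups.longhorn.io <backup-name>
  kubectl -n longhorn-system get volumes.longhorn.io test-1 -w \
    -o custom-columns=STATE:.status.state,ROBUSTNESS:.status.robustness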