
[BUG] invalid memory address or nil pointer dereference in BackupVolumeController #6998

Closed: fgleixner closed this issue 1 year ago

fgleixner commented 1 year ago

Crash loop of longhorn-manager while syncing backups

We had some performance problems with our S3 (MinIO) storage. Those problems are resolved and the synchronisation of backups with Longhorn seems to be running again, but it seems that backups missing from S3 cause longhorn-manager to crash.

To Reproduce

Probably simply delete backup files in the S3 storage?

Expected behavior

Synchronisation should not crash

Support bundle for troubleshooting

Will create one

Environment

Additional context

We see this in the error log:

time="2023-10-30T08:33:06Z" level=info msg="Found 3 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=pvc-b300031b-2c16-483a-a485-3558c2058910 controller=longhorn-backup-volume node=kube-nsr0-03 E1030 08:33:06.189153 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 794 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1d8cfa0, 0x358e2a0}) /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc001009760}) /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75 panic({0x1d8cfa0, 0x358e2a0}) /usr/local/go/src/runtime/panic.go:1038 +0x215 github.com/longhorn/longhorn-manager/controller.(BackupVolumeController).reconcile(0xc0004b7f00, {0xc000153e90, 0x28}) /go/src/github.com/longhorn/longhorn-manager/controller/backup_volume_controller.go:327 +0x1f2d github.com/longhorn/longhorn-manager/controller.(BackupVolumeController).syncHandler(0xc0004b7f00, {0xc000153e80, 0x0}) /go/src/github.com/longhorn/longhorn-manager/controller/backup_volume_controller.go:146 +0x118 github.com/longhorn/longhorn-manager/controller.(BackupVolumeController).processNextWorkItem(0xc0004b7f00) /go/src/github.com/longhorn/longhorn-manager/controller/backup_volume_controller.go:128 +0xdb github.com/longhorn/longhorn-manager/controller.(BackupVolumeController).worker(...) ....

innobead commented 1 year ago

@ChanYiLin can you take a look or see if this is something we already fixed?

cc @mantissahz

mantissahz commented 1 year ago

https://github.com/longhorn/longhorn-manager/blob/v1.4.x/controller/backup_volume_controller.go#L328

It seems I forgot to check for nil there again.
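
For readers following along, here is a minimal, self-contained sketch of the missing guard. The type and function names (BackupInfo, getBackupInfo, reconcile) are illustrative stand-ins rather than Longhorn's actual identifiers; the point is that the info decoded from the backup store can come back nil without an error when its cfg file has been deleted:

    package main

    import "fmt"

    // BackupInfo loosely mirrors the struct decoded from a backup_*.cfg file
    // in the backup store; the fields here are illustrative.
    type BackupInfo struct {
        Name   string
        Labels map[string]string
    }

    // getBackupInfo stands in for the backup target client call; it returns
    // nil with no error when the backup's cfg file is missing from the store.
    func getBackupInfo(name string) (*BackupInfo, error) {
        return nil, nil // simulate a backup deleted from S3
    }

    func reconcile(name string) error {
        info, err := getBackupInfo(name)
        if err != nil {
            return err
        }
        // This is the guard that was missing: without it, reading
        // info.Labels below dereferences a nil pointer and panics,
        // crash-looping the manager.
        if info == nil {
            fmt.Printf("backup %s not found in the backup store, skipping\n", name)
            return nil
        }
        fmt.Println(info.Labels)
        return nil
    }

    func main() {
        _ = reconcile("backup-f120795c6fee43d8")
    }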

innobead commented 1 year ago

@fgleixner Thanks for reporting this. This will be handled in the following releases.

innobead commented 1 year ago

cc @longhorn/qa

innobead commented 1 year ago

It seems we already mutate it, so this should not be an issue unless the mutation did not work correctly.

/webhook/resources/backup/mutator.go#L59-L76

    if backupLabels == nil {
        // Guard against a nil map so the label writes below cannot panic.
        backupLabels = make(map[string]string)
    }
    volumeName, isExist := backup.Labels[types.LonghornLabelBackupVolume]
    if !isExist {
        err := errors.Wrapf(err, "cannot find the backup volume label for backup %v", backup.Name)
        return nil, werror.NewInvalidError(err.Error(), "")
    }

    // Default the access-mode label from the volume spec when it is absent.
    if _, isExist := backupLabels[types.GetLonghornLabelKey(types.LonghornLabelVolumeAccessMode)]; !isExist {
        volumeAccessMode := longhorn.AccessModeReadWriteOnce
        if volume, err := b.ds.GetVolumeRO(volumeName); err == nil {
            if volume.Spec.AccessMode != "" {
                volumeAccessMode = volume.Spec.AccessMode
            }
        }
        backupLabels[types.GetLonghornLabelKey(types.LonghornLabelVolumeAccessMode)] = string(volumeAccessMode)
    }

fgleixner commented 1 year ago

Do you need more information? I cannot upload the complete support bundle because it contains many YAMLs with potentially sensitive information, but I can upload log files or provide other information. Is there any way to work around the issue and get the longhorn-manager pods working again? 10 out of 17 nodes show as down ...

innobead commented 1 year ago

@mantissahz corrected me: the backup info is the decoded result from the backup client communicating with the remote backup target. It is used to sync backup CRs that do not yet exist in the cluster from the remote target, so it is different from the backup CRs created in the cluster.

mantissahz commented 1 year ago

@fgleixner Sure, the support bundle would be helpful. Could you also provide the content of the backup.cfg on the remote backup store?

And which Longhorn version created the backups on the remote backup store? The workaround is to add an empty label in the backup.cfg.
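
For illustration, a hypothetical backup.cfg with the workaround applied. The field names are a sketch of the JSON that Longhorn's backupstore writes, not a verbatim copy, and the values are taken from this thread:

    {
        "Name": "backup-f120795c6fee43d8",
        "VolumeName": "pvc-b300031b-2c16-483a-a485-3558c2058910",
        "Labels": {}
    }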

fgleixner commented 1 year ago

logs-from-longhorn-manager-in-longhorn-manager-kbfwh.log logs-from-longhorn-manager-in-longhorn-manager-4gv8x.log

Here are some logs. The support bundle is too big to upload, even when I omit logs from other namespaces.

What do you mean by backup.cfg? I back up to a MinIO deployment on another cluster. Do you mean the secret information or the access permissions for the bucket in MinIO?

mantissahz commented 1 year ago

@fgleixner,

What do you mean with backup.cfg?

Could you check the backup config file? It would be like "backupstore/volumes/d7/f9/v1/backups/backup_backup-xxxxxx.cfg" in the bucket on the MinIO server.

I do backups to a minio deployment on another cluster

Which Longhorn version is installed in that other cluster?

Do you mean the secret information or the access permissions to the bucket in minio?

No, not the secret information. The backup config file was created by Longhorn.

And I have a patched longhorn-manager image based on v1.4.3; do you want to give it a try? Here it is: jamesluhz/lh-manager:v1.4.3p1 on Docker Hub.
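
If it helps, the backup config file can be read with the MinIO client; the alias myminio and bucket backupbucket below are assumptions, and the xxxxxx placeholder is kept from the path above:

    $ mc cat myminio/backupbucket/backupstore/volumes/d7/f9/v1/backups/backup_backup-xxxxxx.cfg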

fgleixner commented 1 year ago

@mantissahz

I am doing backups using Velero with CSI snapshots. I have configured an S3 backup target pointing to a MinIO deployment running in another cluster. That cluster also uses Longhorn 1.5.1, but I think this is not relevant, since MinIO could also run on another StorageClass.

The log file logs-from-longhorn-manager-in-longhorn-manager-kbfwh.log contains this message:

time="2023-10-30T09:29:50Z" level=error msg="Error listing backups from backup target" backupVolume=pvc-b8d5c087-8a09-442c-9f35-2587f671c412 controller=longhorn-backup-volume error="cannot find backupstore/volumes/9b/33/pvc-b8d5c087-8a09-442c-9f35-2587f671c412/volume.cfg in backupstore" node=kube-nsr0-03

But in MinIO this path, backupstore/volumes/9b/33/pvc-b8d5c087-8a09-442c-9f35-2587f671c412/volume.cfg, is deleted. If I switch on "Show deleted objects", I can see the volume.cfg and, under backups, some backup_backup-*.cfg files.

One of these (deleted) config files is attached. backup_backup-f120795c6fee43d8.cfg.txt

I will try your longhorn-manager image soon and report.
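
Since "Show deleted objects" suggests the bucket has versioning enabled, the delete markers can also be inspected from the command line. This is a sketch with an assumed alias myminio and bucket backupbucket; removing a delete marker by its version ID should restore the object:

    $ mc ls --versions myminio/backupbucket/backupstore/volumes/9b/33/pvc-b8d5c087-8a09-442c-9f35-2587f671c412/
    $ mc rm --version-id <delete-marker-id> myminio/backupbucket/backupstore/volumes/9b/33/pvc-b8d5c087-8a09-442c-9f35-2587f671c412/volume.cfg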

fgleixner commented 1 year ago

Just to be sure: to try your image, I have to do the following?

    $ cat patch.yaml
    image:
      longhorn:
        manager:
          repository: jamesluhz/lh-manager
          tag: v1.4.3p1
    $ helm -n longhorn-system upgrade longhorn longhorn/longhorn --version 1.4.3 -f patch.yaml

fgleixner commented 1 year ago

@mantissahz I tried with the above helm command. Your image was pulled, but the nodes that have problems still encounter the same errors.

fgleixner commented 1 year ago

I also tried the v1.4.3p1-debug image. The output is attached: logs-from-longhorn-manager-in-longhorn-manager-8rn55.log

mantissahz commented 1 year ago

@fgleixner Thanks for trying the patched image. I just realized that the backupInfo is nil; I have another patched image to check this, so please give it a try.

And could you show me the contents of the pvc-28d5cdf6-b627-4c85-85cb-29c4c98dffe5 volume.cfg and its two backups' backup.cfg files on the MinIO server? I would like to know why we cannot get the information: is it empty or corrupted? Are you still able to back up the volume pvc-28d5cdf6-b627-4c85-85cb-29c4c98dffe5 in the other cluster with Longhorn 1.5? It would also help if you could generate the support bundle for the first patched image and send us only the logs, without the YAML files.

longhorn-io-github-bot commented 1 year ago

Pre Ready-For-Testing Checklist

fgleixner commented 1 year ago

@mantissahz There are some changes I have observed since yesterday: on the cluster with 1.4.3, where yesterday 13 of 17 nodes had the problem, now only 4 out of 17 nodes still have it. I changed to the 1.4.3p2 image and the output of one node is attached. logs-from-longhorn-manager-in-longhorn-manager-7szkh.log

On another cluster with Longhorn 1.5.1, which did not have the problem yesterday, 5 out of 8 nodes now have it.

My guess: the S3 storage has problems and sometimes backup.cfg or volume.cfg gets deleted (I really don't know why).

The regular backup schedule then generates new backups, and this "repairs" the volumes.

The backup job on the first cluster has been running since yesterday.

I will try to get support bundles of the clusters for you.

fgleixner commented 1 year ago

@mantissahz How can I find the directory of pvc-28d5cdf6-b627-4c85-85cb-29c4c98dffe5 in MinIO? There are two directory layers, each named with a hex number.
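
For reference, the two hex layers appear to be derived from a checksum of the volume name, so they are not guessable by eye; one way to locate the directory is to search for it by path with the MinIO client, where myminio and backupbucket are assumed alias/bucket names:

    $ mc find myminio/backupbucket/backupstore/volumes --path "*pvc-28d5cdf6-b627-4c85-85cb-29c4c98dffe5*"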

fgleixner commented 1 year ago

@mantissahz OK, I found the directory. Attached is the last version of volume.cfg, but it was only visible when I clicked "Show deleted objects". Also attached are the two backup.cfgs that were visible. volume.cfg.txt backup_backup-56b7559dee9841ef.cfg.txt backup_backup-6fd3fd3486094379.cfg.txt

fgleixner commented 1 year ago

Output of longhorn-manager-7szkh

longhorn-manager-7szkh.yaml.txt

mantissahz commented 1 year ago

Output of longhorn-manager-7szkh longhorn-manager-7szkh.yaml.txt

Yeah, it is still from the v1.4.3p1 image.

fgleixner commented 1 year ago

Update: I forgot to include the values file in the helm upgrade. The patched versions of the image do fix the problem. Why the backup.cfg or volume.cfg files are gone in the first place still has to be investigated separately.
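
For anyone hitting the same issue, the upgrade that finally took effect presumably looked like this; values.yaml stands in for the installation's existing values file:

    $ helm -n longhorn-system upgrade longhorn longhorn/longhorn --version 1.4.3 -f values.yaml -f patch.yaml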

roger-ryao commented 1 year ago

Verified on master-head 20231113

The test steps

https://github.com/longhorn/longhorn/issues/6999#issuecomment-1784950510

  1. Prepare some backups on the remote S3 server.
  2. Clear the backup target setting; the backup CRs in the cluster should be deleted. Execute kubectl -n longhorn-system get backuptargets.longhorn.io -w and check the status.
  3. Delete the backup.cfg files on the remote server (see the sketch after this list).
  4. Set up the backup target settings again.
  5. The backups should be synchronized.
  6. Back up the volume again and check whether the backup.cfg files are regenerated on the remote server.
  7. Repeat steps 1 to 6 on a remote NFS server.
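
A command-line sketch of steps 2 to 4 above; the mc alias myminio, bucket backupbucket, and the example paths are assumptions:

    # Step 2: watch the BackupTarget CR while clearing the backup target setting
    $ kubectl -n longhorn-system get backuptargets.longhorn.io -w

    # Step 3: delete a backup.cfg file on the remote S3 server
    $ mc rm "myminio/backupbucket/backupstore/volumes/9b/33/<volume-name>/backups/backup_backup-xxxxxx.cfg"

    # Step 4: point Longhorn back at the backup target, e.g. via the
    # backup-target setting s3://backupbucket@us-east-1/
    $ kubectl -n longhorn-system edit settings.longhorn.io backup-target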

Result: Passed