@ChanYiLin can you take a look or see if this is something we already fixed?
cc @mantissahz
https://github.com/longhorn/longhorn-manager/blob/v1.4.x/controller/backup_volume_controller.go#L328 It seems I forgot to check for nil here again.
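A minimal, self-contained sketch of the kind of guard that is missing; the struct, helper, and values below are stand-ins for illustration only, not the actual longhorn-manager code.

package main

import "fmt"

// BackupInfo stands in for the backup information decoded from the remote
// backup target; the real type lives in longhorn-manager, this is only a mock.
type BackupInfo struct {
	Labels map[string]string
}

// getBackupInfo simulates the case where the backup.cfg is missing or
// corrupted on the remote target: no error is returned, but the decoded
// backup info is nil.
func getBackupInfo(name string) (*BackupInfo, error) {
	return nil, nil
}

func main() {
	backupInfo, err := getBackupInfo("backup-f120795c6fee43d8")
	if err != nil {
		fmt.Println("failed to get backup info:", err)
		return
	}
	// This is the guard the reconcile loop needs: without it, reading
	// backupInfo.Labels panics with "invalid memory address or nil pointer
	// dereference", matching the stack trace reported below.
	if backupInfo == nil {
		fmt.Println("backup info is nil, skipping this backup")
		return
	}
	fmt.Println("labels:", backupInfo.Labels)
}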
@fgleixner Thanks for reporting this. This will be handled in the following releases.
cc @longhorn/qa
It seems we already mutate it, so this should not be an issue unless the mutation did not work correctly.
/webhook/resources/backup/mutator.go#L59-L76
if backupLabels == nil {
	backupLabels = make(map[string]string)
}
volumeName, isExist := backup.Labels[types.LonghornLabelBackupVolume]
if !isExist {
	err := errors.Wrapf(err, "cannot find the backup volume label for backup %v", backup.Name)
	return nil, werror.NewInvalidError(err.Error(), "")
}
if _, isExist := backupLabels[types.GetLonghornLabelKey(types.LonghornLabelVolumeAccessMode)]; !isExist {
	volumeAccessMode := longhorn.AccessModeReadWriteOnce
	if volume, err := b.ds.GetVolumeRO(volumeName); err == nil {
		if volume.Spec.AccessMode != "" {
			volumeAccessMode = volume.Spec.AccessMode
		}
	}
	backupLabels[types.GetLonghornLabelKey(types.LonghornLabelVolumeAccessMode)] = string(volumeAccessMode)
}
Do you need more information? I cannot upload the complete support bundle because it contains many YAMLs with potentially sensitive information, but I can upload log files or provide other information. Is there any way to work around the issue to get the longhorn-manager pods working again? 10 out of 17 nodes show as down...
@mantissahz corrected me that the backup info is the decoded result from the backup client communicating with the remote backup target. It is used for syncing backup CRs that do not yet exist in the cluster from the remote target, so it is different from the backup CRs created in the cluster.
@fgleixner sure, the support bundle is helpful. Could you also provide the content of backup.cfg on the remote backup store?
And the backups on the remote backup store were created by which Longhorn version? The workaround is to add an empty label in the backup.cfg.
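That is, roughly, each backup_backup-*.cfg on the remote backup store should still contain a Labels entry, even an empty one. A trimmed illustration of such a fragment (the other fields and values here are placeholders; the real file contains more fields):

{
  "Name": "backup-xxxxxx",
  "VolumeName": "pvc-xxxxxx",
  "Labels": {}
}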
logs-from-longhorn-manager-in-longhorn-manager-kbfwh.log logs-from-longhorn-manager-in-longhorn-manager-4gv8x.log
Here are some logs. The support bundle is too big to upload, even when I omit logs from other namespaces.
What do you mean by backup.cfg? I do backups to a minio deployment on another cluster. Do you mean the secret information or the access permissions for the bucket in minio?
@fgleixner,

> What do you mean by backup.cfg?

Could you check the backup config file? It would be like "backupstore/volumes/d7/f9/v1/backups/backup_backup-xxxxxx.cfg" in the bucket on the minio server.

> I do backups to a minio deployment on another cluster

Which Longhorn version did you install in this other cluster?

> Do you mean the secret information or the access permissions for the bucket in minio?

No, not the secret information. It (the backup config file) was created by Longhorn.

And I have a patched longhorn-manager image based on v1.4.3; do you want to give it a try? Here it is on Docker Hub: jamesluhz/lh-manager:v1.4.3p1
@mantissahz
I am doing backups using Velero with CSI snapshots. I have configured an S3 backup target pointing to a minio deployment running in another cluster. That cluster also uses Longhorn, version 1.5.1, but I think this is not relevant, since minio could also run on another storage class.
In the log file logs-from-longhorn-manager-in-longhorn-manager-kbfwh.log there is this message:
time="2023-10-30T09:29:50Z" level=error msg="Error listing backups from backup target" backupVolume=pvc-b8d5c087-8a09-442c-9f35-2587f671c412 controller=longhorn-backup-volume error="cannot find backupstore/volumes/9b/33/pvc-b8d5c087-8a09-442c-9f35-2587f671c412/volume.cfg in backupstore" node=kube-nsr0-03
But in minio this path backupstore/volumes/9b/33/pvc-b8d5c087-8a09-442c-9f35-2587f671c412/volume.cfg is deleted. If I switch on "Show deleted objects", then I can see the volume.cfg and, under backups, some backup_backup- files.
One of these (deleted) config files is attached. backup_backup-f120795c6fee43d8.cfg.txt
I will try your longhorn-manager image soon and report.
Just to be sure, to try your image I have to do the following?

$ cat patch.yaml
image:
  longhorn:
    manager:
      repository: jamesluhz/lh-manager
      tag: v1.4.3p1
$ helm -n longhorn-system upgrade longhorn longhorn/longhorn --version 1.4.3 -f patch.yaml
@mantissahz I tried with the above helm command. Your image was pulled, but the nodes that have problems still encounter the same errors.
I also tried the v1.4.3p1-debug image. The output is attached: logs-from-longhorn-manager-in-longhorn-manager-8rn55.log
@fgleixner Thanks for trying the patched image. I just realized that the backupInfo is nil, and I have another patched image to check this, so please give it a try.
Could you also show me the contents of the pvc-28d5cdf6-b627-4c85-85cb-29c4c98dffe5 volume.cfg and its two backups' backup.cfg on the minio server? I would like to know why we cannot get the information: is it empty or corrupted?
Are you still able to do a backup for the volume pvc-28d5cdf6-b627-4c85-85cb-29c4c98dffe5 in the other cluster with Longhorn 1.5?
Maybe you can also generate the support bundle for the first patched image and send us only the logs part, without the yaml files; that would be helpful.
[ ] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at: delete the backup config files on the backup-target and the backups should be deleted, then restore them on the backup-target.
[ ] Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)? The PR is at https://github.com/longhorn/longhorn-manager/pull/2276
[ ] Which areas/issues this PR might have potential impacts on? (Area / Issues)
@mantissahz There are some changes I observed since yesterday: on the cluster with 1.4.3, where yesterday 13 of 17 nodes had the problem, now only 4 out of 17 nodes still have the problem. I changed to the 1.4.3p2 image and the output of one node is attached. logs-from-longhorn-manager-in-longhorn-manager-7szkh.log
On another cluster with Longhorn 1.5.1, which did not have a problem yesterday, 5 out of 8 nodes now have the problem.
My guess: the S3 storage has problems and sometimes backup.cfg or volume.cfg gets deleted (I really don't know why). The regular backup schedule then generates new backups and this "repairs" the volumes.
The backup job on the first cluster has been running since yesterday.
I will try to get support bundles of the clusters for you.
@mantissahz How can I find the directory of pvc-28d5cdf6-b627-4c85-85cb-29c4c98dffe5 in minio? There are two directory levels, each of them a hex number.
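(For reference, the two hex levels appear to be derived from a checksum of the volume name. A small, self-contained sketch of that layout, assuming the two directory levels are the first four hex characters of a SHA-512 checksum of the volume name; this is an assumption and should be verified against the longhorn/backupstore source:)

package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// volumePath sketches the backupstore layout: two directory levels taken
// from the hex checksum of the volume name, then the volume name itself.
// The SHA-512 choice here is an assumption, not verified.
func volumePath(volumeName string) string {
	sum := sha512.Sum512([]byte(volumeName))
	checksum := hex.EncodeToString(sum[:])
	return filepath.Join("backupstore", "volumes", checksum[0:2], checksum[2:4], volumeName)
}

func main() {
	fmt.Println(volumePath("pvc-28d5cdf6-b627-4c85-85cb-29c4c98dffe5"))
}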
@mantissahz OK, I found the directory. Attached is the last version of volume.cfg, but volume.cfg was only visible when I clicked "Show deleted objects". Also attached are the two backup.cfgs which were visible. volume.cfg.txt backup_backup-56b7559dee9841ef.cfg.txt backup_backup-6fd3fd3486094379.cfg.txt
Output of longhorn-manager-7szkh: longhorn-manager-7szkh.yaml.txt
Yeah, it is still from the v1.4.3p1 image.
Update: I forgot to include the values file in the helm upgrade. The patched versions of the image fix the problem. The remaining question is why the backup.cfg or volume.cfg files are gone, which has to be investigated separately.
Verified on master-head 20231113
The test steps: https://github.com/longhorn/longhorn/issues/6999#issuecomment-1784950510
1. Set up an S3 backup target server.
2. Run kubectl -n longhorn-system get backuptargets.longhorn.io -w and check the status.
3. Delete the backup.cfg files on the remote server.
4. Check that the backup.cfg files are regenerated on the remote server.
5. Repeat the steps with an NFS server.
Result: Passed
Crashloop of longhorn-manager while syncing backups
We had some performance problems with our S3 (minio) storage. These problems are resolved, but now that the synchronisation of backups with Longhorn is running again, it seems that missing backups in S3 cause the longhorn-manager to crash.
To Reproduce
Probably simply delete files in the S3 storage?
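For example, deleting a volume.cfg from the bucket with the AWS CLI should exercise the same sync path (the bucket name and object path below are placeholders):

aws s3 rm s3://<bucket>/backupstore/volumes/9b/33/pvc-b8d5c087-8a09-442c-9f35-2587f671c412/volume.cfg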
Expected behavior
Synchronisation should not crash
Support bundle for troubleshooting
Will create one
Environment
Additional context
We see this in the error log:
time="2023-10-30T08:33:06Z" level=info msg="Found 3 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=pvc-b300031b-2c16-483a-a485-3558c2058910 controller=longhorn-backup-volume node=kube-nsr0-03 E1030 08:33:06.189153 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 794 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1d8cfa0, 0x358e2a0}) /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc001009760}) /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75 panic({0x1d8cfa0, 0x358e2a0}) /usr/local/go/src/runtime/panic.go:1038 +0x215 github.com/longhorn/longhorn-manager/controller.(BackupVolumeController).reconcile(0xc0004b7f00, {0xc000153e90, 0x28}) /go/src/github.com/longhorn/longhorn-manager/controller/backup_volume_controller.go:327 +0x1f2d github.com/longhorn/longhorn-manager/controller.(BackupVolumeController).syncHandler(0xc0004b7f00, {0xc000153e80, 0x0}) /go/src/github.com/longhorn/longhorn-manager/controller/backup_volume_controller.go:146 +0x118 github.com/longhorn/longhorn-manager/controller.(BackupVolumeController).processNextWorkItem(0xc0004b7f00) /go/src/github.com/longhorn/longhorn-manager/controller/backup_volume_controller.go:128 +0xdb github.com/longhorn/longhorn-manager/controller.(BackupVolumeController).worker(...) ....