ainiml opened this issue 4 years ago
Is there a way to fix this manually, so I can continue testing? Otherwise, I'll have to delete everything and re-create all the volumes.
@smallteeths What's the reason for `CreatedAt` showing `Invalid date`?

@ainiml What's the result if you use `<longhorn_url>/v1/backupvolumes`? This is the API the UI was reading from. Can you check the field `created` here?
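For reference, one way to inspect that field from a shell, assuming the API is reachable at `<longhorn_url>`, `jq` is installed, and the endpoint returns the usual Rancher-style collection with a `data` array:

```sh
# List each backup volume's name and "created" field from the management API.
curl -s "http://<longhorn_url>/v1/backupvolumes" | jq '.data[] | {name, created}'
```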
Is the URL like this? https://rancher:8443/k8s/clusters/c-jzh49/api/v1/backupvolumes
```json
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "the server could not find the requested resource",
  "reason": "NotFound",
  "details": {},
  "code": 404
}
```
Oh, our management API URL doesn't work behind the Rancher proxy currently...

@ainiml Is it possible for you to create a node-port or xip ingress temporarily to access the management API? You can create a node-port service in `longhorn-system` with the target set to `longhorn-ui` and port `8000`. Remember to remove the service later though.
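A minimal sketch of that temporary service, assuming the UI runs as a Deployment named `longhorn-ui` (the service name `longhorn-ui-tmp` is made up for this example):

```sh
# Expose the Longhorn UI on a NodePort (target longhorn-ui, port 8000, as above).
kubectl -n longhorn-system expose deployment longhorn-ui \
  --type=NodePort --name=longhorn-ui-tmp --port=8000 --target-port=8000

# Look up the assigned node port, then browse to http://<node-ip>:<node-port>.
kubectl -n longhorn-system get service longhorn-ui-tmp

# Remove the service when done, as noted above.
kubectl -n longhorn-system delete service longhorn-ui-tmp
```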
`volume.cfg` indeed does not exist in that path:

```json
{
  "actions": {
    "backupDelete": "…/v1/backupvolumes/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66?action=backupDelete",
    "backupGet": "…/v1/backupvolumes/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66?action=backupGet",
    "backupList": "…/v1/backupvolumes/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66?action=backupList"
  },
  "backups": null,
  "baseImage": "",
  "created": "",
  "dataStored": "0",
  "id": "pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66",
  "lastBackupAt": "",
  "lastBackupName": "",
  "links": {
    "self": "…/v1/backupvolumes/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66"
  },
  "messages": {
    "error": "cannot find backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/volume.cfg in backupstore"
  },
  "name": "pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66",
  "size": "0",
  "type": "backupVolume"
}
```
But there are plenty of backup configs. I'll try to restore `volume.cfg` from one of the backups.
That's weird. Can you check whether the `volume.cfg` shows up if you make a new backup? Maybe something happened to the `volume.cfg`.
@yasker Copying over the backup cfg allowed me to see the backups again:

```
~ # mc cp dream/longhorn/backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/backups/backup_backup-182693ffe65842d1.cfg dream/longhorn/backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/volume.cfg
...ckup_backup-182693ffe65842d1.cfg: 3.83 KiB / 3.83 KiB  14.27 KiB/s 0s
```
@yasker So something happened to the `volume.cfg` file, but it has backup configs. Is it possible for Longhorn to restore from the backup? Since the name and path to the volume and backups are pretty unique, I think it could be safe to restore what's missing automatically (with some sort of initial error)?
@yasker I think it might also be because the backup backend doesn't use something like copy-on-write, so if writing to the file is interrupted, it ends up corrupted and deleted (which might be why `volume.cfg` doesn't exist).
@yasker I just copied all the backups back to `volume.cfg`, and the backups are showing again. That took quite a lot of time to do.
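Doing that by hand for every volume is slow; a loop along these lines could automate it (a rough sketch against the same `dream/longhorn` mc alias from the session above; the directory layout and the existence check are assumptions, so try it on a copy of the backupstore first):

```sh
# For each backup config found, restore a missing volume.cfg next to it.
# Assumes the layout seen above: .../volumes/xx/yy/<volume>/backups/backup_*.cfg
for cfg in $(mc find dream/longhorn/backupstore/volumes --name "backup_*.cfg"); do
  voldir="${cfg%/backups/*}"
  # mc stat fails when the object is missing, so only copy in that case.
  mc stat "${voldir}/volume.cfg" >/dev/null 2>&1 \
    || mc cp "$cfg" "${voldir}/volume.cfg"
done
```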
@ainiml It's not expected. We want to check more on how can this happen.
My best guess is that the backend crashes, and when scrubbing on repair and remount, it deletes the corrupted files, including `volume.cfg`.
@ainiml Do you mean the backup target is an NFS server on the same node? We recommend creating backups outside the Kubernetes cluster. Otherwise, if you lose the cluster, you can lose all your data and backups.
Backup target is a Minio bucket. Minio backend is s3ql. S3ql backend is S3.
Longhorn backend is XFS on s3backer. S3backer backend is S3.
The XFS or ext4 requirement is a very big restriction on what Longhorn can run on.
It's hard for us to support anything other than bare-metal disks or disks provided by a cloud provider. I think s3backer is the problem here.
Yeah, s3backer crashes when restoring all the volumes at once. We're restoring the backups individually now.
@ainiml For 1.0 we fixed some issues on the backup side. The `volume.cfg` will now never be deleted unless the user requests a complete deletion of the backup volume (deleting all backups).

For the NFS backup target we use the os.rename syscall, so that in a crash we either end up with the old or the new data. The same applies to the S3 backup target. There is no case where we would end up with half of the file's data.

Please let us know if you have additional backup issues after upgrading to 1.0 :)
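As an illustration of that write pattern (not Longhorn's actual code, just a shell sketch of the same idea, with `generate_volume_config` standing in as a hypothetical writer):

```sh
# Update volume.cfg without ever exposing a half-written file. mv uses
# rename(2) when source and destination are on the same filesystem, so
# readers see either the complete old file or the complete new one.
tmp=$(mktemp volume.cfg.XXXXXX)
generate_volume_config > "$tmp"   # hypothetical; a crash here leaves volume.cfg intact
mv "$tmp" volume.cfg              # atomic replacement
```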
Backup data is still intact and exists.
Browser:
Ubuntu Chrome Version 81.0.4000.3 (Official Build) dev (64-bit)
Possibly related: https://github.com/longhorn/longhorn/issues/227