longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

Backup shows "Invalid Date" #987

Open · ainiml opened this issue 4 years ago

ainiml commented 4 years ago

The backup data still exists and is intact.

Browser: Ubuntu Chrome Version 81.0.4000.3 (Official Build) dev (64-bit)

[screenshot]

Possibly related: https://github.com/longhorn/longhorn/issues/227

ainiml commented 4 years ago

Is there a way to fix this manually, so I can continue testing? Otherwise, I'll have to delete everything and re-create all the volumes

yasker commented 4 years ago

@smallteeths What's the reason for CreatedAt showing "Invalid date"?

@ainiml What's the result if you query <longhorn_url>/v1/backupvolumes? That's the API the UI reads from; can you check the created field there?
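
For reference, a direct query against that endpoint might look like the sketch below; the address placeholder and the use of jq for pretty-printing are illustrative, and only the /v1/backupvolumes path comes from the comment above.

# Fetch the backup volume list from the Longhorn API and pretty-print it,
# so the "created" field of each entry can be inspected.
curl -s "http://<longhorn_url>/v1/backupvolumes" | jq .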

ainiml commented 4 years ago

> @smallteeths What's the reason for CreatedAt showing "Invalid date"?
>
> @ainiml What's the result if you query <longhorn_url>/v1/backupvolumes? That's the API the UI reads from; can you check the created field there?

Is the URL like this? https://rancher:8443/k8s/clusters/c-jzh49/api/v1/backupvolumes

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "the server could not find the requested resource",
  "reason": "NotFound",
  "details": {

  },
  "code": 404
}
yasker commented 4 years ago

Oh, our management API URL doesn't work behind the Rancher proxy currently...

@ainiml Is it possible for you to create a NodePort or xip ingress temporarily to access the management API? You can create a NodePort service in longhorn-system with the target set to longhorn-ui and port 8000. Remember to remove the service later, though.
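
A temporary NodePort like that could be created with kubectl along these lines (a sketch only; the longhorn-ui deployment name and the service name are assumptions based on a default install):

# Expose the Longhorn UI (which serves the /v1 management API) on a NodePort.
kubectl -n longhorn-system expose deployment longhorn-ui \
  --name=longhorn-ui-nodeport --type=NodePort --port=8000 --target-port=8000

# Query http://<node-ip>:<assigned-node-port>/v1/backupvolumes, then remove the service:
kubectl -n longhorn-system delete service longhorn-ui-nodeport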

ainiml commented 4 years ago

> Oh, our management API URL doesn't work behind the Rancher proxy currently...
>
> @ainiml Is it possible for you to create a NodePort or xip ingress temporarily to access the management API? You can create a NodePort service in longhorn-system with the target set to longhorn-ui and port 8000. Remember to remove the service later, though.

volume.cfg indeed does not exist in that path:

"error": "cannot find backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/volume.cfg in backupstore",

[screenshot]

"actions": {
"backupDelete": "…/v1/backupvolumes/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66?action=backupDelete",
"backupGet": "…/v1/backupvolumes/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66?action=backupGet",
"backupList": "…/v1/backupvolumes/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66?action=backupList",
},
"backups": null,
"baseImage": "",
"created": "",
"dataStored": "0",
"id": "pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66",
"lastBackupAt": "",
"lastBackupName": "",
"links": {
"self": "…/v1/backupvolumes/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66",
},
"messages": {
"error": "cannot find backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/volume.cfg in backupstore",
},
"name": "pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66",
"size": "0",
"type": "backupVolume",
},

But there are plenty of backup configs

[screenshot]

I'll try to restore volume.cfg from one of the backups
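
Listing the volume's backupstore directory with mc is a quick way to confirm that only the backups/ prefix survived; the dream/longhorn alias and bucket are taken from the copy command shown further down.

# volume.cfg should sit next to the backups/ prefix for this volume, but only
# the per-backup .cfg files remain.
mc ls dream/longhorn/backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/
mc ls dream/longhorn/backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/backups/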

yasker commented 4 years ago

That's weird. Can you check whether volume.cfg shows up if you make a new backup? Maybe something happened to the volume.cfg.

ainiml commented 4 years ago

@yasker

Copying one of the backup .cfg files over to volume.cfg allowed me to see the backups again:

~ # mc cp dream/longhorn/backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/backups/backup_backup-182693ffe65842d1.cfg dream/longhorn/backupstore/volumes/83/72/pvc-ebe9fe2b-787a-4166-b55a-3b89d768ea66/volume.cfg
...ckup_backup-182693ffe65842d1.cfg:  3.83 KiB / 3.83 KiB  14.27 KiB/s 0s

[screenshot]

ainiml commented 4 years ago

@yasker

So something happened to the volume.cfg file, but the backup configs are still there. Is it possible for Longhorn to restore it from a backup?

Since the name and path of the volume and its backups are pretty unique, I think it would be safe to restore what's missing automatically (perhaps after surfacing an initial error)?

ainiml commented 4 years ago

@yasker

I think it might also be because the backup backend doesn't use something like copy-on-write, so if a write to the file is interrupted, the file ends up corrupted and then deleted (which might be why volume.cfg doesn't exist).

ainiml commented 4 years ago

@yasker

I just copied the backup configs back to volume.cfg for all the volumes, and the backups are showing again. That took quite a lot of time to do.

[screenshot]
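
A loop along these lines could automate that manual repair (a rough sketch only, not an officially supported procedure: it copies one backup_*.cfg over each missing volume.cfg, using the same alias and layout as the mc command above):

# For every backup volume in the backupstore, restore a volume.cfg from one of
# its backup_*.cfg files if volume.cfg is missing.
mc ls --recursive dream/longhorn/backupstore/volumes/ | awk '{print $NF}' | \
  grep '/backups/backup_.*\.cfg$' | \
  sed 's|/backups/backup_[^/]*\.cfg$||' | sort -u | \
  while read -r vol; do
    base="dream/longhorn/backupstore/volumes/$vol"
    if ! mc stat "$base/volume.cfg" >/dev/null 2>&1; then
      # Pick one of the backup configs (here simply the last one listed).
      cfg=$(mc ls "$base/backups/" | awk '{print $NF}' | grep '^backup_.*\.cfg$' | tail -n 1)
      [ -n "$cfg" ] && mc cp "$base/backups/$cfg" "$base/volume.cfg"
    fi
  done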

yasker commented 4 years ago

@ainiml It's not expected. We want to check more on how this can happen.

ainiml commented 4 years ago

> @ainiml It's not expected. We want to check more on how this can happen.

My best guess is that the backend crashed, and when scrubbing on repair and remount it deleted the corrupted files, including volume.cfg.

yasker commented 4 years ago

@ainiml Do you mean the backup target is an NFS server on the same node? We recommend creating backups outside the Kubernetes cluster. Otherwise, if you lose the cluster, you can lose all your data and backups.

ainiml commented 4 years ago

> @ainiml Do you mean the backup target is an NFS server on the same node? We recommend creating backups outside the Kubernetes cluster. Otherwise, if you lose the cluster, you can lose all your data and backups.

The backup target is a Minio bucket. The Minio backend is s3ql, and the s3ql backend is S3.

The Longhorn backend is XFS on s3backer, and the s3backer backend is S3.

The XFS or ext4 requirement is a very big restriction on what Longhorn can run on.

yasker commented 4 years ago

It's hard for us to support anything other than bare-metal disks or cloud-provider disks. I think s3backer is the problem here.

ainiml commented 4 years ago

> It's hard for us to support anything other than bare-metal disks or cloud-provider disks. I think s3backer is the problem here.

Yeah, s3backer crashes when restoring all the volumes at once. We're restoring the backups individually now.

joshimoo commented 4 years ago

@ainiml For 1.0 we fixed some issues on the backup side. volume.cfg will now never be deleted unless the user requests a complete deletion of the backup volume (delete all backups).

For the NFS backup target we use the os.rename syscall, so that after a crash we end up with either the old data or the new data. The same applies to the S3 backup target. There is no case where we would end up with half of the file's data.
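
The write-to-temp-then-rename pattern described above can be illustrated in shell (a generic sketch of the technique, not Longhorn's actual code; generate_volume_cfg is a hypothetical stand-in for whatever produces the new config):

# Write the new config to a temporary file in the same directory, then rename
# it into place. rename(2) is atomic on POSIX filesystems, so a crash leaves
# either the complete old volume.cfg or the complete new one, never a partial file.
tmp=$(mktemp volume.cfg.XXXXXX)
generate_volume_cfg > "$tmp"   # hypothetical producer of the new config
mv -f "$tmp" volume.cfg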

Please let us know if you have additional backup issues after upgrading to 1.0 :)