solidfire / PowerShell

Collection of scripts, functions, and examples using the SolidFire Tools for PowerShell
MIT License

Corruption when restoring to new OpenStack instance volume. #81

Closed banksyian closed 2 years ago

banksyian commented 5 years ago

I have been testing the PowerShell Tools to back up SolidFire volumes that are used as bootable root volumes for OpenStack volume-backed instances.

While backups and restores work and appear to be consistent when restoring to the original volume, I am having trouble restoring to new volumes in the event the original was removed. I have tried creating a new instance and then restoring a backup to its root volume, and also creating a blank volume, restoring a backup to it, and then creating an instance from that volume. In both cases the instance will boot, but the data appears to have been corrupted.

There are corruption and checksum errors in the kernel log and syslog, which could possibly be resolved by running a filesystem check; however, as these are volume-backed instances, we are unable to boot them into recovery mode to run a disk check with the root volume unmounted.

Does anyone have experience with backing up and restoring instance root disks with the tool? Any advice would be appreciated.

scaleoutsean commented 5 years ago

The PowerShell Tools only pass API calls to Element OS, so it would be more appropriate to discuss this at https://community.netapp.com/t5/AFF-NVMe-EF-Series-and-SolidFire-Discussions/bd-p/flash-storage-systems-discussions (where it would be noticed by more folks). I suggest you copy-paste it there and add some context. If your OS is up and running, I would expect the image to be crash-consistent at best (so when restored, it would be as if the OS had crashed) and corrupt at worst. OS logs are open and modified at all times, so I would guess the backup is created from a crash-consistent snapshot. Maybe you could restore them to a new volume, mount that volume from another (existing) VM, run fsck, and detach. That could be a simple script, even a container-driven one-off fsck job that runs after your Restore command.
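A minimal sketch of that post-restore fsck step might look like the following. It assumes the SolidFire PowerShell Tools module is loaded, that a Linux helper VM (`fsck-helper`, hypothetical) already has iSCSI access to the restored volume (for example via a volume access group), and that cmdlet and property names (`Connect-SFCluster`, `Get-SFVolume`, `.Iqn`) match your module version; the addresses, volume name, and device path pattern below are placeholders.

```powershell
# Rough sketch of "restore, fsck from another VM, detach".
# Assumptions: SolidFire PowerShell Tools loaded; cmdlet/property names may differ
# slightly between module versions; fsck-helper is a hypothetical Linux VM with
# open-iscsi installed and access to the restored volume (no CHAP, via access group).

$Mvip       = "192.0.2.10"                 # placeholder cluster management VIP
$Svip       = "192.0.2.20"                 # placeholder storage VIP (iSCSI portal)
$HelperVm   = "fsck-helper.example.com"    # hypothetical Linux helper VM
$VolumeName = "restored-root-01"           # name of the volume the backup was restored into

Connect-SFCluster -Target $Mvip -Username admin -Password (Read-Host "Password")

# Find the restored volume and its iSCSI qualified name
$vol = Get-SFVolume | Where-Object { $_.Name -eq $VolumeName }
$iqn = $vol.Iqn

# Discover and log in from the helper VM, run a repairing fsck, then log out again.
# The /dev/disk/by-path pattern is an assumption; adjust device discovery to your setup.
ssh "root@$HelperVm" "iscsiadm -m discovery -t sendtargets -p ${Svip}:3260"
ssh "root@$HelperVm" "iscsiadm -m node -T $iqn -p ${Svip}:3260 --login"
Start-Sleep -Seconds 5
ssh "root@$HelperVm" "fsck -y /dev/disk/by-path/*${iqn}*-lun-0-part1"
ssh "root@$HelperVm" "iscsiadm -m node -T $iqn -p ${Svip}:3260 --logout"
```

The same steps could just as easily run inside a short-lived container triggered after the restore command, as mentioned above.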

banksyian commented 5 years ago

@scaleoutsean Thanks for the quick reply. I'll test mounting to another instance and move the issue over to the NetApp discussion pages if necessary.

scaleoutsean commented 5 years ago

I tested (not OpenStack, but "generic") boot from iSCSI: (a) powered off and (b) live while running a big "apt-get install". Then I tried to boot from each (restored) snapshot, and (b) fortunately recovered well (no impact). But I also tried to clone those snapshots and give them to another VM from which I ran fsck: (a) was 100% clean, while (b) had errors that fsck was able to fix. Of course, if I had been updating the kernel or doing something else, maybe it wouldn't have been fixable. I'd say you need to quiesce the VMs or power them off before you back up (or snapshot) such volumes. As far as I know, backup-to-S3 simply takes a snapshot and works with point-in-time data.
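A minimal sketch of the "power off, then snapshot" approach, assuming the OpenStack CLI is installed and authenticated on the same host as the SolidFire PowerShell Tools, and that `New-SFSnapshot`'s parameter names match your module version; the instance UUID and volume ID below are placeholders.

```powershell
# Rough sketch: stop the instance, snapshot its root volume while it is down,
# then start it again, so the snapshot is clean rather than merely crash-consistent.
# Assumptions: OpenStack CLI available and authenticated; New-SFSnapshot parameter
# names may differ between SolidFire PowerShell Tools versions.

$InstanceId = "11111111-2222-3333-4444-555555555555"   # hypothetical Nova instance UUID
$VolumeId   = 1234                                     # SolidFire volume ID of its root volume

# Stop the instance and wait until Nova reports SHUTOFF
openstack server stop $InstanceId
do {
    Start-Sleep -Seconds 5
    $status = (openstack server show $InstanceId -f value -c status | Out-String).Trim()
} while ($status -ne "SHUTOFF")

# Take a point-in-time snapshot while the guest is down, then start it again
New-SFSnapshot -VolumeID $VolumeId -Name ("clean-" + (Get-Date -Format "yyyyMMdd-HHmmss"))
openstack server start $InstanceId
```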

In the past I did a backup and restore (to/from S3) with PowerShell and ran md5sum on both devices, and the checksums were identical, so I don't suspect that snapshots or backups of quiesced disks have a storage-caused problem.
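A minimal sketch of that kind of checksum comparison, run from a Linux helper VM that can see both block devices over iSCSI; the host name and device paths below are placeholders, and both volumes would need to be attached first, as in the fsck sketch above.

```powershell
# Rough sketch: compare checksums of the original and restored block devices
# from a helper Linux VM. Host and device paths are placeholders.

$HelperVm = "fsck-helper.example.com"   # hypothetical helper VM
$SrcDev   = "/dev/sdb"                  # placeholder: original (quiesced) volume
$DstDev   = "/dev/sdc"                  # placeholder: restored volume

$srcSum = ssh "root@$HelperVm" "md5sum $SrcDev | cut -d' ' -f1"
$dstSum = ssh "root@$HelperVm" "md5sum $DstDev | cut -d' ' -f1"

if ($srcSum -eq $dstSum) {
    "Checksums match - the restored volume is bit-identical to the source."
} else {
    "Checksum mismatch: $srcSum vs $dstSum"
}
```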