tasket / wyng-backup

Fast backups for logical volumes & disk images
GNU General Public License v3.0

Archive corruption #221

Open Eric678 opened 2 days ago

Eric678 commented 2 days ago

Rather than build test cases for all the outstanding issues in my notes from a while ago (the last of which matches the title here), I will just summarise them as areas where some attention may be required; some apply to the Qubes wrapper:

  1. Deleting a qube from the running system, then deleting it from the archive, causes a vm-XXX-private not found error; wyng exits, leaving the .tick LV behind with all the data (cleanup sketch below).
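
    Clearing the leftover by hand looks roughly like this (a sketch; qubes_dom0 is the default Qubes VG name and vm-XXX is a placeholder):

        # list any snapshot LVs wyng left behind
        sudo lvs | grep -E '\.(tick|tock)'
        # remove the leftover explicitly; make sure wyng is not running
        sudo lvremove qubes_dom0/vm-XXX-private.tick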

  2. Following on from #218: the backup that failed with tar exit 1 was rerun and failed at the same qube, so I tried to delete that qube from the archive and got an odd close() error:

    Traceback (most recent call last):
      File "/usr/local/bin/wyng", line 5232, in <module>
        delete_volume(storage, aset, **{"vid": options.vid} if options.vid
      File "/usr/local/bin/wyng", line 4692, in delete_volume
        aset.dest.run(["rm -rf '%s'" % dvid], destcd=aset.dest.path)
      File "/usr/local/bin/wyng", line 2417, in run
        return do_exec([cmd], infile=infile, inlines=inlines, out=out,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/bin/wyng", line 2799, in do_exec
        return close()[-1]
               ^^^^^^^
      File "/usr/local/bin/wyng", line 2788, in close
        raise SPr.CalledProcessError([x for x in rclist if x][0], commands)
    subprocess.CalledProcessError: Command '[['/usr/bin/qvm-run', '--no-color-stderr', '--no-color-output', '-p', 'media', '/usr/bin/sh /tmp/wyngrpc/tmpkavw_6rf']]' returned non-zero exit status 1.

    That output is from another failure, as I did not keep the terminal output from the delete above, but the result was the same. Removing the .tick and .tock LVs allowed the archive delete to go through OK.
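
    Since CalledProcessError reports only the exit status, the destination's own error message gets lost. A hedged way to see what the destination side reports is to run a similar command by hand; the qube name media is from the traceback, while the archive path and volume ID are placeholders:

        qvm-run -p media "ls -ld '/path/to/archive/Vol_XXXXXX'"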

  3. At this point I had an archive with three different associated chunk sizes: a 64k root volume, an archive with a 128k internal chunk size, and other volumes with a 512k external chunk size. Perhaps that caused the rest of my problems? Attempting to delete the root volume from the archive got the odd close() error above, with root.tick left present and active; I removed it. The next command was very slow to start, around a minute; I assume it had detected a problem and was repairing? I then deleted another volume from the archive, an import-other, and got the same close() error as above, except this time the next command's startup was quick?

  4. During the backup I noticed that one qube was much bigger than it should have been, around 5GB. I tracked it back to incomplete cleanup by Qubes after one of my many system crashes following a bad dom0 update: the -private filesystem was marked clean with errors (lots of orphaned inodes, plus bad block & inode maps and free counts), not surprising as that qube has a high throughput of files. After cleaning it up (see the fsck sketch below) I ran another wyng backup of that qube; its size was OK and list showed the 2 sessions as expected. I then ran wyng-util-qubes prune qube, since my reading of the doco suggested this should prune all except the latest session, which is what I wanted, but it failed with "Please specify a session, an autoprune mode or both." even though autoprune was set in wyng.ini?
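
    For anyone hitting the same thing: a forced check along these lines does the cleanup. The device path assumes the usual Qubes thin LV naming, and the qube must be shut down first:

        sudo e2fsck -f /dev/qubes_dom0/vm-XXX-private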

  5. So I ran the prune specifying the 1st session (the initial full image with the corrupted fs) explicitly and it went through OK. This brings up another issue: it appears the session time labels internally include the user's TZ, rather than being stored in UTC and just presented in local TZ? I had odd session labels from when I backed up while in a different timezone (see the date comparison below).
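
    The difference is easy to demonstrate with date; whether wyng builds its labels this way internally is my assumption, but the format matches the session names above:

        date +S_%Y%m%d-%H%M%S     # local wall clock; labels shift when the TZ changes
        date -u +S_%Y%m%d-%H%M%S  # UTC; labels stay stable across timezones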

  6. A list of the qube gave the correct result, just 1 session. Then it was all bad: an arch-check does not get past the 1st pass, showing that 6 other volumes have been corrupted by the above prune, like:

    
    vm-anon-whonix-private - S_20241019-160537 - ValueError('Vol_17d24c Session S_20241019-160537 hash hxwi7mZNi3GD3XGTRInozL-L_X5eZTTMX8JlBV-KCFs=, expected f38oqTe9gG7o8M6BJENvCGqr6_gAL4_e-jFcSIZjHJE=')
    Missing oldest. [0] = S_20241027-101452 -> S_20241019-160537
    Volume 'vm-anon-whonix-private' (Vol_17d24c) cannot be recovered.

    vm-arch-private - S_20241019-160537 - ValueError('Vol_18e888 Session S_20241019-160537 hash EYlsaPXF14PXNPiGwMzUmmNpLbn1EzlMNhl6BsV5DB0=, expected xr8Dk-1UQH1g6jRCIyt0bh7seKu7NM00bAmQkKdtXxo=')
    Missing oldest. [0] = S_20241027-101452 -> S_20241019-160537
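
    Those digests look like urlsafe-base64 SHA-256 values, so presumably the check recomputes a file hash and compares it to a stored one. Purely as an illustration (the hash algorithm and file name are my guesses):

        openssl dgst -sha256 -binary Vol_17d24c/S_20241019-160537/manifest | basenc --base64url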

[ed] I now notice that the 6 corrupted volumes are the first 6 volumes in the backup, before the volume that previously failed with tar exit 1; they thus had a couple of null backups in the following runs until I managed to delete the problem volume. The volume that I pruned is further down the backup.

The above single-session, single-volume prune is the only prune I have done, and now almost everything fails during startup:

    Traceback (most recent call last):
      File "/usr/local/bin/wyng", line 5054, in <module>
        aset = get_configs(options) ; dest = aset.dest
               ^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/bin/wyng", line 2230, in get_configs
        aset = get_configs_remote(dest, cachedir, opts) ; os.utime(aset.path)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/bin/wyng", line 2307, in get_configs_remote
        return ArchiveSet(arch_dir, dest, opts, children=depth, allvols=True, prior_auth=aset)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/bin/wyng", line 223, in __init__
        self.load_volumes(children)
      File "/usr/local/bin/wyng", line 254, in load_volumes
        if not handle: raise e
                       ^^^^^^^
      File "/usr/local/bin/wyng", line 250, in load_volumes
        vol = ArchiveVolume(self, vid, hashval, pjoin(self.path,vid),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/bin/wyng", line 427, in __init__
        self.load_sessions()
      File "/usr/local/bin/wyng", line 449, in load_sessions
        if not handle: raise e
                       ^^^^^^^
      File "/usr/local/bin/wyng", line 445, in load_sessions
        s = sessions[sname] = Ses(self, sname, s.hashval, path+"/"+sname)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/bin/wyng", line 654, in __init__
        raise ValueError("%s Session %s hash %s, expected %s" \
    ValueError: Vol_17d24c Session S_20241019-160537 hash hxwi7mZNi3GD3XGTRInozL-L_X5eZTTMX8JlBV-KCFs=, expected f38oqTe9gG7o8M6BJENvCGqr6_gAL4_e-jFcSIZjHJE=


I deleted that broken archive a while ago as I needed the space, sorry...
tasket commented 1 day ago

Will have to look into the first two; I may ask for more info. A hash error generally indicates that something in the archive got corrupted, which can happen due to backup drive or filesystem problems, or plain data corruption in the archive itself.

  1. The chunk size is not an issue when moving between archives. However, it's plausible that regularly switching between archives can trigger bugs. I noticed you're using Thin LVM, and multiple archives aren't smoothly accommodated in that env (if the same vol is in multiple archives; otherwise it shouldn't matter).
  2. Yeah, you need to add some options to that. The autoprune params probably didn't cover that older session; in that case use --session=nnnnnnnn-nnnnnn and maybe also --all-before if there is more than one to remove (see the example after this list).
  3. Yes, there is a readme comment under troubleshooting or tips that implies the session labels are local tz. For that reason sessions may appear to be out-of-order when they're actually not.
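
For example, with the session labels from this thread (an illustrative sketch; check wyng's help output for the exact option semantics):

    # prune one specific session
    wyng prune --session=20241019-160537
    # prune all sessions older than the named one
    wyng prune --session=20241027-101452 --all-before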