shatteredsilicon / backoid

backoid - a sanoid/syncoid-like utility for object storage backup targets
GNU General Public License v3.0

Backups that take longer than retention period don't get saved #15

Open gordan-bobic opened 2 days ago

gordan-bobic commented 2 days ago

An extreme edge case, but an important one.

If a backup takes, say, 8 days to complete but the retention period is set to 7 days, the backup completes and is then immediately deleted as expired, even though there are no other backups.

data/mysql@autosnap_2024-10-28_00:01:02_daily: 6.09TiB 170:45:57 [10.4MiB/s] [===========================>] 100%
DEBUG: dumping snapshot data/mysql@autosnap_2024-10-27_00:01:01_daily to /tmp/backoid/data/mysql/autosnap_2024-10-27_00:01:01_daily
cannot open 'data/mysql@autosnap_2024-10-27_00:01:01_daily': dataset does not exist
CRITICAL ERROR: command 'zfs send data/mysql@autosnap_2024-10-27_00:01:01_daily > /tmp/backoid/data/mysql/autosnap_2024-10-27_00:01:01_daily' failed at /sbin/backoid line 369.
DEBUG: found expired object nfs:/mnt/backup/backoid/autosnap_2024-10-28_00:01:02_daily.tar.zst

So the backup was transferred (eventually), but was then deleted because it had expired by the time it was uploaded.

I am not entirely sure what the correct solution is here:

1) Document the edge case (it may be a little too absurd to address sanely).
2) If there are no other matching backups on the target, switch to a different retention strategy, e.g. count the number of backup files matching the pattern instead of looking at the dates themselves for expiry. A rough sketch of this follows below.
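A hedged sketch of option 2, not backoid's actual code: the remote name and prefix (remote:backoid/) are made up, and it assumes the daily naming pattern from the log above, whose ISO timestamps sort lexicographically, so keeping the newest N matching objects is just sort plus head:

keep=7
rclone lsf "remote:backoid/" --include "autosnap_*_daily.tar.zst" \
  | sort \
  | head -n -"${keep}" \
  | while read -r obj; do
      # delete everything except the newest ${keep} matching objects
      rclone deletefile "remote:backoid/${obj}"
    done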

gordan-bobic commented 2 days ago

Also, this doesn't seem right:

CRITICAL ERROR: command 'zfs send data/mysql@autosnap_2024-10-27_00:01:01_daily > /tmp/backoid/data/mysql/autosnap_2024-10-27_00:01:01_daily' failed at /sbin/backoid line 369.

that should never have been issued since this is not a zvol. Is this because backoid assumed it was a zvol due to the snapshot having been deleted out from under it so it couldn't find the magic hidden .zfs folder?

If so, then this should probably be zfs send -cp to preserve compression and attributes (but it still shouldn't be issued in this case).
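If the .zfs heuristic is the culprit, one option (only a sketch; ${dataset} is a placeholder and backoid's actual detection logic may differ) is to ask ZFS for the dataset type directly instead of inferring it from the hidden directory:

# "filesystem" or "volume" straight from ZFS, no .zfs probing needed
dstype=$(zfs get -H -o value type "${dataset}")
case "${dstype}" in
  volume)     echo "zvol: stream it with zfs send" ;;
  filesystem) echo "filesystem: archive it from the .zfs/snapshot directory" ;;
  *)          echo "unexpected dataset type: ${dstype}" >&2 ;;
esac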

And if it is going to /tmp, there should be a check that the stream size reported by zfs send -cp -nv doesn't exceed the space available on /tmp. But I'm still not sure why this was going to /tmp at all; it should really be streamed directly to the target.
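Something along these lines could work for the space check (a minimal sketch; ${snapshot} is a placeholder, the parsing assumes the machine-readable -P dry-run output, and df/tr are GNU coreutils):

# estimated stream size in bytes from a dry run
est=$(zfs send -cp -nvP "${snapshot}" | awk '$1 == "size" {print $2}')
# bytes currently free on /tmp
avail=$(df -B1 --output=avail /tmp | tail -n 1 | tr -d ' ')
if [ "${est}" -gt "${avail}" ]; then
  echo "estimated stream size ${est} exceeds free space ${avail} on /tmp" >&2
  exit 1
fi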

Note that it (correctly) started from the 10-28 snapshot. The zfs send part was for 10-27, which has since expired. So there should at least be a re-check to see whether a snapshot disappeared after we originally built the list, because the initial transfer took an unreasonably long time.
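The re-check itself would be cheap, something like the sketch below, where ${snapshots[@]} is a hypothetical array holding the list built at the start of the run:

for snapshot in "${snapshots[@]}"; do
  # re-check right before sending: skip entries whose snapshot vanished
  # while earlier (slow) transfers were running
  if ! zfs list -H -t snapshot -o name "${snapshot}" >/dev/null 2>&1; then
    echo "snapshot ${snapshot} no longer exists, skipping" >&2
    continue
  fi
  # ... dump/upload "${snapshot}" here ...
done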

oblitorum commented 1 day ago

that should never have been issued since this is not a zvol. Is this because backoid assumed it was a zvol due to the snapshot having been deleted out from under it so it couldn't find the magic hidden .zfs folder?

Yeah, exactly. I will fix this in the way that you suggested.

oblitorum commented 1 day ago

About what to do with the edge case: I noticed that there is an rclone touch command that can change the modification time of an existing file. I think we could update the modification time right after we have actually uploaded it, e.g. ... | rclone rcat ... ${object} && rclone touch ... ${object}.

Then, when it checks whether the file is expired, it will be measuring from the time the object actually landed on the remote instead of the time it was created.
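Spelled out a bit more (a sketch only; the remote name, path, and produce_archive_stream are placeholders, not backoid's actual invocation):

# whatever produces the archive stream goes on the left; the touch only
# runs if rcat exits successfully, so expiry is counted from completion
produce_archive_stream "${snapshot}" | rclone rcat "remote:backoid/${object}" \
  && rclone touch "remote:backoid/${object}"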

What do you think?

gordan-bobic commented 1 day ago

Hmm... Note that some endpoints will show partially written files that are still uploading. You can test this with the "local" rclone endpoint. That means that if cron is running hourly to back things up, a subsequent invocation could delete a file that is still being uploaded.

So we may actually need to use some kind of lock/PID file (a lock file containing the PID of the rclone process), so that we can check whether that process is still running or whether it died and just didn't clean up. If the process is still running, we ignore that file for pruning purposes.
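Roughly this (a sketch only; the lock directory, naming, and placeholder variables are made up, and the real thing would need to handle stale lock files more carefully):

# uploader side: record this process's PID for the duration of the upload
lock="/var/run/backoid/${object}.pid"
echo "$$" > "${lock}"
# ... run the rclone rcat upload here ...
rm -f "${lock}"

# pruner side: an object whose lock file points at a live process is
# still being uploaded, so leave it alone even if it looks expired
if [ -f "${lock}" ] && kill -0 "$(cat "${lock}")" 2>/dev/null; then
  echo "upload of ${object} still in progress, skipping prune" >&2
else
  rclone deletefile "remote:backoid/${object}"
fi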

I'm open to better ideas.

oblitorum commented 1 day ago

It does use lock files currently, although it uses two separate lock files for the upload process and the purge process; we will need to adjust it to use the same lock file if we decide to do that.

Let me see if I can come up with a better idea.