weihao closed this issue 2 years ago (status: Closed)
I'm now wondering if the `save-off`, `save-all flush`, and `sync` commands are not blocking. In that case, maybe the script should somehow watch for the world files' content to settle before proceeding. Do you have any theories or suggestions?
I ran it again and the same error happened. Looking at the log, the `save-on` and `save-off` commands are all executed at the same time. This will cause the file-change error if changes are still being written to the directory:
2022-01-20T15:32:32+0000 INFO waiting initial delay of 2m...
2022-01-20T15:34:32+0000 INFO waiting for rcon readiness...
2022-01-20T15:34:32+0000 INFO Command executed successfully rcon-cli save-on
2022-01-20T15:34:32+0000 INFO Command executed successfully rcon-cli save-off
2022-01-20T15:34:33+0000 INFO Command executed successfully rcon-cli save-all flush
2022-01-20T15:34:33+0000 INFO Command executed successfully sync
2022-01-20T15:34:33+0000 INFO Backing up content in /data to /backups/world-20220120-153433.tgz
tar: ./world: file changed as we read it
tar: ./world_the_end: file changed as we read it
2022-01-20T15:54:14+0000 WARN tar exited with code 1. Ignoring
2022-01-20T15:54:15+0000 INFO Command executed successfully rcon-cli save-on
2022-01-20T15:54:15+0000 INFO sleeping 1d..
@itzg any updates on this?
Sorry, I kept getting distracted away from this one. I'm now wondering if a solution could be as simple as retrying the operation if tar detects a change.
If the save rcon commands are not blocking correctly, a retry will certainly fail.
That's why I included a 5 second delay between retries. Can you please try out the update to see if it helps at least?
...otherwise, please suggest a solution that would accommodate the non-blocking aspect of rcon save.
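For reference, the retry-with-delay behavior visible in the logs below could be sketched like this. This is an illustrative sketch, not the actual backup script: the function name `do_backup` and the variables `MAX_RETRIES`, `RETRY_DELAY`, `SRC`, and `DEST` are assumptions.

```shell
#!/usr/bin/env bash
# Illustrative retry loop around tar; names are assumptions, not the
# actual backup script's. A demo source directory stands in for /data.
SRC=$(mktemp -d)
DEST=$(mktemp -u).tgz
echo demo > "$SRC/level.dat"

MAX_RETRIES=3
RETRY_DELAY=0   # the real script waits 5 seconds between attempts

do_backup() {
  tar czf "$DEST" -C "$SRC" .
}

attempt=1
until do_backup; do
  if (( attempt >= MAX_RETRIES )); then
    echo "WARN Giving up on this round of backup"
    break
  fi
  echo "INFO ...retrying backup in ${RETRY_DELAY} seconds"
  sleep "$RETRY_DELAY"
  (( attempt++ ))
done
```

As the later discussion shows, if the periodic save fires during every attempt, each retry fails the same way, which is why the loop needs a give-up bound.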
I got the same "file changed as we read it". The first backup directly after the container-start worked fine.
2022-07-11T13:25:38+0000 INFO waiting initial delay of 2m...
2022-07-11T13:27:38+0000 INFO waiting for rcon readiness...
2022-07-11T13:27:38+0000 INFO Command executed successfully rcon-cli save-on
2022-07-11T13:27:38+0000 INFO Command executed successfully rcon-cli save-off
2022-07-11T13:27:39+0000 INFO Command executed successfully rcon-cli save-all flush
2022-07-11T13:27:39+0000 INFO Command executed successfully sync
2022-07-11T13:27:39+0000 INFO Backing up content in /data to /backups/world-20220711-132739.tgz
2022-07-11T13:29:34+0000 INFO Command executed successfully rcon-cli save-on
2022-07-11T13:29:34+0000 INFO sleeping 12h...
2022-07-12T01:29:35+0000 INFO Command executed successfully rcon-cli save-off
2022-07-12T01:29:35+0000 INFO Command executed successfully rcon-cli save-all flush
2022-07-12T01:29:35+0000 INFO Command executed successfully sync
2022-07-12T01:29:35+0000 INFO Backing up content in /data to /backups/world-20220712-012935.tgz
tar: ./world: file changed as we read it
2022-07-12T01:31:28+0000 INFO ...retrying backup in 5 seconds
2022-07-12T01:31:33+0000 INFO Backing up content in /data to /backups/world-20220712-012935.tgz
2022-07-12T01:33:33+0000 INFO ...retrying backup in 5 seconds
2022-07-12T01:33:38+0000 INFO Backing up content in /data to /backups/world-20220712-012935.tgz
2022-07-12T01:35:36+0000 WARN Giving up on this round of backup
2022-07-12T01:35:36+0000 INFO Command executed successfully rcon-cli save-on
2022-07-12T01:35:36+0000 INFO sleeping 12h...
Bummer. Something else must be going on then. I'm going to need one of you to debug and fix this since I can't reproduce the particular scenario that causes this.
This started happening to me almost every day. I tried adding very long delays (10s) after every command but it didn't help. I have now wrapped the backup function with:
```shell
tree -apugsD -o "${DEST_DIR}/${BACKUP_NAME}-${ts}-tree-before.log" "${SRC_DIR}/world"
# command tar ...
tree -apugsD -o "${DEST_DIR}/${BACKUP_NAME}-${ts}-tree-after.log" "${SRC_DIR}/world"
```

so hopefully any changes will be visible in the diff between the two files. Anyone else having the issue can try this too, although I've been using a highly customized version of this project, so I'm not sure if all the variables are the same (plus you would need to install `tree` into the container).
Hopefully I will be able to see something in 1-2 days.
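The same before/after snapshot idea can be sketched with `find` instead of `tree` so it runs without extra packages (the commenter's actual commands use `tree -apugsD`; the paths and helper below are illustrative):

```shell
#!/usr/bin/env bash
# Snapshot the world directory before and after tar, then diff the two
# listings. `find` is substituted for the commenter's `tree -apugsD` so
# this sketch has no extra dependencies; paths are illustrative.
SRC_DIR=$(mktemp -d)
DEST_DIR=$(mktemp -d)
mkdir -p "$SRC_DIR/world"
echo data > "$SRC_DIR/world/level.dat"

snapshot() {  # path, size, and mtime for every file under $1
  find "$1" -type f -printf '%p %s %T@\n' | sort
}

snapshot "$SRC_DIR/world" > "$DEST_DIR/tree-before.log"
tar czf "$DEST_DIR/world.tgz" -C "$SRC_DIR" world
snapshot "$SRC_DIR/world" > "$DEST_DIR/tree-after.log"

# Any file written during the tar run shows up in this diff:
if diff "$DEST_DIR/tree-before.log" "$DEST_DIR/tree-after.log" >/dev/null; then
  echo "no changes during backup"
else
  echo "world changed while tar was reading it"
fi
```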
Thanks for the debugging on that @chylex . I'm looking forward to anything you can find/fix.
Not sure I understand what happened, but here's the result.

Looking at the Minecraft sources, the server performs a save every 5 minutes (which seems to match the 5-minute difference between `.dat` and `.dat_old`). During this periodic save, `/save-off` has no effect on the `level.dat` file and player data.

I guess that once my world got large enough that a backup takes 30 seconds, it became very likely for the 5-minute timer to fire during a backup.

Minecraft saves these files by first writing the data into a temporary file, and then deleting and moving (renaming) files in the sequence `dat_old <- dat <- tmp`.

In theory, on Linux file renames are atomic, and file handles stay valid if the file is moved (renamed) or even deleted. If `tar` detects a change in any file that is saved in this way, i.e. there exists a file with the `_old` suffix next to it, I would say there is no risk of a corrupted archive.

Side note: if `level.dat` did get corrupted, Minecraft would try to restore it from the `level.dat_old` file, but it doesn't look like it would do the same for player data files.
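The rotation described above can be demonstrated in shell. On a POSIX filesystem, `mv` within the same filesystem is a `rename()`, which is atomic, so `level.dat` never exists in a half-written state (file contents and names below are illustrative):

```shell
#!/usr/bin/env bash
# Demonstrates the dat_old <- dat <- tmp rotation described above.
# mv (rename) on the same filesystem is atomic: readers see either the
# complete old file or the complete new file, never a partial write.
dir=$(mktemp -d)
echo "old save" > "$dir/level.dat"

# 1. write the new data into a temporary file
echo "new save" > "$dir/level.dat_tmp"
# 2. rotate: dat_old <- dat
mv "$dir/level.dat" "$dir/level.dat_old"
# 3. rotate: dat <- tmp (atomic rename, never partial content)
mv "$dir/level.dat_tmp" "$dir/level.dat"

cat "$dir/level.dat"      # -> new save
cat "$dir/level.dat_old"  # -> old save
```

This is why a backup that races with the rotation can trip tar's change detection while still archiving a self-consistent copy of each file.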
Thanks for that analysis @chylex. That is annoying that `save-off` doesn't truly tell the server to stop all saves; however, if it's a balance of chunk memory management then it makes sense.

What suggestions do you have for going forward? Should the retry logic be removed since it doesn't really help in these scenarios? Should the log message be adjusted?
On my server, I ended up splitting archival and compression into separate stages, so a backup happens in this sequence:

1. `/save-off`
2. `/save-all flush`
3. `tar` into the container
4. `/save-on`
5. `zstd` into the destination folder

Since the file changes `tar` detects are most likely harmless, I just wanted to reduce the chance of seeing the errors, since I get an email every time a `tar` error is in the container logs. Now the archival stage takes about 3 seconds instead of 30; it does temporarily double the world size on the host, since it's no longer being piped directly through compression and into the destination folder on a CIFS share, but in my case this is not an issue.
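The two-stage approach above could look roughly like this. The rcon calls are left as comments since this sketch runs outside a Minecraft container, `gzip` stands in for the commenter's `zstd` so the sketch runs anywhere, and all paths are illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the two-stage backup: a quick uncompressed tar while saving
# is paused, then slow compression after the server resumes. Paths are
# illustrative; gzip stands in for zstd.
SRC=$(mktemp -d); WORK=$(mktemp -d); DEST=$(mktemp -d)
echo data > "$SRC/level.dat"

# rcon-cli save-off
# rcon-cli save-all flush
tar cf "$WORK/world.tar" -C "$SRC" .   # fast: no compression, local disk
# rcon-cli save-on                     # server resumes saving here

# Compression runs while the server is already saving normally again.
gzip -c "$WORK/world.tar" > "$DEST/world.tar.gz"
rm "$WORK/world.tar"
```

The design trade-off is exactly as described: the save-off window shrinks to the duration of the raw `tar`, at the cost of temporarily holding an uncompressed copy of the world on local disk.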
A more complete solution might be to use or create an NBT validation tool to check that the backed-up files are in fact not corrupted, or use `tree` to know exactly which files changed, so you can judge whether the changed files are a "false positive" (`level.dat` or a player data file) or if there's actual corruption happening elsewhere.
> That is annoying that `save-off` doesn't truly tell the server to stop all saves; however, if it's a balance of chunk memory management then it makes sense.

Chunk saving is disabled during `/save-off`; only `/save-all` is able to override it. Only `level.dat` and `playerdata/*.dat` files ignore `/save-off`.
> What suggestions do you have for going forward? Should the retry logic be removed since it doesn't really help in these scenarios? Should the log message be adjusted?

The retry logic can be removed. It looks like `/save-all flush` blocks until everything is actually saved (even though the actual saving happens in a worker thread), so I also removed all arbitrary sleeps and the `sync` command, since neither should be necessary.
I think the most robust (and also most time-consuming) solution would be a validation step after the backup.
Ignoring `tar` errors and using `tree` or a similar command to find which files actually changed, then filtering out `level.dat` and the `playerdata` folder to avoid these "false positives" and logging the rest, could be enough. I'm not sure how others monitor their containers; my solution is a cron job that checks container logs and sends an email if it finds anything suspicious.
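That filtering step could be sketched as follows, given two before/after listings like the ones produced earlier in the thread. The listing format, the sample file names, and the patterns are all assumptions for illustration:

```shell
#!/usr/bin/env bash
# Given before/after listings of the world directory, report only the
# changed files that are NOT level.dat or player data, i.e. the ones
# that might indicate a real problem. Listing format, sample data, and
# patterns are assumptions for illustration.
before=$(mktemp); after=$(mktemp)
printf '%s\n' 'world/level.dat 100' 'world/region/r.0.0.mca 500' > "$before"
printf '%s\n' 'world/level.dat 120' 'world/region/r.0.0.mca 500' > "$after"

# Lines only in the "after" listing = files that changed during backup.
changed=$(comm -13 <(sort "$before") <(sort "$after") | cut -d' ' -f1)

# Drop the known-harmless paths; whatever remains deserves a log entry.
suspicious=$(printf '%s\n' "$changed" \
  | grep -v -e '/level\.dat$' -e '/playerdata/' || true)

if [ -n "$suspicious" ]; then
  echo "unexpected changes: $suspicious"
else
  echo "only known-harmless files changed"
fi
```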
@chylex I was looking back at this one and was thinking it would be great if you could PR some of the changes you made locally.
Otherwise I have this queued up so it won't fall off my radar again.
Well, my version is based on a roughly year-old version of this repository, with anything I didn't need removed and with changes made specifically for my local environment, so I don't know how easily it could be merged.
I uploaded my current version to gist if you want to look and take anything from it.
That's totally fair. Thanks.
This error happens when backup process takes longer than 5 minutes. My world is 8.5G and backup takes ~6 minutes.
This can be solved in several ways:
```shell
{ tar -cf - . ; tar -cf - . ; } | gzip
```

(This requires the `-i` option when decompressing, or files from the second archive will be skipped.)

IMO 2.2 seems to be the best solution, even though it takes up to two world sizes of storage at worst during compression. I will submit a PR for this in a while.
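The double-archive trick above concatenates two tar streams through a single gzip; GNU tar normally stops reading at the first end-of-archive marker, so listing or extracting the result needs `-i` (`--ignore-zeros`) to see the second pass. A runnable sketch (directories and file names are illustrative):

```shell
#!/usr/bin/env bash
# Two consecutive tar streams piped through one gzip. GNU tar stops at
# the first end-of-archive block unless -i (--ignore-zeros) is given,
# which is why files from the second stream would otherwise be skipped.
src=$(mktemp -d); out=$(mktemp -d)
echo one > "$src/a.txt"

( cd "$src" && { tar -cf - . ; tar -cf - . ; } | gzip ) > "$out/double.tgz"

tar -tzf  "$out/double.tgz" | grep -c 'a.txt'  # first stream only
tar -tzif "$out/double.tgz" | grep -c 'a.txt'  # both streams
```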
As I said, `*.dat` files are saved in a way where "file changed as we read it" can be ignored, assuming your filesystem has atomic renames. On Linux, `tar` will either see the old version or the new version; it should not store some corrupted in-between version if the archival happens in the middle of a `*.dat` file write, so splitting `*.dat` files into separate archives doesn't make much sense to me.
@itzg I think you are closing this ticket too soon. The solution in the MR was not validated by the people who reported this issue.
That is the default workflow of GitHub. Please reopen if you want though.