itzg / docker-mc-backup

Provides a side-car container to backup itzg/minecraft-server world data
https://hub.docker.com/r/itzg/mc-backup
MIT License

file changed as we read it #56

Closed: weihao closed this issue 2 years ago

weihao commented 2 years ago
  backups:
    image: itzg/mc-backup
    environment:
      INITIAL_DELAY: 2m
      BACKUP_INTERVAL: "1d"
      PRUNE_BACKUPS_DAYS: "3"
      RCON_HOST: smp
      SRC_DIR: /data
      DEST_DIR: /backups
    volumes:
      # mount the same volume used by server, but read-only
      - ./orchestration/smp/data:/data:ro
      - ./backups:/backups
2022-01-20T07:26:34+0000 INFO waiting initial delay of 2m...
2022-01-20T07:28:34+0000 INFO waiting for rcon readiness...
2022-01-20T07:28:34+0000 INFO Command executed successfully rcon-cli save-on
2022-01-20T07:28:34+0000 INFO Command executed successfully rcon-cli save-off
2022-01-20T07:28:34+0000 INFO Command executed successfully rcon-cli save-all flush
2022-01-20T07:28:34+0000 INFO Command executed successfully sync
2022-01-20T07:28:34+0000 INFO Backing up content in /data to /backups/world-20220120-072834.tgz
tar: ./world: file changed as we read it
tar: ./world_nether: file changed as we read it
2022-01-20T07:47:42+0000 WARN tar exited with code 1. Ignoring
2022-01-20T07:47:43+0000 INFO Command executed successfully rcon-cli save-on
2022-01-20T07:47:43+0000 INFO sleeping 1d...
itzg commented 2 years ago

I'm now wondering if the save-off, save-all flush, and sync are not blocking. In which case, maybe the script should somehow watch for the world files content to settle before proceeding. Do you have any theories or suggestions?
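
One way that settle check could look, as a rough bash sketch (the function name, polling interval, and reliance on GNU find's -printf are assumptions, not something the script does today):

wait_for_settle() {
  # Poll until two consecutive listings of paths, sizes, and mtimes are identical,
  # i.e. nothing under the source directory changed between polls.
  local src_dir="$1" prev="" cur
  while true; do
    cur="$(find "$src_dir" -type f -printf '%p %s %T@\n' | sha256sum)"
    [ "$cur" = "$prev" ] && return 0
    prev="$cur"
    sleep 5
  done
}

# e.g. call it between "save-all flush" and the tar step
wait_for_settle /data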

weihao commented 2 years ago

I ran it again and the same error happened. Looking at the log, the save-on and save-off commands are all executed at the same time. This will cause the file-changed error if changes are still being written to the directory.

2022-01-20T15:32:32+0000 INFO waiting initial delay of 2m...
2022-01-20T15:34:32+0000 INFO waiting for rcon readiness...
2022-01-20T15:34:32+0000 INFO Command executed successfully rcon-cli save-on
2022-01-20T15:34:32+0000 INFO Command executed successfully rcon-cli save-off
2022-01-20T15:34:33+0000 INFO Command executed successfully rcon-cli save-all flush
2022-01-20T15:34:33+0000 INFO Command executed successfully sync
2022-01-20T15:34:33+0000 INFO Backing up content in /data to /backups/world-20220120-153433.tgz
tar: ./world: file changed as we read it
tar: ./world_the_end: file changed as we read it
2022-01-20T15:54:14+0000 WARN tar exited with code 1. Ignoring
2022-01-20T15:54:15+0000 INFO Command executed successfully rcon-cli save-on
2022-01-20T15:54:15+0000 INFO sleeping 1d...
weihao commented 2 years ago

@itzg any updates on this?

itzg commented 2 years ago

Sorry, I kept getting distracted away from this one. I'm now wondering if a solution could be as simple as retrying the operation if tar detects a change.

weihao commented 2 years ago

If the rcon save commands are not blocking correctly, a retry will certainly fail as well.

itzg commented 2 years ago

That's why I included a 5 second delay between retries. Can you please try out the update to see if it helps at least?
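
For reference, a rough sketch of what that retry loop might look like (variable names and the attempt count are illustrative, not the script's actual code):

attempts=3
for i in $(seq 1 "$attempts"); do
  # tar exits with code 1 when a file changed while it was being read
  if tar czf "${DEST_DIR}/${backup_name}.tgz" -C "${SRC_DIR}" .; then
    break
  fi
  if [ "$i" -lt "$attempts" ]; then
    echo "...retrying backup in 5 seconds"
    sleep 5
  else
    echo "Giving up on this round of backup"
  fi
done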

itzg commented 2 years ago

...otherwise, please suggest a solution that would accommodate the non-blocking aspect of rcon save.

mrmainx commented 2 years ago

I got the same "file changed as we read it" error. The first backup, directly after the container start, worked fine.

2022-07-11T13:25:38+0000 INFO waiting initial delay of 2m...
2022-07-11T13:27:38+0000 INFO waiting for rcon readiness...
2022-07-11T13:27:38+0000 INFO Command executed successfully rcon-cli save-on
2022-07-11T13:27:38+0000 INFO Command executed successfully rcon-cli save-off
2022-07-11T13:27:39+0000 INFO Command executed successfully rcon-cli save-all flush
2022-07-11T13:27:39+0000 INFO Command executed successfully sync
2022-07-11T13:27:39+0000 INFO Backing up content in /data to /backups/world-20220711-132739.tgz
2022-07-11T13:29:34+0000 INFO Command executed successfully rcon-cli save-on
2022-07-11T13:29:34+0000 INFO sleeping 12h...
2022-07-12T01:29:35+0000 INFO Command executed successfully rcon-cli save-off
2022-07-12T01:29:35+0000 INFO Command executed successfully rcon-cli save-all flush
2022-07-12T01:29:35+0000 INFO Command executed successfully sync
2022-07-12T01:29:35+0000 INFO Backing up content in /data to /backups/world-20220712-012935.tgz
tar: ./world: file changed as we read it
2022-07-12T01:31:28+0000 INFO ...retrying backup in 5 seconds
2022-07-12T01:31:33+0000 INFO Backing up content in /data to /backups/world-20220712-012935.tgz
2022-07-12T01:33:33+0000 INFO ...retrying backup in 5 seconds
2022-07-12T01:33:38+0000 INFO Backing up content in /data to /backups/world-20220712-012935.tgz
2022-07-12T01:35:36+0000 WARN Giving up on this round of backup
2022-07-12T01:35:36+0000 INFO Command executed successfully rcon-cli save-on
2022-07-12T01:35:36+0000 INFO sleeping 12h...
itzg commented 2 years ago

Bummer. Something else must be going on then. I'm going to need one of you to debug and fix this since I can't reproduce the particular scenario that causes this.

chylex commented 2 years ago

This started happening to me almost every day. I tried adding very long delays (10s) after every command but it didn't help. I have now wrapped the backup function with:

tree -apugsD -o "${DEST_DIR}/${BACKUP_NAME}-${ts}-tree-before.log" "${SRC_DIR}/world"
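# (tree flags: -a all files, -p permissions, -u/-g owner/group, -s size, -D mtime; -o writes the listing to the given file)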

# command tar ...

tree -apugsD -o "${DEST_DIR}/${BACKUP_NAME}-${ts}-tree-after.log" "${SRC_DIR}/world"

so hopefully any changes will be visible in the diff between the two files. Anyone else having the issue can try this too, although I've been using a highly customized version of this project so I'm not sure if all the variables are the same (+ you would need to install tree into the container).

Hopefully I will be able to see something in 1-2 days.

itzg commented 2 years ago

Thanks for the debugging on that, @chylex. I'm looking forward to anything you can find/fix.

chylex commented 2 years ago

Not sure I understand what happened, but here's the result

[image]

chylex commented 2 years ago

Looking at the Minecraft sources, the server performs a save every 5 minutes (which seems to match the 5 minute difference between .dat and .dat_old). During this periodic save, /save-off has no effect on the level.dat file and player data.

chylex commented 2 years ago

I guess that when my world got large enough that a backup takes 30 seconds, it became very likely for the 5 minute timer to fire during a backup.

Minecraft saves these files by first saving the data into a temporary file, and then deleting and moving (renaming) files in a sequence dat_old <- dat <- tmp.

In theory, on Linux file renames are atomic, and file handles stay valid if the file is moved (renamed) or even deleted. If tar detects a change in any file that is saved in this way, i.e. there exists a file with the _old suffix next to it, I would say there is no risk of having a corrupted archive.
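
In shell terms, the rotation amounts to something like this (the real writes are performed by the server itself; the file names are just examples):

# the server first writes the new data to a temporary file, e.g. level.dat_new
rm -f world/level.dat_old                # drop the oldest copy
mv world/level.dat world/level.dat_old   # the current file becomes the _old fallback
mv world/level.dat_new world/level.dat   # rename is atomic on Linux: a reader sees either the
                                         # old or the new file, never a partially written one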

Side note: If level.dat did get corrupted, Minecraft would try to restore the level.dat_old file, but it doesn't look like it would do the same for player data files.

itzg commented 2 years ago

Thanks for that analysis @chylex . That is annoying that save-off doesn't truly tell the server to stop all saves; however, if it's a balance of chunk memory management then it makes sense.

What suggestions do you have for going forward? Should the retry logic be removed since it doesn't really help in these scenarios? Should the log message be adjusted?

chylex commented 2 years ago

On my server, I ended up splitting archival and compression into separate stages, so a backup happens in this sequence:

  1. rcon /save-off
  2. rcon /save-all flush
  3. archive with tar into the container
  4. rcon /save-on
  5. compress with zstd into the destination folder

Since the file changes tar detects are most likely harmless, I just wanted to reduce the chance of seeing the errors, because I get an email every time a tar error shows up in the container logs. Now the archival stage takes about 3 seconds instead of 30. It does temporarily double the world's disk usage on the host, since the data is no longer piped directly through compression into the destination folder on a CIFS share, but in my case this is not an issue.
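
A rough sketch of that sequence (paths, the rcon-cli invocation, and the timestamp format are illustrative, not the project's actual script):

rcon-cli save-off
rcon-cli save-all flush
tar -cf /tmp/world.tar -C /data .        # fast, uncompressed archive inside the container
rcon-cli save-on                         # saving is only paused for the few seconds tar needs
zstd -T0 /tmp/world.tar -o /backups/world-$(date +%Y%m%d-%H%M%S).tar.zst
rm /tmp/world.tar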

A more complete solution might be to use or create an NBT validation tool to check that the backed-up files are in fact not corrupted, or to use tree to know exactly which files changed, so you can judge whether the changed files are a "false positive" (level.dat or a player data file) or whether actual corruption is happening elsewhere.

chylex commented 2 years ago

> That is annoying that save-off doesn't truly tell the server to stop all saves; however, if it's a balance of chunk memory management then it makes sense.

Chunk saving is disabled during /save-off; only /save-all can override it. Only level.dat and playerdata/*.dat files ignore /save-off.

> What suggestions do you have for going forward? Should the retry logic be removed since it doesn't really help in these scenarios? Should the log message be adjusted?

Retry logic can be removed. It looks like /save-all flush blocks until everything is actually saved (even though the actual saving happens in a worker thread), so I also removed all arbitrary sleeps and the sync command since neither should be necessary.

I think the most robust (and also most time-consuming) solution would be a validation step after the backup.

Ignoring tar errors and using tree or a similar command to find which files actually changed, then perhaps filtering out level.dat and the playerdata folder to avoid these "false positives" and logging the rest, could be enough. I'm not sure how others monitor their containers; my solution is a cron job that checks container logs and sends an email if it finds anything suspicious.
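
Something along these lines could do the filtering (the log file names and the exclusion pattern are assumptions, not existing script variables):

# keep only the differences that are not an expected level.dat / playerdata change
diff "${before_log}" "${after_log}" | grep -Ev 'level\.dat(_old)?|playerdata/' > unexpected.diff
if [ -s unexpected.diff ]; then
  echo "WARN unexpected world changes during backup:"
  cat unexpected.diff
fi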

itzg commented 2 years ago

@chylex I was looking back at this one and was thinking it would be great if you could PR some of the changes you made locally.

Otherwise I have this queued up so it won't fall off my radar again.

chylex commented 2 years ago

Well, my version is based on a version of this repository that is probably a year old, with anything I didn't need removed and with changes made specifically for my local environment, so I don't know how easily it could be merged.

I uploaded my current version to a gist if you want to look and take anything from it.

itzg commented 2 years ago

That's totally fair. Thanks.

Ram042 commented 2 years ago

This error happens when the backup process takes longer than 5 minutes. My world is 8.5 GB and a backup takes ~6 minutes.

This can be solved in several ways:

  1. Ignore exit code 1. The file is still written (3 times because of the retries). The manpage says that exit code 1 means some files differ. Ignoring the error is probably not OK - files might get corrupted - but it is the easiest solution.
  2. Back up *.dat files separately from the other files. This can be done in multiple ways:
     2.1. Into a separate archive - easy to implement, but backup
     2.2. Into the same archive - requires writing an uncompressed tar, as only an uncompressed archive file can be updated.
     2.3. Concatenate archives, then compress (something like { tar -cf - . ; tar -cf - . ; } | gzip). This requires the -i option when decompressing, or files from the second archive will be skipped.

IMO 2.2 seems to be the best solution even though it takes up to two world sizes of storage at worst during compression. I will submit a PR for this in a while.
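
A rough sketch of option 2.2 (paths are examples; it assumes GNU tar, whose -r/--append only works on uncompressed archives):

cd /data
tar -cf /backups/world.tar --exclude='*.dat' .      # bulk of the world, written once, uncompressed
tar -rf /backups/world.tar $(find . -name '*.dat')  # append the volatile *.dat files afterwards
gzip /backups/world.tar                             # compress only after everything is in the archive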

chylex commented 2 years ago

As I said, *.dat files are saved in a way where "file changed as we read it" can be ignored, assuming that your filesystem has atomic renames. On Linux, tar will either see the old version or the new version; it should not store some corrupted in-between version if the archival happens in the middle of a *.dat file write. So splitting *.dat files into separate archives doesn't make much sense to me.

weihao commented 2 years ago

@itzg I think you are closing this ticket too soon. The solution in the MR was not validated by the people who reported this issue.

itzg commented 2 years ago

That is the default workflow of GitHub. Please reopen if you want though.