restic / restic

Fast, secure, efficient backup program
https://restic.net
BSD 2-Clause "Simplified" License

Linux freezing on 0.16.5 - backup #4961

Closed augustobmoura closed 2 months ago

augustobmoura commented 2 months ago

Output of restic version

restic 0.16.5 compiled with go1.22.5 on linux/amd64

What backend/service did you use to store the repository?

rclone using an alias to pcloud

Problem description / Steps to reproduce

Running restic backup from a systemd unit freezes my computer while the snapshot is being created and uploaded. Because the backup is scheduled, it has happened a few times while I was away from the computer. I use Endeavour and recently upgraded restic to the newest distributed version, 0.16.5; the problem was not present on previous versions.

I solved the issue temporarily by setting the unit's nice value to 5 (which didn't help much) and setting the ionice class to BEST_EFFORT (combined with nice=5, it solved the issue). Edit 2: ionice didn't fix it.
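For reference, the nice and ionice settings described above can be approximated from a plain shell. This is a minimal sketch, not the actual unit file: `low_prio_cmd` is a hypothetical helper, and the best-effort priority level `-n 7` is an assumption, since only the class was mentioned. It prints the command rather than running it so the result can be inspected first.

```shell
#!/bin/sh
# Sketch: prefix a command with a lower CPU and I/O priority.
# nice -n 5     -> niceness 5, as set on the unit
# ionice -c 2   -> best-effort I/O scheduling class
# ionice -n 7   -> lowest level within that class (assumed, not from the report)
low_prio_cmd() {
    # Print the resulting command line instead of executing it.
    echo nice -n 5 ionice -c 2 -n 7 "$@"
}

low_prio_cmd restic backup --one-file-system / /home
```

Dropping the `echo` would execute the command directly. As noted in Edit 2, this only reduced the symptoms for a while and did not fix the freeze.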

Edit 1: Adding some more info: my storage device is an NVMe SSD formatted with BTRFS; both / and /home are subvolumes of the same BTRFS filesystem.

The command being run is:

restic backup \
    --one-file-system \
    --verbose \
    --exclude-caches \
    --tag "$RESTIC_BACKUP_TAG" \
    --exclude-file /root/backups/excludes.txt \
    / /home

With the following environment variables:

export RESTIC_PASSWORD_FILE=/root/backups/pass.txt

export RESTIC_REPOSITORY="rclone:backups:" 

export RESTIC_BACKUP_TAG=automatic.backup

export RESTIC_RETENTION_HOURS=1
export RESTIC_RETENTION_DAYS=180
export RESTIC_RETENTION_WEEKS=16
export RESTIC_RETENTION_MONTHS=60
export RESTIC_RETENTION_YEARS=1000

export RESTIC_PROGRESS_FPS=0.1

The configuration for rclone is the following:

[pcloud]
type = pcloud
hostname = api.pcloud.com
token = {"access_token":"<REDACTED>","token_type":"bearer","expiry":"<REDACTED>"}

[backups]
type = alias
remote = pcloud:/backups

Expected behavior

Running backup shouldn't freeze my computer

Actual behavior

About 4 minutes into the run, after scanning has finished, the system gradually starts degrading in performance, and at around 5 minutes the whole system freezes, including the UI.

Do you have any idea what may have caused this?

My CPU is an AMD Ryzen 9 5950X with 32 threads. I suspect that restic issues so many IO requests across all the threads that it starves the rest of the system of access to my NVMe SSD. Lowering GOMAXPROCS might also be an option, but I didn't test it, since ionice solved the issue for now.

Did restic help you today? Did it make you happy in any way?

Restic is really well grounded and very stable. It's been running every day on my personal computer for more than a year without a hitch. Aside from this problem, which might be exclusive to my setup, it's been the perfect solution for backups!

MichaelEischer commented 2 months ago

> I use Endeavour and recently upgraded restic to the newest distributed version, 0.16.5; the problem was not present on previous versions

Which version did you use before?

> About 4 minutes into the run, after scanning has finished, the system gradually starts degrading in performance, and at around 5 minutes the whole system freezes, including the UI

I've so far only ever seen whole-system freezes on Linux when the system runs out of memory. Too much IO or CPU load can lead to slowdowns or stuttering, but shouldn't be able to cause a freeze. By ionice-ing restic, the rest of the system can probably spend more time on swapping stuff in again.

How much RAM does your system have? Please run restic stats --mode debug to get some information on the size of your repository.

augustobmoura commented 2 months ago

> Which version did you use before?

Good question. I wasn't keeping track of the version being used, but based on the Arch package history and my usual upgrade habits, I guess 0.16.4.

> I've so far only ever seen whole-system freezes on Linux when the system runs out of memory. Too much IO or CPU load can lead to slowdowns or stuttering, but shouldn't be able to cause a freeze. By ionice-ing restic, the rest of the system can probably spend more time on swapping stuff in again.

> How much RAM does your system have? Please run restic stats --mode debug to get some information on the size of your repository.

I have 64GB of RAM in total. I was also thinking it might be related to RAM, but I ran it manually and kept an eye on htop; it never surpassed 18GB.
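The same check can be scripted instead of watching htop. The following is a rough sketch that assumes a Linux /proc filesystem; `rss_kib` is a hypothetical helper that reads the `VmRSS` field from a `/proc/<pid>/status`-style file, and the loop assumes the process is named exactly `restic`.

```shell
#!/bin/sh
# Print the resident set size (in KiB) recorded in a /proc/<pid>/status-style file.
rss_kib() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# Report the RSS of every running restic process.
# Prints nothing if no restic process is found.
for pid in $(pgrep -x restic 2>/dev/null); do
    echo "restic pid $pid: $(rss_kib "/proc/$pid/status") KiB resident"
done
```

Running this in a loop (for example once per second) during a backup would show whether resident memory climbs shortly before the freeze.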

Output of stats debug:

Collecting size statistics

File Type: key
Count: 1
Total Size: 459 B
Size            Count
---------------------
100 - 999 Byte  1
---------------------

File Type: lock
Count: 1
Total Size: 189 B
Size            Count
---------------------
100 - 999 Byte  1
---------------------

File Type: index
Count: 322
Total Size: 1.214 GiB
Size                    Count
-----------------------------
      1000 - 9999 Byte  7
    10000 - 99999 Byte  18
  100000 - 999999 Byte  83
1000000 - 9999999 Byte  214
-----------------------------

File Type: data
Count: 136476
Total Size: 2.184 TiB
Size                      Count
--------------------------------
        1000 - 9999 Byte  2
      10000 - 99999 Byte  3
    100000 - 999999 Byte  17
  1000000 - 9999999 Byte  279
10000000 - 99999999 Byte  136175
--------------------------------

Blob Type: data
Count: 9623831
Total Size: 2.164 TiB
Size                    Count
-------------------------------
          10 - 99 Byte  156819
        100 - 999 Byte  2740067
      1000 - 9999 Byte  3053271
    10000 - 99999 Byte  1181566
  100000 - 999999 Byte  1686331
1000000 - 9999999 Byte  805777
-------------------------------

Blob Type: tree
Count: 18032482
Total Size: 13.105 GiB
Size                    Count
--------------------------------
          10 - 99 Byte  1
        100 - 999 Byte  16547537
      1000 - 9999 Byte  1399250
    10000 - 99999 Byte  80849
  100000 - 999999 Byte  4301
1000000 - 9999999 Byte  544
--------------------------------

augustobmoura commented 2 months ago

So, an update on this. The scheduled systemd unit ran as normal at 00:00 while I was playing a game on Steam, and once again the computer froze completely, even with ionice applied.

I rebooted my system, added GOMAXPROCS=4 to my script, and ran it again with only my browser open, and still had the issue.
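For anyone who wants to repeat this experiment, a minimal sketch of limiting the Go runtime for a single command is shown below. `GOMAXPROCS` is the standard Go runtime environment variable; `with_gomaxprocs` is a hypothetical wrapper name, and the demo line only echoes the variable so it runs without restic installed.

```shell
#!/bin/sh
# Run a command with GOMAXPROCS set to the given value.
# Usage: with_gomaxprocs N command [args...]
with_gomaxprocs() {
    n=$1
    shift
    env GOMAXPROCS="$n" "$@"
}

# Demo: show that the child process sees the limit.
with_gomaxprocs 4 sh -c 'echo "GOMAXPROCS=$GOMAXPROCS"'
```

In the backup script this would become something like `with_gomaxprocs 4 restic backup ...`. As described above, the limit did not prevent the freeze.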

At this point I have a few hypotheses for the bug:

During my experiments, I booted the LTS kernel, and lo and behold, the backup finished without a hiccup. Now I'm pretty sure that something is wrong in the mainline kernel release, and that it somehow interacts badly with all the IO restic does.

I will be testing different versions of restic tomorrow, and will also spend a few days on linux-lts. I'm still not sure that restic has no part in this, so maybe it is better to keep this issue open until I find the true culprit. Another thing I will do is report this, or at least raise the flag on it, as a kernel regression. I'm not well versed in bisecting kernel problems, so maybe someone else can help me.

kzdixon commented 2 months ago

> • this is a kernel regression, on my Endeavour setup I use the latest kernel release by default, so it might have happened that something changed on CPU governance or some other driver, and now every time I start backing up, the system goes into a crazy state

I think 6.10 definitely has a regression. Ever since upgrading to it from 6.9 I've been noticing freezes during certain compilations on Gentoo. Restic is also causing this for me, but I don't think it's a Restic-only issue, since emerging the previous version of Restic also caused my system to totally lock up when xz was running.

I've made an attempt at starting a report on the bug here: https://bugzilla.kernel.org/show_bug.cgi?id=219121, for reference.

augustobmoura commented 2 months ago

I downgraded the kernel to 6.9 and it is also working fine so far. I will try bisecting it over the weekend.
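Until the regression is pinned down, a backup script could at least warn when it is started on a kernel from the suspect series. This is only a sketch based on the reports in this thread: `kernel_is_suspect` is a hypothetical helper, and treating every 6.10.x release as affected is an assumption, since nobody here has identified the exact commit or fixed release.

```shell
#!/bin/sh
# Return success if the given kernel release string belongs to the 6.10 series.
# Assumption: all of 6.10.x is treated as suspect; 6.9 and LTS kernels are not.
kernel_is_suspect() {
    case "$1" in
        6.10|6.10.*|6.10-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn (on stderr) when the running kernel is in the suspect series.
if kernel_is_suspect "$(uname -r)"; then
    echo "warning: kernel $(uname -r) is in the 6.10 series, where backups were reported to freeze the system" >&2
fi
```

The check is a plain string match on `uname -r`, so distribution suffixes such as `-arch1-1` do not affect it.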

Aloxaf commented 2 months ago

I have the same problem. The system freezes even if I run restic with --dry-run.

BTW, I also use btrfs + zstd.

MichaelEischer commented 2 months ago

> Output of stats debug:

That repository size shouldn't be a problem for the specs of your system. Your repository does contain an insane number of tree blobs for some reason, but that can't cause full-system hiccups either.

> I use Endeavour and recently upgraded restic to the newest distributed version, 0.16.5; the problem was not present on previous versions

restic 0.16.5 has been the tiniest release ever in terms of changes: https://github.com/restic/restic/releases/tag/v0.16.5

augustobmoura commented 2 months ago

Yeah, I'm pretty sure the problem is in the kernel; the scheduled backup runs fine on 6.9.

MichaelEischer commented 2 months ago

I'll close this issue for now, as the bugzilla report seems to be nearing a conclusion.