tasket / wyng-backup

Fast backups for logical volumes & disk images
GNU General Public License v3.0
251 stars 16 forks

Why does wyng actually do a btrfs subvolume snapshot of /var/lib/qubes? #215

Closed tlaurion closed 2 months ago

tlaurion commented 2 months ago

/usr/sbin/btrfs subvolume snapshot -r /var/lib/qubes/ /var/lib/qubes/wyng_snapshot_tmp/

is slowing the system down to turtle speed during that operation, even freezing windows until it completes, even though I deactivated all qubes revert snapshots, leaving wyng as the only thing doing snapshots (reflink snapshots; the reason for the btrfs subvolume snapshot's existence is unknown to me, as is why end users are required to create it with a script). It seems our usage is diverging: my extensive use of reflink clones/derivatives hides the impact from you, while it hurts my use case pretty badly.

Most system resources are stuck in iowait waiting for disk ops to complete while this subvolume is created, while the disk images are in use (most of them reflinking to other disk images, because of clones+CoW/specialized clones per qusal usage). This sometimes spikes to 100% on all CPUs (500% in the screenshot, but with 12 exposed cores it sometimes reaches 1200% iowait as seen under htop, shown in grey in the CPU summary when "Setup->Display Options->Detailed CPU time" and "Setup->Meters->Disk IO" are selected for the second column), with no clue how to tweak or diagnose things further since it is hidden in kernel IO AFAIK: 2024-08-26-095427
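For anyone trying to reproduce: a minimal sketch of how I watch where the iowait goes while the snapshot runs (assumes the sysstat package is available in dom0; flags may vary by version):

```
# per-device utilization and queue depth, refreshed every 2s
iostat -xz 2

# per-process disk read/write activity, to spot which task blocks on IO
pidstat -d 2
```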

As said in other tickets trying to optimize things, my use case:

My question here comes to light since wyng's btrfs usage is to do reflink cp calls for all templates/qubes to do its ops. Why wyng does a subvolume snapshot is still unknown to me as of today, and this seems problematic for reflink altogether: IOWAIT spikes and puts the system in a really weird state that I have no clue how to troubleshoot.

As per prior recommendations and findings:

Insights @tasket?

tlaurion commented 2 months ago

@Searchanik what is this archive?

tlaurion commented 2 months ago

> I don't know how it get there. A guy commented when I added an issue in a repository

Sounds like your github creds have been hacked. Please hide previous comments and fix your github account. This archive seems like malware to me @Searchanik :/

@tasket you might want to hide irrelevant comments here, including this one and every other outside of OP

tlaurion commented 2 months ago

The _btrfs_subvol_snapshot function in the wyng script (called from _get_btrfs_generation, which is used under metadata_lock) does create a BTRFS subvolume snapshot (btrfs subvolume snapshot). Why?

Considering the overall design of the script, which primarily uses cp --reflink for efficient backups, the creation of a BTRFS snapshot seems redundant and potentially inefficient.

Given these points, it appears that the BTRFS subvolume snapshot creation might be an artifact of earlier development, a feature that wasn't fully integrated into the current backup strategy, or something to accommodate btrfs send ops (which users could do as needed on their own).

It's not clear from the code why the btrfs subvolume snapshot approach is kept and maintained alongside the more versatile cp --reflink method. Does my reasoning make sense @tasket?

tasket commented 2 months ago

A subvol snapshot is created to isolate the local data set from an otherwise busy filesystem, and to provide a target for checking transaction ids. This is a critical step in getting correct delta info on Btrfs (it's not required on XFS because of XFS's relative simplicity); without the transaction ids from the subvol snapshots, data corruption could occur in the archive volumes because Wyng would be sending the wrong data blocks.
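To illustrate the idea (this is not necessarily Wyng's exact code path, and the generation number below is hypothetical), a read-only snapshot gives you a stable generation to compare against:

```
# read the current generation (transaction id) of the snapshot
sudo btrfs subvolume show /var/lib/qubes/wyng_snapshot_tmp | grep -i generation

# list files/extents changed since a previously recorded generation
sudo btrfs subvolume find-new /var/lib/qubes 358000
```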

One puzzle here is that, unlike reflinks, subvol snapshots are not supposed to be metadata-intensive at all... not until data is written to one of the related subvols, causing them to diverge. It may be that creating the subvol triggers some kind of cache flush or other housekeeping related to the reflink ops that were done just prior. I know that there is a new default mode in Btrfs that attempts to pre-process subvol metadata in such a way as to accelerate the mounting of filesystems that bear a large number of subvolumes.

Recommendations:

  1. Check that your Btrfs filesystem has plenty of free space. You shouldn't have to worry about metadata space specifically, but total available space does affect metadata allocation, similar to when NTFS appears to grind a system nearly to a halt because fragmentation and limited space create a double whammy. Also, the ssd_spread mount option behavior should fall back to ssd when free space is low.

  2. Set default 'revisions_to_keep' to 0 on your data pool using the qvm-pool command, and set all of the existing private volumes to 0 using qvm-volume set vmname:private revisions_to_keep 0. You might also have to manually clean up (delete) the extra Qubes snapshots left behind in the pool directory; this would involve shutting down all VMs then removing any derivatives of "private". Qubes currently creates a lot of snapshots by default; setting this to zero means only one extra snapshot is used when the VM is running and zero extra when the VM is shut down. (Items 2 and 3 are sketched as commands after this list.)

  3. Use btrfs filesystem defragment -fr -t 256K on the pool path on a regular basis, possibly before each backup. And remember that deduplicating Btrfs img files can have the opposite effect and greatly intensify slowdowns like what you're experiencing.

  4. Try backing up only a few or a single volume(s) at a time to see if that makes a difference. It won't reduce the scope of the subvolume snapshot but it will reduce the number of '.wyng2' snapshots that exist at any given moment. Stopping most VMs before backup will also reduce the number of snapshots in the system.

  5. Btrfs metadata profiles may have some effect, but for the record my Btrfs filesystems are using either DUP or RAID1 profiles. IDK how risky it is to switch to SINGLE mode, but a Red Hat guide says that SINGLE is actually preferred on SSDs. Similarly, the metadata nodesize (mkfs -n) may be a factor.
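Items 2 and 3 combined, as a hedged dom0 sketch (the pool name varlibqubes is an example, and the qvm-pool 'set' syntax is an assumption, so verify it against your installed version):

```
# item 2: stop Qubes from keeping revert snapshots, pool-wide and per volume
qvm-pool set varlibqubes -o revisions_to_keep=0    # assumed syntax; check qvm-pool --help
for vm in $(qvm-ls --raw-list); do
    qvm-volume set "$vm:private" revisions_to_keep 0 2>/dev/null
done

# item 3: periodic defragment of the pool path with a 256K target extent size
sudo btrfs filesystem defragment -fr -t 256K /var/lib/qubes
```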

Number 3 hints that some automatic mitigation may be possible in Wyng by processing the volume list in small batches. It may also be possible to reduce the scope of each subvol snapshot so it only includes the volume(s) in a specific batch. However, I think the largest footprints by far are probably the Qubes extra snapshots and fragmentation.

Template variation / fragmentation being a large factor seems doubtful to me, unless the template root volumes are large; most are in the 10-20GB range which I don't regard as large.

tasket commented 2 months ago

Note to self:

Creating many smaller subvol snapshots may not be an option, as there is no technique for isolating individual vols (img files) from a large directory of vols to create subvol snapshots just for isolated vols.

Creating hard links to the target vols using existing Wyng snapshot naming could probably work: a subvol snapshot would then be created as normal, except the system overall is bearing one less reflink snapshot for each target vol. This may result in a significant reduction in the metadata usage high-water mark during delta acquisition; this should be at least as effective as acquiring deltas in small batches, and needs no special handling of paths in the remainder of the delta processing code. (The downside is that the subvol snapshot must be retained through the entire send/monitor procedure, and deleted only when it's complete.)

tasket commented 2 months ago

  1. If you are not using wyng monitor and your backups aren't very frequent, then there is a distinct possibility that running monitor frequently could help, as it was made to keep deltas between snapshots low. For example: if you encounter the system slowdown when you back up every 2 days, you could try running monitor twice per day or hourly. The resulting snapshot rotation should reduce the overall metadata burden at backup time (a cron sketch follows).
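A hedged example of what running monitor frequently could look like as a dom0 cron entry (the schedule is arbitrary, and wyng may prompt for authorization depending on your configuration, so adjust to taste):

```
# /etc/cron.d/wyng-monitor (dom0): rotate snapshots hourly between backups
0 * * * *  root  /usr/sbin/wyng monitor
```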

tlaurion commented 2 months ago

> A subvol snapshot is created to isolate the local data set from an otherwise busy filesystem, and to provide a target for checking transaction ids. This is a critical step in getting correct delta info on Btrfs (it's not required on XFS because of XFS's relative simplicity); without the transaction ids from the subvol snapshots, data corruption could occur in the archive volumes because Wyng would be sending the wrong data blocks.

Wyng taking disk image snapshots, and qubes still doing snapshots and rotation of disk images, should take care of this without needing btrfs subvolume snapshots, no?

> One puzzle here is that, unlike reflinks, subvol snapshots are not supposed to be metadata-intensive at all... not until data is written to one of the related subvols, causing them to diverge. It may be that creating the subvol triggers some kind of cache flush or other housekeeping related to the reflink ops that were done just prior. I know that there is a new default mode in Btrfs that attempts to pre-process subvol metadata in such a way as to accelerate the mounting of filesystems that bear a large number of subvolumes.
>
> Recommendations:
>
> 1. Check that your Btrfs filesystem has plenty of free space.  You shouldn't have to worry about metadata space specifically, but total available space does affect metadata allocation, similar to when NTFS appears to grind a system nearly to a halt because fragmentation and limited space create a double whammy.  Also, the `ssd_spread` mount option behavior should fall back to `ssd` when free space is low.
$ mount | grep btrfs
/dev/mapper/luks-09f96e74-eb7e-4058-b3f4-e4f0ca8492fd on / type btrfs (rw,relatime,ssd_spread,discard,space_cache=v2,subvolid=256,subvol=/root)
> 2. Set default 'revisions_to_keep' to 0 on your data pool using `qvm-pool` command, and set all of the existing private volumes to 0 using `qvm-volume set vmname:private revisions_to_keep 0`.  You might also have to manually clean up (delete) the extra Qubes snapshots left behind in the pool directory; this would involve shutting down all VMs then removing any derivatives of "private".  Qubes default currently creates a _lot_ of snapshots; setting this to zero means only one extra snapshot is used when the VM is running and zero extra when the VM is shut down.
$ qvm-pool -i varlibqubes
name                varlibqubes
dir_path            /var/lib/qubes
driver              file-reflink
ephemeral_volatile  False
revisions_to_keep   0
size                1973610872832
usage               608703774720

No problem with free space.

> 3. Use `btrfs filesystem defragment -fr -t 256K` on the pool path on a regular basis, possibly before each backup.  And remember that deduplicating Btrfs img files can have the opposite effect and greatly intensify slowdowns like what you're experiencing.

Ha. Did this once and forgot about it. But again, as reminded here, defrag is incompatible with deduplication; dedup isn't used on the installation I'm typing from, so it cannot be the source of the issue outside of clones+specialized templates/qubes from reflinked disk images, or bad config.

> 4. Try backing up only a few or a single volume(s) at a time to see if that makes a difference.  It won't reduce the scope of the subvolume snapshot but it will reduce the number of '.wyng2' snapshots that exist at any given moment.  Stopping most VMs before backup will also reduce the number of snapshots in the system.

Will do and report back.

> 5. Btrfs metadata profiles may have some effect, but for the record my Btrfs filesystems are using either DUP or RAID1 profiles.  IDK how risky it is to switch to SINGLE mode, but a Red Hat guide says that SINGLE is actually [preferred](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/storage_administration_guide/btrfs-ssd-optimization#btrfs-ssd-optimization) on SSDs.  Similarly, the metadata _nodesize_ (mkfs -n) may be a factor.

That was already switched from RAID1 (the Q4.2 default) to SINGLE. Only DATA and SYSTEM are DUP, which is considered good on SSD drives. The question is how to enforce proper defaults in Q4.3, if BTRFS performance is comparable to or better than lvm2. As of now, that is unfortunately not the case for my use case (nvme drive on an nv41 laptop with 12 cores, qusal deployed, so cloned+specialized reflinked disk images, with QubesOS still rotating disk images while wyng backs up a state prior to the live state; so I still do not understand why the btrfs subvolume snapshot is required if wyng takes a snapshot of the disk image and compares the old snapshot with the new snapshot to send the backup delta anyway).

> Number 3 hints that some automatic mitigation may be possible in Wyng by processing the volume list in small batches. It may also be possible to reduce the scope of each subvol snapshot so it only includes the volume(s) in a specific batch. However, I think the largest footprints by far are probably the Qubes extra snapshots and fragmentation.
>
> Template variation / fragmentation being a large factor seems doubtful to me, unless the template root volumes are large; most are in the 10-20GB range which I don't regard as large.

I have some qubes that are pretty heavy (250GB+), but the slowdowns in wyng are observed in the btrfs subvolume snapshot and basic templates, not even the specialized derivatives. More testing is needed, with observation points. Once again, it is hard to troubleshoot without more verbosity in the pre-backup phase, where wyng only says that it is preparing for backup without detailing the time consumed in each phase. In my opinion, output on what is happening there, or some insight for iowait analysis, would greatly help to see what is happening under the hood and understand why the slowdowns occur.


@tasket I will present my different PoCs (rpi5+softraid wyng storage; veeble: to be reused/expanded to offer disk states as a firmware service, which I poked you about before for a joint grant application at https://github.com/QubesOS/qubes-issues/issues/858#issuecomment-2094344093) at the QubesOS mini-summit. If you can free yourself in the presentation time window, it would be awesome to have you there (ideally at least for the Q&A on Sept 21 2024; details can be seen under https://cfp.3mdeb.com/qubes-os-summit-2024/talk/J8TQEU/ )

tasket commented 2 months ago

@tlaurion I added volume/subvol debug mode output to the update in the 08wip branch. I'll take a look at the calendar, but I should be free for at least Sept 21.

tlaurion commented 2 months ago

@tasket this debug output will be pretty sensitive.

Have you fixed your Element setup so I can share logs there after next run of tests with your latest commit?

@tasket Here is a gpg-encrypted+compressed tar.gz (so GitHub accepts the attachment) of the log (txt), encrypted for the Proton Mail recipient linked to your public key in this repo: wyng_debug_with_ts.tar.gz. Edit: this file has the complete list of running qubes prior to the backup send, the list of all qubes disk images under /var/lib/qubes, and the log output piped through ts (wyng blah | ts).

My public key can be downloaded through:


Some analysis, and the script used to produce it.

Script for analyzing the wyng output ( wyng blah | ts > log.txt 2>&1 ):

import datetime
import heapq
import sys
import os

def parse_timestamp(timestamp):
    return datetime.datetime.strptime(timestamp, "%b %d %H:%M:%S")

def process_log_file(file_path, top_n=10):
    time_differences = []
    lines = []
    timestamps = []

    with open(file_path, 'r') as file:
        for line_number, line in enumerate(file, 1):
            lines.append((line_number, line.strip()))
            timestamp_str = line[:15]

            try:
                timestamp = parse_timestamp(timestamp_str)
                timestamps.append((line_number, timestamp))
            except ValueError:
                continue  # Skip lines with invalid timestamps

    for i in range(1, len(timestamps)):
        prev_num, prev_time = timestamps[i-1]
        curr_num, curr_time = timestamps[i]
        time_diff = curr_time - prev_time
        heapq.heappush(time_differences, (time_diff.total_seconds(), prev_num, curr_num))

    return heapq.nlargest(top_n, time_differences), lines

def print_usage():
    print("Usage: python script_name.py <log_file_path> [number_of_results]")
    print("\nThis script analyzes a log file with timestamps and identifies the largest time gaps between consecutive entries.")
    print("It displays these time differences sorted from largest to smallest, showing two lines before and two lines after each gap.")
    print("\nArguments:")
    print("  <log_file_path>     : Path to the log file to analyze")
    print("  [number_of_results] : (Optional) Number of top time differences to display (default: 10)")
    print("\nEach line in the log file should start with a timestamp in the format: 'Aug 30 14:45:30'")

def main():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        print_usage()
        sys.exit(1)

    file_path = sys.argv[1]

    if not os.path.exists(file_path):
        print(f"Error: The file '{file_path}' does not exist.")
        print_usage()
        sys.exit(1)

    top_n = 10  # Default value
    if len(sys.argv) == 3:
        try:
            top_n = int(sys.argv[2])
        except ValueError:
            print("Error: The number of results must be an integer.")
            print_usage()
            sys.exit(1)

    results, lines = process_log_file(file_path, top_n)

    print(f"\nTop {top_n} largest time differences in '{file_path}':")
    print("-----------------------------------")
    for i, (diff_seconds, prev_line_num, curr_line_num) in enumerate(results, 1):
        print(f"{i}. Time difference: {diff_seconds:.2f} seconds")
        print(f"   Between lines {prev_line_num} and {curr_line_num}")
        print("   Context:")

        start = max(0, prev_line_num - 3)
        end = min(len(lines), curr_line_num + 2)

        for j in range(start, end):
            line_num, line_content = lines[j]
            prefix = "   " if j not in (prev_line_num - 1, curr_line_num - 1) else ">> "
            print(f"{prefix}Line {line_num}: {line_content}")

        print()

if __name__ == "__main__":
    main()

Output of the script on the provided logs (some output redacted):

$ python log_analyser.py wyng_debug_with_ts.txt
Top 10 largest time differences in 'wyng_debug_with_ts.txt':
-----------------------------------
1. Time difference: 450.00 seconds
   Between lines 17491 and 17492
   Context:
   Line 17489: Aug 30 14:46:17   Queuing full scan of import 'wyng-qubes-metadata'
   Line 17490: Aug 30 14:46:17 Acquiring deltas.
>> Line 17491: Aug 30 14:46:17 Subvol snapshot: '/var/lib/qubes/wyng_snapshot_tmp/' False del=False
>> Line 17492: Aug 30 14:53:47 *generation ['ID 260 gen 358162 top level 259 path root/var/lib/qubes/wyng_snapshot_tmp']
   Line 17493: Aug 30 14:53:47 Created subvol (sec): 450.8951686869999
   Line 17494: Aug 30 14:53:47 Get rlnk deltas for appvms/XXX/private.img

2. Time difference: 294.00 seconds
   Between lines 17690 and 17691
   Context:
   Line 17688: no change:56:29
   Line 17689: 24.2MB | 61%['/usr/bin/qvm-run', '--no-color-stderr', '--no-color-output', '-p', 'XXX', '/usr/bin/ssh -x -T -e none -o ControlPath=~/.ssh/ctrl-%C -o ControlMaster=auto -o ControlPersist=120 -o ServerAliveInterval=60 -o ConnectTimeout=60 -o Compression=no ssh://XXX@XXX "$(/usr/bin/cat /tmp/wyngrpc/tmpj1rdkzsl)"']
>> Line 17690: Aug 30 14:56:34 Rotate Snapshots appvms/heads-tests-deb12-nix/private.img True /var/lib/wyng/a_b5326b6fc11bcad3d7f6eb9dd102d9503c547251/Vol_4e1b26/S_20240830-144528_85.deltamap ['--addtag=S_20240830-144528']
>> Line 17691: Aug 30 15:01:28 *rename /var/lib/qubes/appvms/heads-tests-deb12-nix/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_4e1b26.wyng2 /var/lib/qubes/appvms/heads-tests-deb12-nix/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_4e1b26.wyng1
   Line 17692: 24.3MB |4m58s
   Line 17693: Aug 30 15:01:28  (reduced 1.1MB)

3. Time difference: 53.00 seconds
   Between lines 17683 and 17684
   Context:
   Line 17681: no change:55:35
   Line 17682: ['/usr/bin/qvm-run', '--no-color-stderr', '--no-color-output', '-p', 'XXX', '/usr/bin/ssh -x -T -e none -o ControlPath=~/.ssh/ctrl-%C -o ControlMaster=auto -o ControlPersist=120 -o ServerAliveInterval=60 -o ConnectTimeout=60 -o Compression=no ssh://XXX@XXX "$(/usr/bin/cat /tmp/wyngrpc/tmp9iexc9gc)"']
>> Line 17683: Aug 30 14:55:36 Rotate Snapshots appvms/heads-tests-d12-nix_cryptsetup/private.img True /var/lib/wyng/a_b5326b6fc11bcad3d7f6eb9dd102d9503c547251/Vol_824550/S_20240830-144528_81.deltamap ['--addtag=S_20240830-144528']
>> Line 17684: Aug 30 14:56:29 *rename /var/lib/qubes/appvms/heads-tests-d12-nix_cryptsetup/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_824550.wyng2 /var/lib/qubes/appvms/heads-tests-d12-nix_cryptsetup/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_824550.wyng1
   Line 17685: 0.0MB | 54s
   Line 17686: Aug 30 14:56:29   Memory use: Max 1129MB

4. Time difference: 36.00 seconds
   Between lines 17557 and 17559
   Context:
   Line 17555: Aug 30 14:54:21
   Line 17556: Aug 30 14:54:21 Sending backup session 20240830-144528:
>> Line 17557: Aug 30 14:54:21 —————————————————————————————————————————————————————————————————————————
   Line 17558: I/g 30 14:54:57           |     | wyng-qubes-metadata
>> Line 17559: Aug 30 14:54:57 Indexed in 34.9 seconds.
   Line 17560: Aug 30 14:54:57
   Line 17561: Aug 30 14:54:57 Memory use: Max 1119MB, index count: 5291147

5. Time difference: 17.00 seconds
   Between lines 17976 and 17978
   Context:
   Line 17974: Aug 30 15:02:51   Memory use: Max 1129MB
   Line 17975: Aug 30 15:02:51   Current: vsize 512, rsize 267
>> Line 17976: Aug 30 15:02:51
   Line 17977: 11.6MB | 80%['/usr/bin/qvm-run', '--no-color-stderr', '--no-color-output', '-p', 'XXX', '/usr/bin/ssh -x -T -e none -o ControlPath=~/.ssh/ctrl-%C -o ControlMaster=auto -o ControlPersist=120 -o ServerAliveInterval=60 -o ConnectTimeout=60 -o Compression=no ssh://XXX@XXX "$(/usr/bin/cat /tmp/wyngrpc/tmp516m9krp)"']
>> Line 17978: Aug 30 15:03:08 Rotate Snapshots vm-templates/whonix-workstation-17/root.img True /var/lib/wyng/a_b5326b6fc11bcad3d7f6eb9dd102d9503c547251/Vol_490ecd/S_20240830-144528_78.deltamap ['--addtag=S_20240830-144528']
   Line 17979: Aug 30 15:03:09 *rename /var/lib/qubes/vm-templates/whonix-workstation-17/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_490ecd.wyng2 /var/lib/qubes/vm-templates/whonix-workstation-17/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_490ecd.wyng1
   Line 17980: 12.6MB | 18s

6. Time difference: 15.00 seconds
   Between lines 17710 and 17713
   Context:
   Line 17708: 0.0MB |  1s
   Line 17709: Aug 30 15:01:35   Memory use: Max 1129MB
>> Line 17710: Aug 30 15:01:35   Current: vsize 512, rsize 265
   Line 17711: no change:01:35
   Line 17712: 85.4MB | 71%['/usr/bin/qvm-run', '--no-color-stderr', '--no-color-output', '-p', 'XXX', '/usr/bin/ssh -x -T -e none -o ControlPath=~/.ssh/ctrl-%C -o ControlMaster=auto -o ControlPersist=120 -o ServerAliveInterval=60 -o ConnectTimeout=60 -o Compression=no ssh://XXX@XXX "$(/usr/bin/cat /tmp/wyngrpc/tmpgw7qgu9v)"']
>> Line 17713: Aug 30 15:01:50 Rotate Snapshots appvms/personal/private.img True /var/lib/wyng/a_b5326b6fc11bcad3d7f6eb9dd102d9503c547251/Vol_1defdd/S_20240830-144528_83.deltamap ['--addtag=S_20240830-144528']
   Line 17714: Aug 30 15:01:54 *rename /var/lib/qubes/appvms/personal/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_1defdd.wyng2 /var/lib/qubes/appvms/personal/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_1defdd.wyng1
   Line 17715: 91.7MB | 19s

7. Time difference: 15.00 seconds
   Between lines 352 and 354
   Context:
   Line 350: [user@dom0 ~]
   Line 351: $ time sudo wyng-util-qubes --dest qubes-ssh://XXX:XXX@XXX/XXX --authmin 1080 --includes -w remap --dedup --include-disposable=on -wtag='sys-cacher,enabled' -wtag='qusal,with sys-audio working' -wtag='daily' backup --debug --verbose -wdebug -wverbose | ts
>> Line 352: Aug 30 14:45:13 wyng-util-qubes v0.9 beta rel 20240531
   Line 353: Enter passphrase:
>> Line 354: Aug 30 14:45:28 ['/usr/sbin/wyng', '--dest=qubes-ssh://XXX:XXX@XXX/XXX', '--authmin=1080', '--remap', '--tag=sys-cacher,enabled', '--tag=qusal,with sys-audio working', '--tag=daily', '--debug', '--verbose', '--json', 'list']
   Line 355: Aug 30 14:45:28
   Line 356: Aug 30 14:45:28 Selected for backup:

8. Time difference: 7.00 seconds
   Between lines 17958 and 17960
   Context:
   Line 17956: Aug 30 15:02:42   Memory use: Max 1129MB
   Line 17957: Aug 30 15:02:42   Current: vsize 512, rsize 266
>> Line 17958: Aug 30 15:02:42
   Line 17959: 40.5MB | 81%['/usr/bin/qvm-run', '--no-color-stderr', '--no-color-output', '-p', 'XXX', '/usr/bin/ssh -x -T -e none -o ControlPath=~/.ssh/ctrl-%C -o ControlMaster=auto -o ControlPersist=120 -o ServerAliveInterval=60 -o ConnectTimeout=60 -o Compression=no ssh://XXX@XXX "$(/usr/bin/cat /tmp/wyngrpc/tmpkox5a2nx)"']
>> Line 17960: Aug 30 15:02:49 Rotate Snapshots vm-templates/whonix-gateway-17/root.img True /var/lib/wyng/a_b5326b6fc11bcad3d7f6eb9dd102d9503c547251/Vol_8f50a1/S_20240830-144528_78.deltamap ['--addtag=S_20240830-144528']
   Line 17961: Aug 30 15:02:50 *rename /var/lib/qubes/vm-templates/whonix-gateway-17/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_8f50a1.wyng2 /var/lib/qubes/vm-templates/whonix-gateway-17/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_8f50a1.wyng1
   Line 17962: 43.3MB |  8s

9. Time difference: 7.00 seconds
   Between lines 17587 and 17590
   Context:
   Line 17585: 0.0MB |  1s
   Line 17586: Aug 30 14:55:00   Memory use: Max 1126MB
>> Line 17587: Aug 30 14:55:00   Current: vsize 512, rsize 288
   Line 17588: no change:55:00
   Line 17589: 44.3MB | 50%['/usr/bin/qvm-run', '--no-color-stderr', '--no-color-output', '-p', 'XXX', '/usr/bin/ssh -x -T -e none -o ControlPath=~/.ssh/ctrl-%C -o ControlMaster=auto -o ControlPersist=120 -o ServerAliveInterval=60 -o ConnectTimeout=60 -o Compression=no ssh://XXX@XXX "$(/usr/bin/cat /tmp/wyngrpc/tmpx8nfw1fs)"']
>> Line 17590: Aug 30 14:55:07 Rotate Snapshots appvms/Insurgo/private.img True /var/lib/wyng/a_b5326b6fc11bcad3d7f6eb9dd102d9503c547251/Vol_4f84be/S_20240830-144528_100.deltamap ['--addtag=S_20240830-144528']
   Line 17591: Aug 30 14:55:08 *rename /var/lib/qubes/appvms/Insurgo/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_4f84be.wyng2 /var/lib/qubes/appvms/Insurgo/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_4f84be.wyng1
   Line 17592: 44.5MB |  9s

10. Time difference: 6.00 seconds
   Between lines 17449 and 17450
   Context:
   Line 17447: Aug 30 14:46:04 Del vol vm-templates/kali-core/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_6e49b6.wyng2
   Line 17448: Aug 30 14:46:05 Create vol vm-templates/kali-core/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_6e49b6.wyng2 from vm-templates/kali-core/root.img
>> Line 17449: Aug 30 14:46:06 Del vol appvms/personal/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_1defdd.wyng2
>> Line 17450: Aug 30 14:46:12 Create vol appvms/personal/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_1defdd.wyng2 from appvms/personal/private.img
   Line 17451: Aug 30 14:46:14 Del vol appvms/qusal/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_3a3b3a.wyng2
   Line 17452: Aug 30 14:46:14 Create vol appvms/qusal/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_3a3b3a.wyng2 from appvms/qusal/private.img

Analysis:

@tasket as said previously, outside of the btrfs subvolume snapshot taking 450 seconds, the rotation of disk images is taking the most time in this top 10. Those disk images are either origins of clones+specialization (reflinked+CoW), or the biggest qubes I have (heads testing qubes which were clones, or clones of derivatives; personal is my biggest disk image, etc.). My original encrypted log in this post shows their sizes. (Note that none of these were running qubes, as also shown by the qvm-ls --running output in the encrypted log.)

Unfortunately, I do not yet know enough about btrfs to know how to report how many blocks are shared between reflinked disk images. Doing a btrfs defrag would resolve the slowness issue, but it would expand disk usage for each disk image (less free space in the pool) while also voiding reflinked blocks, per past experiments, documentation, and the observable increase in consumed disk image space.
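One possible way to get at the shared-blocks question, untested on my side: btrfs filesystem du reports exclusive vs. shared usage per file, e.g.:

```
# Total / Exclusive / Set shared, per disk image
sudo btrfs filesystem du -s /var/lib/qubes/appvms/*/private.img
```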

@tasket thoughts?

EDIT: hidden past posts.

tlaurion commented 2 months ago

Improved code (the prior version pointed to wrong lines and gave somewhat wrong time diffs by picking lines that had no timestamps), if somewhat useful to you @tasket.

I think that using ts for timestamp analysis might be more relevant outside of your code than within it, maybe? Feel free to reuse/modify/trash/upstream:

import datetime
import heapq
import sys
import os
import re
import argparse

def parse_timestamp(line):
    match = re.match(r'(\w{3} \d{2} \d{2}:\d{2}:\d{2})', line)
    if match:
        return datetime.datetime.strptime(match.group(1), "%b %d %H:%M:%S")
    return None

def process_log_file(file_path, top_n=10):
    time_differences = []
    lines_with_timestamps = []
    all_lines = []
    prev_timestamp = None
    prev_line_num = None

    try:
        with open(file_path, 'r') as file:
            for line_number, line in enumerate(file, 1):
                all_lines.append((line_number, line.strip()))
                current_timestamp = parse_timestamp(line)

                if current_timestamp:
                    lines_with_timestamps.append((line_number, line.strip()))
                    if prev_timestamp:
                        time_diff = current_timestamp - prev_timestamp
                        time_differences.append((time_diff.total_seconds(), prev_line_num, line_number))
                    prev_timestamp = current_timestamp
                    prev_line_num = line_number
    except Exception as e:
        print(f"Error processing file: {e}")
        sys.exit(1)

    return heapq.nlargest(top_n, time_differences), all_lines, lines_with_timestamps

def main():
    parser = argparse.ArgumentParser(description="Analyze time differences in log files.")
    parser.add_argument("file_path", help="Path to the log file")
    parser.add_argument("-n", "--num_results", type=int, default=10, 
                        help="Number of top time differences to display (default: 10)")
    parser.add_argument("-v", "--verbose", action="store_true", 
                        help="Display additional information about the analysis")

    args = parser.parse_args()

    if not os.path.exists(args.file_path):
        print(f"Error: The file '{args.file_path}' does not exist.")
        sys.exit(1)

    results, all_lines, lines_with_timestamps = process_log_file(args.file_path, args.num_results)

    print(f"\nTop {args.num_results} largest time differences in '{args.file_path}':")
    print("-----------------------------------")
    for i, (diff_seconds, prev_line_num, curr_line_num) in enumerate(results, 1):
        print(f"{i}. Time difference: {diff_seconds:.2f} seconds")
        print(f"   Between lines {prev_line_num} and {curr_line_num}")
        print("   Context:")

        prev_index = next(index for index, (ln, _) in enumerate(all_lines) if ln == prev_line_num)
        curr_index = next(index for index, (ln, _) in enumerate(all_lines) if ln == curr_line_num)

        start = max(0, prev_index - 2)
        end = min(len(all_lines), curr_index + 3)

        for j in range(start, end):
            line_num, line_content = all_lines[j]
            prefix = ">> " if line_num in (prev_line_num, curr_line_num) else "   "
            print(f"{prefix}Line {line_num}: {line_content}")

        print()

    if args.verbose:
        print(f"Total lines processed: {len(all_lines)}")
        print(f"Lines with timestamps: {len(lines_with_timestamps)}")
        print(f"Total time differences analyzed: {len(results)}")

if __name__ == "__main__":
    main()

tlaurion commented 2 months ago

@tasket by doubling dom0's max mem under /etc/default/grub (dom0_mem=max:8192M), regenerating grub, redoing the wyng backup, and running the above script on the log (redacted), I get the output below (grub change sketched first):
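For reference, the grub side of that change amounts to something like the following (the grub.cfg path depends on BIOS vs. EFI boot, so treat both paths as assumptions):

```
# /etc/default/grub: dom0_mem is a Xen option, so it goes on the Xen line
GRUB_CMDLINE_XEN_DEFAULT="... dom0_mem=max:8192M"

# then regenerate the grub config
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```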

user@heads-tests-deb12-nix:~/QubesIncoming/dom0$ python log_analyser16.py -n 10 wyng_debug_with_ts_doubled_dom0_memory.txt 

Top 10 largest time differences in 'wyng_debug_with_ts_doubled_dom0_memory.txt':
-----------------------------------
1. Time difference: 253.00 seconds
   Between lines 18206 and 18207
   Context:
   Line 18204: Aug 30 17:07:25   Queuing full scan of import 'wyng-qubes-metadata'
   Line 18205: Aug 30 17:07:25 Acquiring deltas.
>> Line 18206: Aug 30 17:07:25 Subvol snapshot: '/var/lib/qubes/wyng_snapshot_tmp/' False del=False
>> Line 18207: Aug 30 17:11:38 *generation ['ID 260 gen 358382 top level 259 path root/var/lib/qubes/wyng_snapshot_tmp']
   Line 18208: Aug 30 17:11:38 Created subvol (sec): 252.61327033200007
   Line 18209: Aug 30 17:11:38 Get rlnk deltas for appvms/XXX/private.img

2. Time difference: 38.00 seconds
   Between lines 18272 and 18274
   Context:
   Line 18270: Aug 30 17:12:02
   Line 18271: Aug 30 17:12:02 Sending backup session 20240830-170648:
>> Line 18272: Aug 30 17:12:02 —————————————————————————————————————————————————————————————————————————
   Line 18273: I/g 30 17:12:40           |     | wyng-qubes-metadata
>> Line 18274: Aug 30 17:12:40 Indexed in 37.1 seconds.
   Line 18275: Aug 30 17:12:40
   Line 18276: Aug 30 17:12:40 Memory use: Max 1119MB, index count: 5298060

3. Time difference: 13.00 seconds
   Between lines 306 and 308
   Context:
   Line 304: [user@dom0 ~]
   Line 305: $ time sudo wyng-util-qubes --dest qubes-ssh://XXX:XXX@XXX/XXX --authmin 1080 --includes -w remap --dedup --include-disposable=on -wtag='sys-cacher,enabled' -wtag='qusal,with sys-audio working' -wtag='daily' backup --debug --verbose -wdebug -wverbose | ts
>> Line 306: Aug 30 17:06:34 wyng-util-qubes v0.9 beta rel 20240531
   Line 307: Enter passphrase:
>> Line 308: Aug 30 17:06:47 ['/usr/sbin/wyng', '--dest=qubes-ssh://XXX:XXX@XXX/XXX', '--authmin=1080', '--remap', '--tag=sys-cacher,enabled', '--tag=qusal,with sys-audio working', '--tag=daily', '--debug', '--verbose', '--json', 'list']
   Line 309: Aug 30 17:06:47
   Line 310: Aug 30 17:06:47 Selected for backup:

4. Time difference: 6.00 seconds
   Between lines 18400 and 18403
   Context:
   Line 18398: 0.0MB |  5s
   Line 18399: Aug 30 17:13:11   Memory use: Max 1126MB
>> Line 18400: Aug 30 17:13:11   Current: vsize 512, rsize 288
   Line 18401: no change:13:11
   Line 18402: 29.1MB | 55%['/usr/bin/qvm-run', '--no-color-stderr', '--no-color-output', '-p', 'XXX', '/usr/bin/ssh -x -T -e none -o ControlPath=~/.ssh/ctrl-%C -o ControlMaster=auto -o ControlPersist=120 -o ServerAliveInterval=60 -o ConnectTimeout=60 -o Compression=no ssh://XXX@XXX "$(/usr/bin/cat /tmp/wyngrpc/tmp5lu_33g5)"']
>> Line 18403: Aug 30 17:13:17 Rotate Snapshots appvms/heads-tests-deb12-nix/private.img True /var/lib/wyng/a_b5326b6fc11bcad3d7f6eb9dd102d9503c547251/Vol_4e1b26/S_20240830-170648_86.deltamap ['--addtag=S_20240830-170648']
   Line 18404: Aug 30 17:13:21 *rename /var/lib/qubes/appvms/heads-tests-deb12-nix/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_4e1b26.wyng2 /var/lib/qubes/appvms/heads-tests-deb12-nix/sne5697b46-4a31-4647-88f8-0fcad4ff9e36_Vol_4e1b26.wyng1
   Line 18405: 29.6MB | 11s

That was done on a fresh reboot, where htop sorts on total CPU time.

While wyng op ongoing: 2024-08-30-170828
After wyng finished (still iowait!): 2024-08-30-171438

Analysis: swap is still touched; most of the time consumed (in still-running processes) is krunner-related, while during the wyng run it was btrfs-transaction and fs/crypt related, as can be seen. Running in dedup send mode, as I do not see why I wouldn't. 12th gen CPU, 64GB RAM. I can spare the resources, but I'm not sure what is happening here.

Again, thoughts welcome. I would try making the btrfs subvolume snapshot a no-op, but I won't venture there until you arrive at the same conclusion, that it is irrelevant under the QubesOS use case as well. I seem to still be missing the point. Logs would talk better alongside your analysis; this is resource tugging where I still do not see the point, unless btrfs send is expected to work alongside wyng.

tlaurion commented 2 months ago

This changed everything.

$ diff /etc/fstab.bak /etc/fstab
12c12
< UUID=a8b2c51e-f325-4647-8fc2-bc7e93f49645         /                       btrfs   subvol=root,x-systemd.device-timeout=0,ssd_spread,space_cache=v2,discard 0 0
---
> UUID=a8b2c51e-f325-4647-8fc2-bc7e93f49645         /                       btrfs   subvol=root,x-systemd.device-timeout=0,noatime,compress=zstd,ssd,space_cache=v2,discard=async,noautodefrag 0 0

Seems like ssd_spread and discard were the issue, as well as defrag (which is the enemy of reflink from what I gather, and the opposite of dedup), causing over-amplification.
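For anyone wanting to test before rebooting, most of these options can be toggled on the live filesystem; a rough sketch (space_cache generally needs a real remount, and boot-time behavior may still differ):

```
# apply the new behavior to the running root fs, then verify
sudo mount -o remount,noatime,compress=zstd,ssd,discard=async /
mount | grep btrfs
```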

Will close for now.

kennethrrosen commented 2 months ago

> This changed everything.
>
> 0,noatime,compress=zstd,ssd,space_cache=v2,discard=async,noautodefrag 0 0

@tlaurion do you have any performance stats from this change? I've been using btrfs defrag between backups but would like to give this a try instead. Thanks for all your testing on this, too.

P.S. Keen to hear your presentation at this month's summit!

tlaurion commented 2 months ago

> > This changed everything.
> >
> > 0,noatime,compress=zstd,ssd,space_cache=v2,discard=async,noautodefrag 0 0
>
> @tlaurion do you have any performance stats from this change? I've been using btrfs defrag between backups but would like to give this a try instead. Thanks for all your testing on this, too.
>
> P.S. Keen to hear your presentation at this month's summit!

Just no more iowait. 2 minutes for the backup send, the btrfs subvolume snapshot doesn't lock, I'm only limited by network speeds, and there is little slowdown when using reflinked clones.

Defrag, AFAIK, voids reflink gains by expanding qubes disk images so they consume more space. I never ran defrag regularly, and I'm not sure why @tasket suggests it.

I now get why the subvolume snapshot is important, following his first answer, but the defrag advice is still a mystery to me. Also, there is no gain in doing a balance unless there are multiple disks in the btrfs volume. Cancelling the QubesOS revert capability also helped performance.

Pretty hard to get proper numbers here to be honest @kennethrrosen !

tasket commented 2 months ago

@tlaurion @kennethrrosen Defrag has two major effects: Placing data blocks adjacent to each other (enhancing sequential reads, at least for HDDs), and reducing the complexity of metadata that tracks the fragments (possibly by orders of magnitude). What I would consider a minor effect is that on a CoW fs more space will be used for data; I've seen usage increase by 3-4GB with 500GB of allocated data, for instance, YMMV.

I can accept that certain conditions might invert the costs/gains relative to Btrfs behaviors. A deduped fs might end up with many fewer logical extents to manage if the volume data is very repetitive but carrying many dispersed, smaller, CoW-updated permutations. Keeping an eye on the relative data-to-metadata allocations on your fs (using btrfs filesystem df) would be one way to get accurate numbers when testing under various conditions. I have also added a monotonic timer to Wyng's debug output when it creates subvol snapshots, which can add to the available numbers. Btrfs tools can also report on the amount of fragmentation and other factors. Of course, actually recreating conditions on multiple machines when the data sets are in the terabyte range is a very challenging proposition.

BTW, dedup is actually compatible with defragging, provided you give it parameters that prevent it from rewriting chunks smaller than defrag's threshold (usually 256KB). It's also possible to try smaller thresholds for both. Note that auto-defrag and auto-dedup can significantly increase overall drive activity and have a write-amplification effect; this is why I advise weekly or monthly use of defrag instead.
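As a concrete pairing of those thresholds (assuming duperemove as the dedup tool, which is an assumption on my part; flag details vary by version):

```
# defragment with a 256K target extent size...
sudo btrfs filesystem defragment -r -t 256K /var/lib/qubes

# ...then dedup in 256k blocks so dedup doesn't recreate sub-256K extents
sudo duperemove -rdh -b 256k /var/lib/qubes
```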

As to what actually made the difference for you, it might be dedup (which would be very much a corner case IMHO) or perhaps switching from ssd_spread to ssd allocation. It seems unlikely that Btrfs was defaulting to autodefrag or a non-v2 space cache setting, but if that was the case they would have a large impact on performance.