psy0rz / zfs_autobackup

ZFS autobackup is used to periodically back up ZFS filesystems to other locations. Easy to use and very reliable.
https://github.com/psy0rz/zfs_autobackup
GNU General Public License v3.0
583 stars, 62 forks

Argument list too long #253

Closed EagTG closed 4 months ago

EagTG commented 4 months ago

Hi there,

Love this script so far. I started implementing it on some of our larger ZFS deployments and ran into an issue. I'm certain it's due to the thousands of datasets in this particular ZFS environment. It seems to relate to the maximum command-line argument length (not a direct issue with zfs-autobackup), so this might be an enhancement request: could zfs-autobackup split the command lines when they get too long?

I would also welcome any other workarounds. I've tried increasing the ulimit settings (suggested for similar issues elsewhere in BASH), but that hasn't fixed this. I'm considering some kludgy alphabetical splits just to get around the issue temporarily.

I also feel that I could work around the issue by not using the --allow-empty parameter, since that would dramatically reduce the number of datasets being snapshotted, but I would like to keep all of my snapshot names consistent across all of the datasets.

The environment is Proxmox 7.4.

Command line I used:

/usr/local/bin/zfs-autobackup -v --debug --no-thinning --clear-mountpoint --allow-empty     \
        --strip-path=1 --snapshot-format=%y.%m.%d-%a-%H.00 --compress            \
        --keep-source=15,1d1w,1w1m,1m1y --keep-target=30,1d2w,1w1m,1m2y          \
        --buffer=128M --ssh-source=[server name redacted] s0_to_s1 pools1

Error Received:

! Exception: [Errno 7] Argument list too long: b'ssh'
Traceback (most recent call last):
  File "/usr/local/bin/zfs-autobackup", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/zfs_autobackup/ZfsAutobackup.py", line 542, in cli
    failed_datasets=ZfsAutobackup(sys.argv[1:], False).run()
  File "/usr/local/lib/python3.9/dist-packages/zfs_autobackup/ZfsAutobackup.py", line 459, in run
    source_node.consistent_snapshot(source_datasets, snapshot_name,
  File "/usr/local/lib/python3.9/dist-packages/zfs_autobackup/ZfsNode.py", line 228, in consistent_snapshot
    self.run(cmd, readonly=False)
  File "/usr/local/lib/python3.9/dist-packages/zfs_autobackup/ExecuteNode.py", line 176, in run
    if not cmd_pipe.execute():
  File "/usr/local/lib/python3.9/dist-packages/zfs_autobackup/CmdPipe.py", line 113, in execute
    selectors = self.__create()
  File "/usr/local/lib/python3.9/dist-packages/zfs_autobackup/CmdPipe.py", line 183, in __create
    item.create(next_stdin)
  File "/usr/local/lib/python3.9/dist-packages/zfs_autobackup/CmdPipe.py", line 64, in create
    self.process = subprocess.Popen(encoded_cmd, env=os.environ, stdout=subprocess.PIPE, stdin=stdin,
  File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.9/subprocess.py", line 1823, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: b'ssh'

The output from zfs-autobackup right before the failure:

  #### Snapshotting
  [Source] Creating snapshots 24.05.09-Thu-18.00 in pool pools0
# [Source] CMD    > (ssh [server name redacted] 'zfs snapshot [... long output redacted ...]

The command it's trying to run (at 'long output redacted') is 165,983 bytes, well over the 128 KiB limit. Unfortunately, there is some proprietary information contained in the ZFS dataset names that I'd rather not share here.

Happy to provide additional information on-request. Thanks in advance!

Edit: Forgot to mention, this appears to relate to the Linux kernel's MAX_ARG_STRLEN limit (not BASH itself). It looks like it's 128 KiB by default.
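For reference, these limits can be inspected directly. On Linux, MAX_ARG_STRLEN is 32 pages, so the exact value depends on the page size; the remote command string that zfs-autobackup passes to ssh is a single argument, which is why this per-argument limit (rather than the larger ARG_MAX total) is the one being hit. A quick sketch, no zfs required:

```shell
# Inspect the kernel's argument-length limits on Linux.
# ARG_MAX caps the total size of argv plus the environment;
# MAX_ARG_STRLEN (32 * PAGE_SIZE, usually 128 KiB) caps any single argument.
getconf ARG_MAX
getconf PAGE_SIZE
echo $(( 32 * $(getconf PAGE_SIZE) ))   # MAX_ARG_STRLEN on this system

# GNU xargs can report the limits it works within:
xargs --show-limits < /dev/null
```

On a typical 4 KiB-page system this prints 131072 for MAX_ARG_STRLEN, matching the failure above once the generated 'zfs snapshot' string exceeds it.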

EagTG commented 4 months ago

Also, confirming that for this particular workload, dropping '--allow-empty' seems to have worked around the issue: not all of our thousands of datasets contain updated data, so the resulting 'zfs snapshot' command is shorter. (At least I'm assuming so; I missed that in the log, and I'm not sure whether it still sends the zfs snapshot all at once or breaks it up.)

psy0rz commented 4 months ago

To make a consistent snapshot we need to call zfs snapshot with all the datasets at once.

One option would be to create the snapshots yourself with zfs snapshot -r for example and then run zfs-autobackup with --no-snapshot

(And perhaps zfs destroy the snapshots you don't want before calling it)

psy0rz commented 4 months ago

Great!

Ps same person, different account?

EagTG commented 4 months ago

Hahah, yes, my mistake.

Posting again from the proper account.

Thanks psy0rz, that seems to work.

I created a BASH script that generates thousands of datasets in a test ZFS environment and was able to replicate the issue. I then modified my process to create the snapshot directly via ZFS first:

#!/bin/bash
DATEFMT=$(date +%y.%m.%d-%a-%H.00)
echo "===> ${DATEFMT}"

/usr/bin/ssh username@[server name redacted] "/usr/sbin/zfs snapshot -r poolt0@${DATEFMT}"

And then run the modified zfs-autobackup command (including the --no-snapshot and --allow-empty parameters):

/usr/local/bin/zfs-autobackup -v --debug --no-thinning --clear-mountpoint --no-snapshot \
    --strip-path=1 --snapshot-format=%y.%m.%d-%a-%H.00 --compress --allow-empty         \
    --keep-source=15,1d1w,1w1m,1m1y --keep-target=30,1d2w,1w1m,1m2y                     \
    --buffer=128M --ssh-source=[server name redacted] t0_to_t1 poolt1

(Naturally, I remove the -v and --debug flags in the cron version.)

This seems to do exactly what I want, thanks for the suggestion!

I eventually want to enable the Thinner as well, and I'll run a few additional tests to confirm that snapshot deletion via the Thinner does what I want. I expect it will, since the snapshot naming from the BASH script is consistent with --snapshot-format.
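One quick sanity check before enabling the Thinner, assuming it keys off the timestamp encoded in the snapshot name, is that date(1) in the cron script and --snapshot-format really produce the same pattern (both use strftime-style fields here; this check is my own sketch, not from the thread):

```shell
# Confirm the name date(1) produces matches the strftime pattern passed to
# zfs-autobackup as --snapshot-format, e.g. 24.05.09-Thu-18.00.
FMT='%y.%m.%d-%a-%H.00'
NAME=$(date +"${FMT}")
echo "snapshot suffix: ${NAME}"
```

If the two ever drift apart (e.g. a locale changing the %a weekday abbreviation), snapshots created by the cron script might not be recognized for thinning.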

Thanks again!