Closed: rokyo249 closed this issue 1 year ago.
FWIW I'm seeing something similar, and I'm not using systemd timers; I have a `*/15` cron job running `/usr/sbin/sanoid --cron`. My template is:
```ini
[template_production]
frequent_period = 15
frequently = 0
hourly = 48
daily = 7
weekly = 4
monthly = 2
yearly = 0
autosnap = yes
autoprune = yes
```
Since `frequently` is set to `0` (like you), I'd expect only hourly snapshots at the top of the hour -- outside of initial startup. Yet I also get seemingly random `*/15` snapshots. I'm assuming settings like `hourly_min = 0` in the default conf file are the defaults, but perhaps not?
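For reference, the `*/15` cron entry described above would look roughly like this (a sketch in system-crontab format; the path is the one given in the comment):

```crontab
# run sanoid's snapshot/prune pass every 15 minutes
*/15 * * * * root /usr/sbin/sanoid --cron
```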
> With `frequent_period = 60` I assumed it would only take one hourly per hour.
This setting doesn't mean what you think it means. `frequent_period` is a user-definable period for those who want automatic snapshots taken more than once per hour. What you've done is essentially tell your system that you want the period for frequent snapshots to be every 60 minutes... which has nothing to do with hourlies at all.
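In other words, the two knobs are independent (a sketch; values are illustrative, not recommendations):

```ini
# frequent_period only matters when "frequently" is non-zero:
[template_subhourly]
frequently = 4        # keep four "frequent" snapshots...
frequent_period = 15  # ...taken every 15 minutes

# with frequently = 0, frequent_period has no effect; hourlies are
# governed by the hourly setting alone:
[template_hourly_only]
frequently = 0
hourly = 48
```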
```
storage-slow/cloud@autosnap_2022-12-31_17:07:39_monthly   575K  -  -  -  -
storage-slow/cloud@autosnap_2022-12-31_17:16:16_monthly   496K  -  -  -  -
storage-slow/cloud@autosnap_2022-12-31_17:23:33_monthly  1.42M  -  -  -  -
```
This is, essentially, a race condition. Your system is struggling to handle the load it's been given, and you've wound up with multiple sanoid processes trying to take that snapshot. Since the first one didn't finish, the next sanoid process can't see that snapshot and attempts to take it again. Eventually, they all complete.
Basically, you need to reduce the load on the system and/or reduce the number of times sanoid is invoked. If you've got many thousands of snapshots on the system, you'll get to the point where a simple `zfs list -t snap` takes several minutes to complete, rather than completing instantly or near-instantly. If this is happening to you, you'd be advised to reduce the number of snapshots you keep (and/or increase the RAM in your system, allowing you to keep more metadata cached). A CACHE vdev with `zfs set secondarycache=metadata` on your pool might also help here, but no guarantees on that.
> Via journalctl, I can see it running every minute and usually it just runs for a few seconds with the output:
"Running every minute" isn't a good idea if Sanoid can't finish generating a list of snapshots more quickly than that. Try dropping your systemd timer to run every 15 minutes, and see if that helps.
Thanks a lot for your answer!
> If you've got many thousands of snapshots on the system, you'll get to the point where a simple `zfs list -t snap` takes several minutes to complete
Yes, that is exactly what is happening: completing that command takes several minutes and produces several thousand lines of output.
> "Running every minute" isn't a good idea if Sanoid can't finish generating a list of snapshots more quickly than that. Try dropping your systemd timer to run every 15 minutes, and see if that helps.
I thought that using systemd timers instead of cron jobs would prevent this, since the timers should not be invoked again if the previous process invoked by that timer did not yet finish, while cron would simply start a new invocation regardless (or so I thought).
I will set `frequent_period = 0` in my `/etc/sanoid/sanoid.conf`, change my `/lib/systemd/system/sanoid.timer` to `OnCalendar=*:0/15`, and see if that works!
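The timer change would look something like this (a sketch; the surrounding `[Unit]`/`[Install]` sections and the `Persistent=` line are assumed boilerplate, so check your distribution's actual unit file):

```ini
# /lib/systemd/system/sanoid.timer (sketch)
[Timer]
OnCalendar=*:0/15
Persistent=true
```

Note that editing the file under `/lib` directly means a package upgrade can overwrite it; a drop-in created with `systemctl edit sanoid.timer` survives upgrades.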
EDIT: Will a `systemctl reload sanoid` suffice to apply the new sanoid.conf, or will I need `systemctl restart sanoid`? I'll probably need a `systemctl daemon-reload`, too, for the timer, right? Can I do all of these if sanoid is possibly taking/pruning snapshots at that moment?
EDIT2: Yes, both commands worked fine! :-)
Your suggestions worked perfectly!
After setting `frequent_period = 0` and the systemd timer to 15 minutes, only hourly snapshots are taken, at exactly the full hour.
Thanks a lot for the help!
I have run into this issue. I have many datasets with multiple very large pools on the same host. It doesn't really make sense to me that users should manually adjust timers and guess how long it might take. Shouldn't sanoid be atomic in nature and take care of multiple running instances rather than generate lots of incorrect/unnecessary snaps? It's also a feedback loop since this makes pruning take even longer.
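Until something like that exists in sanoid itself, overlapping invocations can be prevented at the cron level with `flock(1)`, which makes a second run exit immediately instead of racing the first. This is only a sketch: the lock path and the `run_guarded` wrapper name are my own, not part of sanoid.

```shell
#!/bin/sh
# Guard a command with an exclusive, non-blocking file lock so that
# overlapping cron invocations skip instead of racing each other.
LOCK="${TMPDIR:-/tmp}/sanoid-cron.lock"

run_guarded() {
    # -n: fail immediately if another invocation already holds the lock
    flock -n "$LOCK" -c "$1" || echo "skipped: previous run still active"
}

# In a crontab you would call:  run_guarded "/usr/sbin/sanoid --cron"
run_guarded "echo snapshot pass done"   # prints "snapshot pass done"
```

The lock is held only for the duration of the wrapped command, so a run that starts after the previous one finishes proceeds normally.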
Hi there,
I have a similar issue to this one (https://github.com/jimsalterjrs/sanoid/issues/526) with sanoid creating multiple hourly snapshots in one hour, multiple dailies in a day and multiple monthlies per month. All of them are then labelled "pool/dataset@autosnap_date_time_hourly/daily/monthly" like so:
This leads to several hundred partial monthlies created per month, for example: in December 2022 there are 979 "_monthly" snapshots created and kept. Several dozen partial dailies per day and 1-5 partial hourlies per hour.
While I would have expected the behavior to be:
My `/etc/sanoid/sanoid.conf` is the following:

With `frequent_period = 60` I assumed it would only take one hourly per hour.

My `/lib/systemd/system/sanoid.service` file looks like this:

and is started every minute by the systemd timer in `/lib/systemd/system/sanoid.timer`:

Via journalctl, I can see it running every minute and usually it just runs for a few seconds with the output:
but on some occasions it does take snapshots and then runs for a few minutes (5-20 mins), outputting all the snapshots taken (including the above-mentioned partial hourlies, dailies and monthlies with the current timestamp):
Since it does that at seemingly random times, I assume it takes snapshots whenever something has actually changed in those directories (???) and skips taking snapshots when the data wasn't altered?
Or is this behavior only happening because I did not specify:
like in the default config?