jimsalterjrs / sanoid

These are policy-driven snapshot management and replication tools which use OpenZFS for underlying next-gen storage. (Btrfs support plans are shelved unless and until btrfs becomes reliable.)
http://www.openoid.net/products/
GNU General Public License v3.0

Sanoid - Taking far too frequent snaps #14

Closed. redmop closed this issue 7 years ago

redmop commented 9 years ago

Relevant output of zfs get written:

dpool/data@autosnap_2015-09-26_13:49:03_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:49:03_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:49:03_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:49:03_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:44:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:44:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:44:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:50:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:44:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:50:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:50:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:50:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:45:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:45:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:54:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:45:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:54:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:45:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:54:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:54:01_daily                            written   0        -

Cron line: * * * * * /usr/local/bin/sanoid --cron

/etc/sanoid/sanoid.conf

######################################
# This is a sample sanoid.conf file. #
# It should go in /etc/sanoid.       #
######################################

[dpool/data]
    use_template = data
    recursive = yes

[dpool/backups]
    use_template = backup
    recursive = yes

[dpool/backup]
    use_template = backup
    recursive = yes

[dpool/root]
    use_template = os
    recursive = yes

#############################
# templates below this line #
#############################

# name your templates template_templatename. you can create your own, and use them in your module definitions above.

[template_os]
    hourly = 48
    daily = 30
    monthly = 3
    yearly = 0
    autosnap = yes
    autoprune = yes
    hourly_warn = 2880
    hourly_crit = 3600
    daily_warn = 48
    daily_crit = 60

[template_data]
    hourly = 48
    daily = 30
    monthly = 12
    yearly = 7
    autosnap = yes
    autoprune = yes
    hourly_warn = 2880
    hourly_crit = 3600
    daily_warn = 48
    daily_crit = 60

[template_backup]
    autoprune = yes
    hourly = 48
    daily = 30
    monthly = 12
    yearly = 7

    ### don't take new snapshots - snapshots on backup 
    ### datasets are replicated in from source, not
    ### generated locally
    autosnap = no

    ### monitor hourlies and dailies, but don't warn or 
    ### crit until they're over 48h old, since replication 
    ### is typically daily only
    hourly_warn = 2880
    hourly_crit = 3600
    daily_warn = 48
    daily_crit = 60
redmop commented 9 years ago

I've not made any changes to the code yet.

I thought it was just initially filling out the snapshot count (in other words, taking the 7 yearly snapshots I requested), but I have 13 now.

jimsalterjrs commented 9 years ago

Not sure how you managed that. If you're doing --take-snapshots directly, that might have bugs in it, because, well, I don't actually use that in production so it's a lot less heavily tested. =)

I know that using --cron in a crontab * * * * * doesn't produce extra snapshots like that.


jimsalterjrs commented 9 years ago

I bet --take-snapshots isn't updating the cache. Try setting the cache expiration to 0 in your sanoid.conf - you can find the syntax in sanoid.defaults.conf (but don't edit that file directly!)


redmop commented 9 years ago

I'm not using --take-snapshots. I'm still playing with it, so I am following the directions exactly: * * * * * /usr/local/bin/sanoid --cron

I don't see anything like that in sanoid.defaults.conf. Do you mean either of these?

my $forcecacheupdate = 0;
my $cacheTTL = 900; # 15 minutes

Also, it seems to have stabilized at 13 yearly snapshots. I did only ask for 7 though.

redmop commented 9 years ago

Hourly snapshots are also messing up; I'll just paste all the snapshots it's taken so far. Maybe the script is sensitive to high load on the pool: I was using syncoid on it for a while today. This is an old server getting ready to be retired. I was using zfSnap on here before; it takes recursive snapshots, which might be faster / less load-sensitive.

dpool/data@autosnap_2015-09-26_13:49:03_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:49:03_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:49:03_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:49:03_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:44:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:44:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:44:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:50:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:44:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:50:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:50:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:50:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:45:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:45:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:54:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:45:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:54:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:45:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:54:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:54:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:47:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:47:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:52:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:47:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:51:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:47:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:51:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:52:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:52:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:51:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:51:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:52:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:53:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:53:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:53:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:48:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:48:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:53:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:48:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:48:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_14:01:02_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:55:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:55:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:55:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:55:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:56:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:46:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:56:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:46:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_13:56:01_daily                            written   0        -
dpool/data@autosnap_2015-09-26_14:00:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_13:56:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_13:46:01_yearly                           written   0        -
dpool/data@autosnap_2015-09-26_13:46:01_monthly                          written   0        -
dpool/data@autosnap_2015-09-26_14:02:02_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_15:00:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_15:01:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_15:02:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_15:03:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_15:04:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_16:01:03_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_16:02:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_16:00:02_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_16:03:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_16:05:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_16:06:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_16:04:01_hourly                           written   0        -
dpool/data@autosnap_2015-09-26_17:00:01_hourly                           written   0        -
redmop commented 9 years ago

I'm not currently stressing the system, and it seems to be doing a better job.

Any particular reason you're not using recursive snapshots?

jimsalterjrs commented 8 years ago

Have not seen this issue on any systems or heard "me toos" from anybody else - closing.

jjlawren commented 8 years ago

I've also been seeing this happen. I'm still reading through the code, but my gut is telling me it's because there isn't a lock while snapshots are being taken and multiple sanoid instances are running in parallel.

jimsalterjrs commented 8 years ago

It's possible, if you've got a really heavily loaded system. It would need to be so heavily loaded that you had snapshot creation taking longer than the time in between sanoid --cron runs, though, which sounds... pretty brutal.

Let me know if you figure out something different.

jjlawren commented 8 years ago

I don't think it's about load; I think it's more about the imposed 1-second sleep between snaps. I saw it happen on my first run with 40 ZFS datasets. Take a monthly, daily, and hourly for each (that's 120 snapshots, so roughly two minutes of sleeping alone), add in the small delays for taking the snapshots themselves, and you're easily over a full minute of runtime (the default crontab interval).

Perhaps it would be safest to just have a lock to avoid multiple instances taking snapshots simultaneously.
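
For anyone who wants to guard against overlapping runs without touching sanoid itself, a minimal sketch of that lock (assuming flock(1) from util-linux is available; the lock-file path here is arbitrary) is to wrap the crontab entry:

* * * * * flock -n /var/run/sanoid-cron.lock /usr/local/bin/sanoid --cron

With -n, flock exits immediately instead of queueing when the previous run still holds the lock, so a slow run can never overlap with the next scheduled one.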

jimsalterjrs commented 8 years ago

Well, that could certainly do it, if you have that many datasets and haven't taken many snaps.

The sleeps are actually in there to keep a chronological order available for the snaps. Granularity on the birth time for snaps is a full second, so ZFS doesn't have any way of knowing whether the hourly, daily, monthly, or yearly is "older" if they're all taken during the same second. That ended up being really obnoxious, to the point that I added the sleeps to make sure no two snaps had the exact same birth time.

Though TBH I'm forgetting now WHY that was so obnoxious, given that each type of snapshot has its own separate policy. It /did/ cause some obnoxious issue, though, I remember that much...
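
To illustrate the one-second granularity (an example command, not output from this thread): the creation property is reported in whole seconds, so two snapshots born in the same second come back with identical timestamps.

# -H = script-friendly output, -p = numeric values (seconds since the epoch)
zfs get -Hp creation dpool/data@autosnap_2015-09-26_13:50:01_hourly
zfs get -Hp creation dpool/data@autosnap_2015-09-26_13:50:01_daily

If both were taken inside the same second, the two creation values are identical, which is why sanoid sleeps between snaps to force distinct birth times.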


jjlawren commented 8 years ago

Possible to reopen this request?

redmop commented 8 years ago

I still have this happen from time to time, and I set sanoid to run every 5 min. The system isn't really loaded, and sanoid is managing about 10 datasets with 48 hourly, and 7 daily retention.

jimsalterjrs commented 8 years ago

I'm reopening this, but since I've been unable to duplicate I don't know that it's going to get resolved any time soon. If anybody else can figure out why it might be happening OTHER than extreme load, I'm more than willing to poke at it and resolve. Or if you want to give me remote access to a system that's experiencing the issue regularly and testably, that might work.

Until then, it's hard for me to fix something that I can't repeatably break. I don't experience this issue on any of the 100+ Sanoid hosts I manage.

kjbuente commented 7 years ago

I have a machine that is taking double snapshots. It will do one at 13:00 and then another at 13:01. I tried to set the cron job to only run sanoid every five minutes, but that just meant the duplicate snapshot came five minutes later instead of one. Daily and monthly seem fine. I am running NAS4Free; not sure if it is a Linux vs *BSD thing or not. Remote access is not out of the question.

jessiebryan commented 7 years ago

I actually noticed the same thing today. My version was about 4 months old, so I updated it today from master. ZoL - Ubuntu 16.04

jimsalterjrs commented 7 years ago

Did upgrading to current solve your issue?



Anderath commented 7 years ago

@redmop

Are all the datasets you specified in your sanoid.conf currently existing on the machine you're running the cronjob on?

See #43

I had left the sample datasets in there, and it messed up the retention I had set; snapshots were not taken properly until I explicitly listed only pools that exist on my system.
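
A quick way to cross-check (standard ZFS tooling, nothing sanoid-specific) is to list what actually exists and compare it against every [section] in sanoid.conf:

zfs list -o name

Any section that names a dataset missing from that list matches the situation described above.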

jessiebryan commented 7 years ago

I am still seeing excessive hourly snapshots within the same hour. Perhaps it's my CFG? Take a look:

http://hastebin.com/rowoweheti.coffeescript

ionutz22 commented 7 years ago

I observed similar issues with dailies on one of my systems (CentOS 7, ZoL). Timezone: EST, sanoid: 1.4.6c

http://www.hastebin.com/uhadikonot.coffeescript

It looks like all daily snapshots were taken at 23:59:01 every day ... but somehow on the 6th of November it took daily snapshots every minute.

Could this be related to Daylight Saving Time? Even in the EST timezone 2 AM becomes 1 AM ... which doesn't coincide with the times on the snapshots.

-I.

svennd commented 7 years ago

I hit it because a stale lock had remained, causing ps to throw an error. Might not be related, though. (I had a few thousand snapshots.)

jimsalterjrs commented 7 years ago

Closing again, filed under "wtflol". If somebody can produce a replicable test case, please let me know.

redmop commented 7 years ago

@Anderath I've not used sanoid for a while, though I will be setting it up again on 3 servers within the next week, so I don't know how I had the datasets set up.

varesa commented 6 years ago

@jimsalterjrs I am also seeing monthlies accumulate daily without a limit. Unfortunately I can't (at least yet) produce a replicable test case, but I noticed that all the snapshot times seem to correspond to when the system was (re)started in the morning (see the note after the listing below):

ssd/vms@autosnap_2018-08-31_07:53:01_monthly        1.73M      -  98.7G  -
ssd/vms@autosnap_2018-09-03_07:36:01_monthly        2.94M      -  98.8G  -
ssd/vms@autosnap_2018-09-04_07:52:02_monthly        2.29M      -  98.9G  -
ssd/vms@autosnap_2018-09-05_07:36:01_monthly        1.71M      -  98.9G  -
ssd/vms@autosnap_2018-09-06_07:37:02_monthly        4.21M      -  99.0G  -
ssd/vms@autosnap_2018-09-07_07:36:02_monthly        1.75M      -  99.3G  -
ssd/vms@autosnap_2018-09-10_07:36:02_monthly        5.75M      -  99.4G  -
ssd/vms@autosnap_2018-09-12_07:59:02_monthly        26.9M      -  99.5G  -
ssd/vms@autosnap_2018-09-13_07:39:02_monthly        11.0M      -  99.6G  -
ssd/vms@autosnap_2018-09-14_07:35:02_monthly        2.03M      -  99.8G  -
ssd/vms@autosnap_2018-09-17_07:37:02_monthly        5.45M      -  99.9G  -
ssd/vms@autosnap_2018-09-18_07:36:01_monthly        1.03M      -   100G  -
ssd/vms@autosnap_2018-09-19_07:37:01_monthly        7.93M      -   100G  -
ssd/vms@autosnap_2018-09-21_07:40:01_monthly         812K      -   101G  -
ssd/vms@autosnap_2018-09-22_14:00:01_monthly         188K      -   101G  -
ssd/vms@autosnap_2018-09-23_13:14:01_monthly        2.21M      -   101G  -
Filesystem ssd/vms has:
     124 total snapshots (newest: 0.6 hours old)
          36 hourly
              desired: 36
              newest: 0.6 hours old, named autosnap_2018-09-23_14:00:01_hourly                                                               
          58 monthly
              desired: 3
              newest: 1.4 hours old, named autosnap_2018-09-23_13:14:01_monthly                                                              
          30 daily
              desired: 30
              newest: 1.4 hours old, named autosnap_2018-09-23_13:14:01_daily                                                                

https://gist.github.com/varesa/d2da38d8ad245f8536567202dbb841c7
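
One way to check that restart correlation (assuming a systemd host with a persistent journal; this is an illustrative command, not something from the thread) is to compare boot times against the snapshot timestamps:

journalctl --list-boots

The first and last entry timestamps listed for each boot can then be lined up against the autosnap times above.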

kiwichrish commented 3 years ago

I know this is a really old ticket... but 'me too'. :-)

On Ubuntu 16.04, stock sanoid from the repo..

sanoid --version

/usr/sbin/sanoid version 2.0.3 (Getopt::Long::GetOptions version 2.45; Perl version 5.22.1)

Destroyed all the snapshots on one filesystem earlier today, and now I've got:

data/vms/pbcinfo-root@autosnap_2021-02-03_06:00:01_hourly    530K      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_19:09:32_monthly   292K      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_06:15:01_weekly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_06:15:01_daily        0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_06:45:02_monthly      0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_06:45:02_weekly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_06:45:02_daily        0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_06:45:02_hourly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_07:00:01_hourly    664K      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_07:30:02_monthly      0      -  8.75G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_07:30:02_weekly       0      -  8.75G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_07:30:02_daily        0      -  8.75G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_07:30:02_hourly       0      -  8.75G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_08:00:00_monthly      0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_08:00:00_weekly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_08:00:00_daily        0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_08:00:00_hourly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_09:00:01_monthly      0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_09:00:01_weekly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_09:00:01_daily        0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_09:00:01_hourly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_09:30:01_monthly      0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_09:30:01_weekly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_09:30:01_daily        0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_09:30:01_hourly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_10:00:02_monthly      0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_10:00:02_weekly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_10:00:02_daily        0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_10:00:02_hourly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_10:30:02_monthly      0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_10:30:02_weekly       0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_10:30:02_daily        0      -  8.74G  -
data/vms/pbcinfo-root@autosnap_2021-02-03_10:30:02_hourly       0      -  8.74G  -
kiwichrish commented 3 years ago

Premature enter... :-)

I've got three 16.04 machines running sanoid; the other two are fine, it's just this one that's going crazy with snapshots every 30 mins.

I've just disabled the systemctl timer and created a cron job that runs every 30 mins instead.
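
For reference, that swap looks roughly like this (assuming the packaged timer unit is named sanoid.timer, which may differ by distro, and using the binary path from the repo package):

systemctl stop sanoid.timer
systemctl disable sanoid.timer
# then, in root's crontab:
*/30 * * * * /usr/sbin/sanoid --cron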

config is really basic:

[data/vms]
    use_template = hourly
    recursive = yes
    process_children_only = yes

[template_hourly]
    frequently = 0
    hourly = 24
    daily = 14
    weekly = 5
    monthly = 12
    yearly = 0
    autosnap = yes
    autoprune = yes

This server is a wee bit slow, but not crazily slow; it's a Dell 620 with 10k RPM spinning-rust SAS drives in RAID 10 for ZFS and an SSD for the OS /boot.

Not many filesystems:

# zfs list -o name
NAME
data
data/iso
data/vms
data/vms/atetftp-root
data/vms/jump28-root
data/vms/pbcinfo-root
data/vms/radius2-root
data/vms/standard
data/vms/standard/ns2

Although, due to this issue, there are a 'few' snapshots:

# zfs list -t snapshot | grep -c data
4655
#

None dated older than the 30th of December last year, when I discovered the issue and destroyed all the snapshots to see if it came back clean...

We'll be rebuilding this box in the next month or two to 20.04, as 16.04 is EOL in April, but I thought I'd add this to this issue in case it triggers something, so to speak.