ngosang / restic-exporter

Prometheus exporter for the Restic backup system
MIT License

Alert example fires for maintained snapshots #11

Closed · v4rakh closed this issue 1 year ago

v4rakh commented 1 year ago

Hi,

maybe it's just my use case, but I found it confusing to get alerts for old snapshots which I want to keep, e.g. snapshots older than 15 days in my backups. This happens with the example alert provided in the README because restic-exporter reports a timestamp for each snapshot, and some of those are legitimately old. In my case the alert should only fire if the latest snapshot exceeds a certain age, i.e. a backup has potentially been missed.

Maybe we'd like to add this to the README?

# for one day
(time() - (max without (snapshot_hash) (restic_backup_timestamp))) > (1 * 86400)

# for 15 days as currently outlined in the README
(time() - (max without (snapshot_hash) (restic_backup_timestamp))) > (15 * 86400)
ngosang commented 1 year ago

Alerts are optional and they are provided just as reference.

The common use case is to automate backups with a cron task or another scheduler. I'm doing incremental backups every day, and I have configured the two alerts from the README.

In your case, you can keep the first alert and add custom alerts for specific backups. Could you post the response of the /metrics endpoint? I would like to understand the issue better.

v4rakh commented 1 year ago

As always, thanks for your quick reply.

I'm not sure whether it's an issue or just the expected behavior of the metric. Each snapshot hash is exported with its own gauge, so if a lot of snapshots are retained, they all end up in the metrics. Or is that unexpected?

Here's an example from one of my backups, copied from the exporter's /metrics endpoint:

# HELP restic_check_success Result of restic check operation in the repository
# TYPE restic_check_success gauge
restic_check_success 2.0
# HELP restic_snapshots_total Total number of snapshots in the repository
# TYPE restic_snapshots_total counter
restic_snapshots_total 24.0
# HELP restic_backup_timestamp Timestamp of the last backup
# TYPE restic_backup_timestamp gauge
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="428a4022933f2a1e162cbfa6685055afb27fbaefb20b784c63fbefc33a25d49e",snapshot_tag=""} 1.667183411e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="2ec546bf3f53ecef07491a6536fe1e889b9e2d3a230d26cd3d0b189fd9325bb3",snapshot_tag=""} 1.673836203e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="dba9a400b7961289953865932ed0c142ed218bcc0d736ca8bb92af2141340160",snapshot_tag=""} 1.674441002e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="13a9064b8f2f0c6e176b6997dcd52668660d5f2e8dbcf77ed047f4a26e20ed6b",snapshot_tag=""} 1.675045802e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="57a5b3ab8dfb3d770ff484bf52a75ec94569d8a8e2da68dd932b193675f99629",snapshot_tag=""} 1.679279404e+09

EDIT: By the way, I have multiple exporters running at a time. They are all exposed as different instances and scraped separately. Not sure if that's helpful.

Interestingly enough, I've just run restic snapshots manually, and the result somehow differs:

repository 2663d695 opened (version 1)
ID        Time                 Host        Tags        Paths
-----------------------------------------------------------------------
c72f769a  2023-03-17 01:00:03  mantell                 /home/data/stripped
5afb6fb9  2023-03-20 01:00:15  mantell                 /home/data/stripped
599983ef  2023-03-22 01:00:14  mantell                 /home/data/stripped

The hashes don't seem to match, though this is the first time I've taken a closer look.

ngosang commented 1 year ago

The label "snapshot_hash" in the exporter is not the snapshot hash in Restic. The hash in Restic changes frequently when you do maintenance operations or full backups. The exporter hash is calculated from the hostname and the path => https://github.com/ngosang/restic-exporter/blob/f2fe3aff545ea3402ba3427937bd0a38f05efb67/restic-exporter.py#L279 In your case, you should have just 1 line in the exporter because the hostname and path match. :thinking:
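
For reference, the calculation is roughly like this (a minimal sketch, not the exact code; the precise fields and concatenation are in the linked source, and SHA-256 is only assumed here because the labels are 64 hex characters):

import hashlib

def calc_snapshot_hash(snapshot: dict) -> str:
    # The "backup identity" is who made it and what it covers, not the
    # Restic snapshot ID (which changes on every run).
    text = snapshot["hostname"] + snapshot["username"] + ",".join(snapshot["paths"])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()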

Update: Could you run this command?

restic snapshots --json --latest 1

v4rakh commented 1 year ago

I cleaned up my setup by assigning different networks to each of the exporters in my docker-compose file, but I guess the root cause is something different: that solved it for one exporter, but now the other one has more entries.

Correct one:

# HELP restic_backup_timestamp Timestamp of the last backup
# TYPE restic_backup_timestamp gauge
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="8bc201180ab9e700369b659f3f0a75c99dd1a3a72fbbd9e8ad24389d3917cead",snapshot_tag=""} 1.679443214e+09
[{"time":"2023-03-22T01:00:14.29359609+01:00","parent":"5afb6fb92919f038a09f8f53fd6181b556e16ed0d1a7b5c31d3666696c11ef10","tree":"2c9a7be5967957a22deb5e28b5fc89ebdce29ffe70465bba46e145f1d074d8f2","paths":["/home/data/music"],"hostname":"mantell","username":"root","id":"599983efe95ad60dcf93a8ef05a23e6ab38f9b11156849c91fe3a57d8ce80c7e","short_id":"599983ef"}]

Sorry, closed too early.

Now I checked another exporter and its underlying repository with the command you asked for, and it actually returned an array with more than one entry.

[
  {
    "time": "2022-10-31T03:30:11.085083219+01:00",
    "parent": "19e95feb063cab22ed2281fbfed5aa16c53c03ab62ed312b06e55a91e5bb2244",
    "tree": "7fc6f808fd4165c3736d85f517420d8cdee14c59c406df69600c8a1683a15599",
    "paths": ["/etc"],
    "hostname": "mantell",
    "username": "root",
    "id": "c7ef0b0cc48bdc723b8822589ea5988fd4c3671c08f581aa5d79fab26d1c9690",
    "short_id": "c7ef0b0c"
  },
  {
    "time": "2023-01-16T03:30:03.544488116+01:00",
    "parent": "4f5b4ef0d38223ccabff879b075e2361f7b9c373ea8b05c622236c0cc160b2b7",
    "tree": "302e423e9321b826044973b8a7591099346fdef73f6c1cd1e5c77c2d3d450906",
    "paths": ["/etc"],
    "hostname": "mantell",
    "username": "root",
    "id": "a3381e1fecced88ec099b447355fc22f10fdb4b7b24a33805fdf4c024eb31924",
    "short_id": "a3381e1f"
  },
  {
    "time": "2023-01-23T03:30:02.336435099+01:00",
    "tree": "ad4ba7054b03c3325b62a3f6e724a2160ba13cbecfe266523ef5ea4a640ff4ed",
    "paths": ["/etc"],
    "hostname": "mantell",
    "username": "root",
    "id": "18907418d224b7cb3dbaa2f9a1e80bed02a15b4de8d0c06d7276c2636655fab3",
    "short_id": "18907418"
  },
  {
    "time": "2023-01-30T03:30:02.926379261+01:00",
    "parent": "9176548f5fe7369dda9f923b229e8bb8fd19a2bd9a0e1bc2ca462a7d157d7608",
    "tree": "66223f24e4b670ebb836404a9cbc403627c30a455de7c789f96768c9949f22c5",
    "paths": ["/etc"],
    "hostname": "mantell",
    "username": "root",
    "id": "486b28989cfffac24a34b037b14cacdaa3343849e499c3f41351f1b80eb7967a",
    "short_id": "486b2898"
  },
  {
    "time": "2023-03-20T03:30:04.545656117+01:00",
    "parent": "8c45d75b8a8c08984556e352397850a1b0646359280a1a561463c04835d93c88",
    "tree": "db41d0846694a609f9b9a651ca91dceb371792b9442c294947d1387e7f91a839",
    "paths": ["/etc"],
    "hostname": "mantell",
    "username": "root",
    "id": "4d1d3f09f6f056a905d6f295375cea23d28e83dfe10da2331307b906c4e64959",
    "short_id": "4d1d3f09"
  }
]

The exporter reports the following:

# TYPE restic_backup_timestamp gauge
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="428a4022933f2a1e162cbfa6685055afb27fbaefb20b784c63fbefc33a25d49e",snapshot_tag=""} 1.667183411e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="2ec546bf3f53ecef07491a6536fe1e889b9e2d3a230d26cd3d0b189fd9325bb3",snapshot_tag=""} 1.673836203e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="dba9a400b7961289953865932ed0c142ed218bcc0d736ca8bb92af2141340160",snapshot_tag=""} 1.674441002e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="13a9064b8f2f0c6e176b6997dcd52668660d5f2e8dbcf77ed047f4a26e20ed6b",snapshot_tag=""} 1.675045802e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="57a5b3ab8dfb3d770ff484bf52a75ec94569d8a8e2da68dd932b193675f99629",snapshot_tag=""} 1.679279404e+09
ngosang commented 1 year ago

Could you try the previous release, 1.1.0?

v4rakh commented 1 year ago

Same result. By the way, I've always had this and thought it was expected behavior that every snapshot gets its own gauge. :-)

So maybe it wasn't the network setup after all. That would have been weird. Anyway, the output of restic snapshots --json --latest 1 also seems to report more than one entry.

Not sure if it helps, but the restic version is 0.15.1 on the host which creates the snapshots.

ngosang commented 1 year ago

The results you posted in https://github.com/ngosang/restic-exporter/issues/11#issuecomment-1481644538 don't make sense. I calculated the hash, and it's impossible that the JSON with 5 snapshots produces those 5 metrics. Could you double-check that you are getting the JSON and the metrics from the same repository? I have some ideas to improve the code, but I have to reproduce the issue first.
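
This is the kind of cross-check I mean (a rough sketch, assuming the hash is SHA-256 over hostname, username, and paths as outlined above, and that RESTIC_REPOSITORY and RESTIC_PASSWORD are set in the environment):

import hashlib
import json
import subprocess

# Dump all snapshots from the repository under test.
raw = subprocess.run(
    ["restic", "snapshots", "--json"],
    capture_output=True, text=True, check=True,
).stdout

for snap in json.loads(raw):
    # Recompute the exporter-style backup hash for each snapshot and
    # compare it by eye against the snapshot_hash labels in /metrics.
    text = snap["hostname"] + snap["username"] + ",".join(snap["paths"])
    print(snap["short_id"], hashlib.sha256(text.encode("utf-8")).hexdigest())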

v4rakh commented 1 year ago

This is from one repository: https://github.com/ngosang/restic-exporter/issues/11#issuecomment-1481612182 And this is from the other (the second half of the post): https://github.com/ngosang/restic-exporter/issues/11#issuecomment-1481644538 I'll double check later.

ngosang commented 1 year ago

Test this PR. It should fix your problems => https://github.com/ngosang/restic-exporter/pull/12

v4rakh commented 1 year ago

Thanks for providing this. I tested it by building the Docker image locally, and I get the very same results.

# HELP restic_backup_timestamp Timestamp of the last backup
# TYPE restic_backup_timestamp gauge
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="20c494f14bbb7e5188a4b36702a1dcce59baa4c516f34268106f92f494eba783",snapshot_tag=""} 1.667183411e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="734c259855b1ad6067777f85598521cab79a4d0fd5a149b4698d8081de33ca88",snapshot_tag=""} 1.673836203e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="f5247872f6ead56c1d7add82e8cbe2d873f47f894fdacd93f22b4b5140273a3b",snapshot_tag=""} 1.674441002e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="8571d9536f0d1616c6e99ad3a1f68d94af547e2616a343e29e97e3c4a2ed557f",snapshot_tag=""} 1.675045802e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="14fb46e68138c0033e32be3a67ac6aa20f6c95f6f4f339e1ee83e1cec4ad8d93",snapshot_tag=""} 1.679279404e+09
ngosang commented 1 year ago

@v4rakh Please compile this PR and share the log traces: https://github.com/ngosang/restic-exporter/pull/13

v4rakh commented 1 year ago

See comment in #13.

ngosang commented 1 year ago

@v4rakh When I run the code with your dump I see 5 metrics:

restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="428a4022933f2a1e162cbfa6685055afb27fbaefb20b784c63fbefc33a25d49e",snapshot_tag=""} 1.667183411e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="2ec546bf3f53ecef07491a6536fe1e889b9e2d3a230d26cd3d0b189fd9325bb3",snapshot_tag=""} 1.673836203e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="dba9a400b7961289953865932ed0c142ed218bcc0d736ca8bb92af2141340160",snapshot_tag=""} 1.674441002e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="13a9064b8f2f0c6e176b6997dcd52668660d5f2e8dbcf77ed047f4a26e20ed6b",snapshot_tag=""} 1.675045802e+09
restic_backup_timestamp{client_hostname="mantell",client_username="root",snapshot_hash="57a5b3ab8dfb3d770ff484bf52a75ec94569d8a8e2da68dd932b193675f99629",snapshot_tag=""} 1.679279404e+09

That is totally fine, because each of your backups covers different folders. For example, you don't have embyserver in the other snapshots:

        "paths": [
            "/etc",
            "/home/admin",
            "/home/musicstreamer",
            "/opt/embyserver/config",
            "/opt/jellyfin/config",
            "/opt/portainer",
            "/opt/unifi",
            "/root",
            "/tmp/package_list.txt",
            "/tmp/package_list_aur.txt",
            "/usr/local/bin"
        ],
        "hostname": "mantell",
        "username": "root",

        "paths": [
            "/etc",
            "/home/admin",
            "/home/musicstreamer",
            "/opt/jellyfin/config",
            "/opt/nodered_data",
            "/opt/portainer",
            "/opt/prometheus_config",
            "/opt/unifi",
            "/root",
            "/tmp/package_list.txt",
            "/tmp/package_list_aur.txt",
            "/usr/local/bin"
        ],
        "hostname": "mantell",
        "username": "root",

The function to calculate the hash takes into account the username, hostname, and paths. If you have different paths, they are considered different backups. That makes sense to me if you want to track the number of files or the size over time; you cannot compare different things.

https://github.com/ngosang/restic-exporter/blob/1c55ffe6e6cd136e298bc456cb2558c857dada52/restic-exporter.py#L288
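
As a quick illustration (hypothetical path lists, using the same assumed hashing as the sketch earlier in the thread), a single extra folder produces a different snapshot_hash and therefore a separate metric series:

import hashlib

def backup_hash(hostname: str, username: str, paths: list[str]) -> str:
    # Same assumed scheme as above: identity = host + user + path list.
    text = hostname + username + ",".join(paths)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = backup_hash("mantell", "root", ["/etc", "/opt/embyserver/config"])
b = backup_hash("mantell", "root", ["/etc"])
print(a != b)  # True: different paths => treated as different backups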

Since most of your backups have the same folders, I would recommend including all folders in just one backup and deleting all previous backups. I'm closing this, since I cannot fix what is not broken.

v4rakh commented 1 year ago

I agree. Changing source directories can be a requirement, though: not everyone will create a new restic repository or wipe all existing snapshots when adding new paths or changing existing ones, e.g. for a newly installed application. In my example I could have used /opt instead of listing the directories individually, or worked with excludes, though I prefer to include them explicitly.

Just an idea: would it be an option to add an env var that controls whether paths are included in the hash calculation, enabled by default so the current behavior is kept? A sketch of what I mean follows.
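
Roughly like this (only a sketch; INCLUDE_PATHS_IN_HASH is a made-up variable name, and the default keeps today's behavior):

import hashlib
import os

# Hypothetical toggle: when disabled, snapshots that differ only in their
# path list collapse into one metric series per host/user.
INCLUDE_PATHS = os.environ.get("INCLUDE_PATHS_IN_HASH", "true").lower() == "true"

def calc_snapshot_hash(snapshot: dict) -> str:
    text = snapshot["hostname"] + snapshot["username"]
    if INCLUDE_PATHS:
        text += ",".join(snapshot["paths"])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()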

Also, documenting the different alerts mentioned above could help people with a similar setup. Saying this isn't a use case is probably not right; source directories won't stay the same forever. Your call. If you keep the current behavior, I would still propose documenting how the hash is actually calculated and what impact it has on the underlying metrics.

Thanks for looking into it in depth.

ngosang commented 1 year ago

#14