prometheus-pve / prometheus-pve-exporter

Exposes information gathered from Proxmox VE cluster for use by the Prometheus monitoring system
Apache License 2.0
791 stars 93 forks

metrics suggestion: backup jobs, replication jobs #112

Open steveej opened 2 years ago

steveej commented 2 years ago

hey @znerol, thank you for creating this helpful exporter :raised_hands:

I'd like to track, and set up alerts for, failed or absent backups and replications, as well as high IO delay (the value displayed in the web UI for each node).

cheers :wave:

znerol commented 2 years ago

This exporter uses the PVE REST API. Looking through the API docs, I found the following routes that might cover your requirements, at least partly:

- Absent backups: /cluster/backup-info/not-backed-up lists all guests (QEMU and LXC) that are not covered by any backup plan.
- Failed backups: maybe this is extractable from /cluster/backup.
- Failed replications: maybe this is extractable from /cluster/replication.
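As an illustration of the first route, here is a minimal sketch that turns its response into Prometheus metrics. It assumes each returned item looks like {"vmid": 101, "name": "web01", "type": "qemu"}; the metric names pve_not_backed_up_info and pve_not_backed_up_guests and the function names are hypothetical, not part of prometheus-pve-exporter. Fetching the JSON from the API is left out; the sketch only renders the text exposition format.

```python
"""Sketch: report guests that no backup job covers, in Prometheus text format.

Assumption: each item from /cluster/backup-info/not-backed-up looks like
{"vmid": 101, "name": "web01", "type": "qemu"}. The metric names below are
hypothetical and not part of prometheus-pve-exporter.
"""


def escape_label(value: str) -> str:
    """Escape a label value per the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def not_backed_up_metrics(guests: list) -> str:
    """Render one info series per uncovered guest plus a total count."""
    lines = [
        "# HELP pve_not_backed_up_info Guest not covered by any backup job.",
        "# TYPE pve_not_backed_up_info gauge",
    ]
    # Sort by vmid so the output is stable between scrapes.
    for guest in sorted(guests, key=lambda g: g["vmid"]):
        guest_id = escape_label(f'{guest["type"]}/{guest["vmid"]}')
        name = escape_label(guest.get("name", ""))
        lines.append(f'pve_not_backed_up_info{{id="{guest_id}",name="{name}"}} 1')
    # A single count series makes a simple "> 0" alert possible.
    lines += [
        "# HELP pve_not_backed_up_guests Number of guests without a backup job.",
        "# TYPE pve_not_backed_up_guests gauge",
        f"pve_not_backed_up_guests {len(guests)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        {"vmid": 200, "name": "db", "type": "lxc"},
        {"vmid": 101, "name": "web01", "type": "qemu"},
    ]
    print(not_backed_up_metrics(sample), end="")
```

With such a count series exported, an alert rule could fire on pve_not_backed_up_guests > 0, while the per-guest info series tell you which guests are affected.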

Regarding high IO delay, I recommend taking a look at node_exporter. For node-level metrics, it is usually the better option.

steveej commented 2 years ago

thanks @znerol

> /cluster/backup-info/not-backed-up lists all guests (QEMU and LXC) that are not covered by any backup plan.

While I originally meant backup jobs that for some reason didn't execute, I also like the idea of alerting when a VM doesn't have a backup job at all.
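For the "backup jobs that didn't execute" case, one way to frame the check is: flag every guest whose newest backup is missing or older than some maximum age. The sketch below assumes you already have a mapping from vmid to the Unix timestamp of its newest backup, gathered elsewhere (for example from backup volumes listed in storage content); the function name stale_backups and the threshold are hypothetical.

```python
"""Sketch: flag guests whose most recent backup is missing or too old.

Assumption: last_backup maps vmid -> Unix timestamp of the newest backup,
gathered elsewhere (for example from backup volumes in storage content).
The function name and threshold are hypothetical.
"""
import time


def stale_backups(guest_ids, last_backup, max_age_s, now=None):
    """Return sorted vmids with no backup or one older than max_age_s."""
    now = time.time() if now is None else now
    stale = []
    for vmid in guest_ids:
        ts = last_backup.get(vmid)
        # A guest with no recorded backup counts as stale as well.
        if ts is None or now - ts > max_age_s:
            stale.append(vmid)
    return sorted(stale)


if __name__ == "__main__":
    day = 24 * 3600
    now = 1_700_000_000
    last = {100: now - 2 * 3600, 101: now - 3 * day}
    print(stale_backups([100, 101, 102], last, max_age_s=day + 3600, now=now))
    # prints [101, 102]
```

Inside an exporter, the more idiomatic route is probably to export the backup timestamp itself as a gauge and let an alert rule compare it against time(), so the threshold lives in Prometheus rather than in the exporter.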

For the rest, I'll also have a look at the API to see which items would be useful to add.

> Regarding high IO delay, I recommend taking a look at node_exporter. For node-level metrics, it is usually the better option.

Indeed, thanks! I thought PVE was doing something special, but according to the frontend code it evaluates the system's IO wait, which can be gathered by other means.
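To make "the system's IO wait" concrete: on Linux it is the share of CPU time spent in the iowait state between two samples of the aggregate cpu line in /proc/stat. The sketch below computes that standard share; it is not necessarily PVE's exact formula, and the function names are illustrative. With node_exporter, the same counter is exposed as node_cpu_seconds_total{mode="iowait"}.

```python
"""Sketch: compute the IO wait share of CPU time between two /proc/stat samples.

The aggregate "cpu" line lists jiffies as: user nice system idle iowait
irq softirq steal, followed by guest and guest_nice. Guest time is already
counted in user and nice, so only the first eight fields form the total.
"""


def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:9]]


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two samples."""
    prev, curr = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [c - p for c, p in zip(curr, prev)]
    total = sum(deltas)
    # Index 4 is the iowait counter; avoid dividing by zero on identical samples.
    return deltas[4] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as f:
        return f.readline()
```

Usage: call read_cpu_line() twice, a second or so apart, and pass both lines to iowait_fraction(); a result of 0.1 means 10% of CPU time was spent waiting on IO in that interval.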

xziy commented 11 months ago

Hello everyone, is there any progress? I'm facing a similar problem: I need to know which machines were left without a backup, or whose backup failed.

StarkZarn commented 6 months ago

IO wait would be a very useful metric to have, IMO, if possible, especially for those using ZFS for backing storage.

znerol commented 6 months ago

> IO wait would be a very useful metric to have, IMO, if possible, especially for those using ZFS for backing storage.

Please use node_exporter for the iowait metric. Take a look at this blog post for a start.

StarkZarn commented 6 months ago

> > IO wait would be a very useful metric to have, IMO, if possible, especially for those using ZFS for backing storage.
>
> Please use node_exporter for the iowait metric. Take a look at this blog post for a start.

Thank you!

znerol commented 4 months ago

Thanks to @svengerber and @themoriarti, replication metrics are available as of release v3.3.0.