prometheus-pve / prometheus-pve-exporter

Exposes information gathered from Proxmox VE cluster for use by the Prometheus monitoring system
Apache License 2.0

Add ZFS replication metrics #243

Closed · svengerber closed this 5 months ago

svengerber commented 5 months ago

This PR adds replication metrics as requested in issue #112.

I reworked the original PR #166 to the new file structure.

znerol commented 5 months ago

Hi, thanks for filing this PR. It makes sense to scrape the replications per node, so the code location, naming, and overall approach are correct.

I'm not too happy with the metric structure though. This project follows the common practice of exposing an X_info gauge carrying the complete set of labels, while the actual metrics carry only an id label. The following excerpt from the metrics example in the readme demonstrates the approach nicely:

# HELP pve_uptime_seconds Number of seconds since the last boot
# TYPE pve_uptime_seconds gauge
pve_uptime_seconds{id="qemu/100"} 315039.0
[...]
# HELP pve_guest_info VM/CT info
# TYPE pve_guest_info gauge
pve_guest_info{id="qemu/100",name="samplevm1",node="proxmox",type="qemu"} 1.0

Also, it would be great if the guest, source and target labels followed the same pattern (i.e., {type}/{id}). That would simplify joining the new metrics with other time series. I'm not sure whether the id should be changed as well. This could be interesting in combination with a pve_up metric: if there is a way to determine whether a replication job is up (assuming there is a useful definition of up), then it could be interesting to emit pve_up{id="replication/1-0"} 1 with the same {type}/{id} pattern in order to avoid conflicts with other time series.
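A minimal sketch of the X_info pattern described above, using plain data structures rather than the prometheus_client metric families the exporter actually uses (function and field names here are illustrative, not the project's real code):

```python
def build_guest_metrics(guests):
    """Sketch of the info-gauge pattern: pve_guest_info carries the
    full label set, while the actual metrics carry only the shared
    id label in {type}/{vmid} form (e.g. "qemu/100")."""
    info, uptime = [], []
    for guest in guests:
        gid = f"{guest['type']}/{guest['vmid']}"  # shared join key
        info.append(({'id': gid, 'name': guest['name'],
                      'node': guest['node'], 'type': guest['type']}, 1.0))
        uptime.append(({'id': gid}, float(guest['uptime'])))
    return {'pve_guest_info': info, 'pve_uptime_seconds': uptime}
```

With this shape, other time series join onto the info gauge via the shared id label, e.g. `pve_uptime_seconds * on(id) group_left(name, node) pve_guest_info` in PromQL.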

Finally, there is a best-practices document by Prometheus about metric and label naming: metric names "should have a suffix describing the unit, in plural form", and "an accumulating count has total as a suffix, in addition to the unit if applicable".

Thus, I'd expect a similar structure from the replication metrics.

    # HELP pve_replication_info Proxmox replication info
    # TYPE pve_replication_info gauge
    pve_replication_info{id="1-0",guest="qemu/1",source="node/proxmox1",target="node/proxmox2",type="local"} 1

    # HELP pve_replication_duration_seconds Proxmox vm replication duration
    # TYPE pve_replication_duration_seconds gauge
    pve_replication_duration_seconds{id="1-0"} 7.73584
    # HELP pve_replication_last_sync_timestamp_seconds Proxmox vm replication last_sync
    # TYPE pve_replication_last_sync_timestamp_seconds gauge
    pve_replication_last_sync_timestamp_seconds{id="1-0"} 1.713382503e+09
    # HELP pve_replication_last_try_timestamp_seconds Proxmox vm replication last_try
    # TYPE pve_replication_last_try_timestamp_seconds gauge
    pve_replication_last_try_timestamp_seconds{id="1-0"} 1.713382503e+09
    # HELP pve_replication_next_sync_timestamp_seconds Proxmox vm replication next_sync
    # TYPE pve_replication_next_sync_timestamp_seconds gauge
    pve_replication_next_sync_timestamp_seconds{id="1-0"} 1.7134689e+09
    # HELP pve_replication_failures_total Proxmox vm replication fail_count
    # TYPE pve_replication_failures_total gauge
    pve_replication_failures_total{id="1-0"} 0.0
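A rough sketch of how a per-node collector could produce that structure from one replication status entry (plain Python with field names taken from the API output shown later in this thread; the real exporter emits prometheus_client metric families instead of dicts):

```python
def replication_metrics(entry):
    """Map one node replication status entry onto the proposed
    series. The info gauge uses the {type}/{id} pattern for the
    guest, source and target labels; all other series carry only
    the replication id."""
    rid = str(entry['id'])
    info_labels = {
        'id': rid,
        'guest': f"{entry['vmtype']}/{entry['guest']}",
        'source': f"node/{entry['source']}",
        'target': f"node/{entry['target']}",
        'type': entry['type'],
    }
    return {
        'pve_replication_info': (info_labels, 1.0),
        'pve_replication_duration_seconds':
            ({'id': rid}, float(entry['duration'])),
        'pve_replication_last_sync_timestamp_seconds':
            ({'id': rid}, float(entry['last_sync'])),
        'pve_replication_last_try_timestamp_seconds':
            ({'id': rid}, float(entry['last_try'])),
        'pve_replication_next_sync_timestamp_seconds':
            ({'id': rid}, float(entry['next_sync'])),
        'pve_replication_failures_total':
            ({'id': rid}, float(entry['fail_count'])),
    }
```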
svengerber commented 5 months ago

Thank you for the review.

I have updated the following points:

Replication jobs in Proxmox can be enabled or disabled. The Proxmox API returns disabled: 1 if the replication job is disabled and nothing if it is enabled. Based on this I am currently setting the pve_up value.

I have kept the id as the replication id from the Proxmox API. This value is unique within a cluster, and a single VM can have multiple replication jobs, so we cannot use the qemu/lxc id.
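A minimal sketch of that mapping (assuming, as described above, that the API omits the disabled key entirely for enabled jobs):

```python
def up_from_enabled(job):
    """Derive a pve_up-style value from a cluster replication job
    config entry. The Proxmox API returns "disabled": 1 for a
    disabled job and omits the key when the job is enabled."""
    return 0.0 if job.get('disabled') else 1.0
```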

znerol commented 5 months ago

Cool! Thanks a lot for this update. I have to confess that I do not yet fully grasp the data model of the PVE API regarding replication jobs, and I don't have access to a cluster where I could take a look myself. So please bear with me a little while I put together a mental model for it.

According to the commits in this PR and the API docs, there are two different API routes where information about replication jobs can be gathered: one on the cluster and one on the individual nodes.

Regrettably, the API docs do not specify the structure of the returned data. Thus I took a look at the respective admin ui sections in my cluster. It looks like the interesting metrics (i.e., the various timestamps) are only available from the node-endpoint. The cluster endpoint only lists configuration values and no state. If that observation is correct, then scraping the state from nodes is certainly the best option we have.

The current PR seems to scrape the pve_up metric from cluster resources. Hence, people running separate scrape jobs for node and cluster metrics end up scraping replication-related data in two places. I fear that might lead to surprises. Also, basing the pve_up metric on the enabled flag doesn't mix well with the other definitions of pve_up (i.e., VM running, container running, node running). People use the pve_up metric for alerting purposes, and alerting on the basis of a configurable value (enabled/disabled) doesn't seem very useful to me. Thus, let's drop the pve_up metric for replication jobs unless there is a more useful data source for its value.

znerol commented 5 months ago

The PVE docs state:

If a replication job encounters problems, it is placed in an error state. In this state, the configured replication intervals get suspended temporarily.

I think that would be a good data source for pve_up if the error state is available from the nodes/{node}/replication/{id}/status endpoint.
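If the status endpoint does expose the error state, a sketch of that up definition might look like this (the error field name is an assumption here, based on the job status output shown later in this thread):

```python
def up_from_error_state(status):
    """Derive pve_up for a replication job from its status: a job
    in error state (carrying a textual "error" entry) is down,
    everything else is up."""
    return 0.0 if status.get('error') else 1.0
```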

znerol commented 5 months ago

According to the pve source code, there is a chance that a failed job contains an error key in the state object. The value seems to be a textual representation of the error.

On the other hand, the code also indicates that fail_count is reset to zero after a job has recovered. As a result, it is easy to define an alerting rule using the following expression: pve_replication_failures_total != 0. But it also means fail_count is not monotonically increasing, so the _total suffix isn't correct here.

znerol commented 5 months ago

According to this blog post, the _count and _total suffixes shouldn't be used for gauges. Browsing through my metrics, I found one which seems to behave quite similarly to fail_count: alertmanager_cluster_failed_peers:

This metric is a gauge that shows you the current number of failed peers.

So, I guess we could rename pve_replication_failures_total to pve_replication_failed_syncs. I feel that this name naturally hints towards an alerting condition of pve_replication_failed_syncs!=0.

znerol commented 5 months ago

I pushed my proposed changes. @svengerber is that usable in your case?

svengerber commented 5 months ago

@znerol

Thank you for the updates! I have tested them against our cluster with replication jobs and it looks good.

In our case we are monitoring for replication failures and successful syncs, so these metrics will work for us.

znerol commented 5 months ago

Perfect. Did you ever experience replication failures? Do you think the assumption is correct that fail_count is reset to zero after a replication job recovers?

svengerber commented 5 months ago

I have tested this in our environment and fail_count goes back to 0 after a successful replication.

Example return data after a sync failure:

        "fail_count": 1,
        "schedule": "XX:XX",
        "id": "1-0",
        "type": "local",
        "last_try": X,
        "jobnum": 0,
        "rate": 500,
        "guest": 1,
        "vmtype": "qemu",
        "source": "XXX",
        "next_sync": 0,
        "error": "command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=XXX' root@X.X.X.X pvecm mtunnel -migration_network X.X.X.X/X -get_migration_ip' failed: exit code 255",
        "duration": 2.24147,
        "target": "XXX",
        "last_sync": XXXX
znerol commented 5 months ago

I'm looking at this with a fresh mind, this time regarding the coding style, and I fixed a number of things.

I think with those changes, the new code blends in quite nicely with the existing stuff. @svengerber would you mind checking whether I broke something with the latest commits?

svengerber commented 5 months ago

@znerol

Thank you for the update. I have tested your changes against my clusters with replication enabled and it looks good.