prometheus / client_python

Prometheus instrumentation library for Python applications
Apache License 2.0

Remove gauge metric #1055

Closed danielstankw closed 1 week ago

danielstankw commented 1 month ago

Hi all, I have written a custom exporter that calculates the cost of running a node in AWS. Here is how it works: let's assume I have 3 nodes, each costing $5/hour. When I plot sum(cost_metric{}) in Grafana, I get $15 (3 x $5). Let's say that after 2 hours one of the nodes gets deleted (autoscaling). In that case the total cost should drop to $10.

The problem is that in my case the metric is preserved: even though the node has been deleted, its cost is kept, so the total shows $15 instead of dropping to $10.

How would I go about solving that problem?

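from prometheus_client import Gauge

# Metric names must not contain whitespace, so no trailing space in the name.
cost_metric = Gauge(
    "cost_metric",
    "Cost of running an instance for 1 hour",
    ["node_name", "instance_type"],
)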
...

node_names = get_nodes()
for node_name in node_names:
    node_info = get_node_info(node_name)
    if node_info is None:
        continue

    logging.info(f"Updating metrics for node: {node_name}")

    # labels section
    labels = node_info["metadata"]["labels"]
    instance_type = labels.get("beta.kubernetes.io/instance-type", "unknown")
    cost = get_cost_of_instance(instance_type)

    if cost is not None:
        cost_metric.labels(node_name=node_name, instance_type=instance_type).set(cost)

What I tried:

I collected the previous and current nodes in dicts and then wanted to remove the ones that no longer exist. The issue is that the following call fails:

cost_metric.remove(node_name=node_name, instance_type=instance_type)

Traceback (most recent call last):
  File "/home/XXXX/projects/main.py", line 154, in <module>
    previous_nodes = update_metrics(previous_nodes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eksohio/projects/main.py", line 131, in update_metrics
    cost.remove(node_name=node_name, instance_type=instance_type)
TypeError: MetricWrapperBase.remove() got an unexpected keyword argument 'node_name'

danielstankw commented 1 month ago

I can set the cost to 0 as a workaround, but the series that is no longer needed would still be preserved and, over time, consume space. :/
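
On the TypeError above: in prometheus_client, remove() takes the label values as positional arguments, in the order the label names were declared, rather than as keyword arguments. A minimal sketch of the previous/current bookkeeping described above follows. It assumes a private CollectorRegistry and made-up node names, instance types, and prices; update_metrics and total_cost are hypothetical helpers written for this illustration, not part of the library.

```python
from prometheus_client import CollectorRegistry, Gauge

# A private registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()
cost_metric = Gauge(
    "cost_metric",
    "Cost of running an instance for 1 hour",
    ["node_name", "instance_type"],
    registry=registry,
)


def update_metrics(previous, current):
    """Set one series per live node and drop series for nodes that vanished.

    Both arguments map (node_name, instance_type) -> hourly cost.
    """
    for label_values in previous.keys() - current.keys():
        # remove() expects the label values positionally, in declaration order.
        cost_metric.remove(*label_values)
    for (node_name, instance_type), cost in current.items():
        cost_metric.labels(node_name=node_name, instance_type=instance_type).set(cost)
    return current


def total_cost():
    # Sum every exported sample, similar to sum(cost_metric) in PromQL.
    return sum(
        sample.value
        for family in registry.collect()
        for sample in family.samples
    )


# Hypothetical data: three nodes at 5 $/h, then node-c is scaled away.
three_nodes = {
    ("node-a", "m5.large"): 5.0,
    ("node-b", "m5.large"): 5.0,
    ("node-c", "m5.large"): 5.0,
}
two_nodes = {key: cost for key, cost in three_nodes.items() if key[0] != "node-c"}

previous_nodes = update_metrics({}, three_nodes)
cost_before = total_cost()
previous_nodes = update_metrics(previous_nodes, two_nodes)
cost_after = total_cost()
```

With this bookkeeping the series for the deleted node is gone from the exporter's output rather than lingering at its last value or at 0. The trade-off is that the exporter has to track the previous label sets itself.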

csmarchbanks commented 2 weeks ago

Hello, this sounds like the use case for a custom collector: https://prometheus.github.io/client_python/collector/custom/. You will only add metrics for the nodes that you want to include in the output so no extra series will be left around.

danielstankw commented 2 weeks ago

@csmarchbanks thanks for the hint. Would you be able to elaborate a bit more on how that would work?

csmarchbanks commented 1 week ago

That would work by running your get_nodes and other logic during each scrape, so that cost_metric exists only for the lifetime of the scrape. That way, if an instance disappears, it will automatically not appear in the next scrape's output. Adapting the example a bit for your case (I have not run/tested this but it should give the idea):

from prometheus_client.core import GaugeMetricFamily, REGISTRY
from prometheus_client.registry import Collector

class CustomCollector(Collector):
    def collect(self):
        # Built fresh on every scrape, so deleted nodes simply stop appearing.
        cost_metric = GaugeMetricFamily(
            "cost_metric",
            "Cost of running an instance for 1 hour",
            labels=["node_name", "instance_type"],
        )

        node_names = get_nodes()
        for node_name in node_names:
            node_info = get_node_info(node_name)
            # ... collect label info, etc... from your code.
            if cost is not None:
                # Metric families take label values as a list, in label-name order.
                cost_metric.add_metric([node_name, instance_type], cost)

        yield cost_metric

REGISTRY.register(CustomCollector())

danielstankw commented 1 week ago

@csmarchbanks I will test it out, thanks a ton for taking the time to provide an example :)