prometheus / pushgateway

Push acceptor for ephemeral and batch jobs.
Apache License 2.0

conflict pushing metrics with the same job, different instance #551

Closed kr886q closed 1 year ago

kr886q commented 1 year ago

Feature request

So that async ephemeral tasks can push results to Prometheus without conflict. Data is being lost and I don't know why.


Bug Report

What did you do?

I have a Flask endpoint that is responsible for launching the following Celery task:


import re
import time

import paramiko
from celery.contrib.abortable import AbortableTask
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# celery_app, logger, PROMETHEUS_PUSHGATEWAY, PROMETHEUS_PUSHGATEWAY_PORT and
# VNF_MONITOR_FREQUENCY are defined elsewhere in the project.


@celery_app.task(name="monitor_metrics", bind=True, base=AbortableTask)
def monitor_metrics(self, vnf_name, vnf_ip, vnf_user, vnf_pass, suite_id):
    push_gateway = f"{PROMETHEUS_PUSHGATEWAY}:{PROMETHEUS_PUSHGATEWAY_PORT}"

    # Set up the Prometheus gauge in its own registry
    gauge = Gauge('infra_health_manager',  # Metric name
                  'A custom gauge for capturing VNF CPU run by thyme-infra-health-manager',  # Description
                  ['instance', 'cpu'], registry=registry)

    # Initialize SSH Session
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(vnf_ip, username=vnf_user, password=vnf_pass)

    # Infinite loop to get and push metrics to prometheus
    while not self.is_aborted():
        try:    # Try to reuse the existing SSH session
            stdin, stdout, stderr = ssh.exec_command('top -b -n 1')
            output = stdout.read().decode('utf-8')
        except Exception as error:
            print(f"ERROR {error}")
            # Reinitialize the SSH session and retry the command
            ssh = paramiko.SSHClient()
            ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            ssh.connect(vnf_ip, username=vnf_user, password=vnf_pass)
            stdin, stdout, stderr = ssh.exec_command('top -b -n 1')
            output = stdout.read().decode('utf-8')

        # Parse the output using regex
        cpu_pattern = re.compile(r"%Cpu(\d+)\s+:\s+\d+\.\d+/\d+\.\d+\s+(\d+)")
        matches = cpu_pattern.findall(output)
        cpu_usage = {f"Cpu{match[0]}": int(match[1]) for match in matches}  # Store the result in a dictionary

        print(f"{vnf_name} CPU_USAGE: {cpu_usage}")

        for cpu in cpu_usage:
            gauge.labels(vnf_name, cpu).set(cpu_usage[cpu])
        push_to_gateway(push_gateway, job=suite_id, registry=registry)
        time.sleep(VNF_MONITOR_FREQUENCY)

    logger.info(f"Stopping {vnf_name} CPU Monitor")
    return True
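
As far as I can tell from the prometheus_client documentation, push_to_gateway issues a PUT that replaces everything in the Pushgateway group identified by the job label plus any grouping_key labels, so with job=suite_id and no grouping key both tasks write into the same group. A minimal sketch of the change I have been considering (the extra "instance" grouping label is my own guess):

# Sketch: give each monitoring task its own Pushgateway group by adding a
# grouping key, so concurrent tasks in the same testing cycle (same suite_id)
# no longer replace one another's pushes.
#   without grouping_key -> group {job: suite_id}                      (shared by every task)
#   with grouping_key    -> group {job: suite_id, instance: vnf_name}  (one per VM)
push_to_gateway(push_gateway, job=suite_id, registry=registry,
                grouping_key={"instance": vnf_name})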

Each task is supposed to monitor a unique VM and collect its CPU data. Due to certain requirements, this is the only way to get the CPU data out of the VM.

The metrics appear perfectly in Prometheus when there is only one task running. When a second task is launched, the metrics that are stored in Prometheus are very spotty. There appears to be some conflict, but I am not clear on what it is.


Looking at the logs of my Celery task, I can see that the SSH command is succeeding and returning the correct CPU numbers. I have looked into the grouping_key, job, and instance documentation, but it is very thin, and the few changes I have tried made no difference. For a little more background: the job represents the unique ID of a given testing cycle, the instance is the name of a VM, and each instance may have 2-8 CPUs.
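
A related thing I am unsure about: if each VM ends up in its own group, the last pushed gauge values would stay on the Pushgateway after a monitor task stops, so I assume something like prometheus_client's delete_from_gateway is needed when the loop exits. A rough sketch, using the same guessed grouping key as above:

from prometheus_client import delete_from_gateway

# When the monitoring loop exits (the task was aborted), remove this VM's
# group so stale CPU gauges do not linger on the Pushgateway.
delete_from_gateway(push_gateway, job=suite_id,
                    grouping_key={"instance": vnf_name})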

Environment

kr886q commented 1 year ago

Refreshing the metrics endpoint of the Pushgateway, I see that the metrics are getting overridden: sometimes it shows the one VNF, other times it shows the other.
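
To see which groups the Pushgateway is actually holding, rather than refreshing the rendered metrics page, I think its query API can be asked directly. A rough sketch (the /api/v1/metrics endpoint and the JSON shape are from my reading of the Pushgateway docs):

import requests

# List the metric groups the Pushgateway currently holds and print the labels
# that identify each group (the job label plus any grouping-key labels).
url = f"http://{PROMETHEUS_PUSHGATEWAY}:{PROMETHEUS_PUSHGATEWAY_PORT}/api/v1/metrics"
resp = requests.get(url)
resp.raise_for_status()
for group in resp.json().get("data", []):
    print(group.get("labels"))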
