osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

openstack_exporter metrics endpoint responds very slow / exceeds scrape timeout #895

Open. Nils98Ar opened this issue 5 months ago

Nils98Ar commented 5 months ago

The GET request takes roughly 2.5 minutes, which is longer than the scrape_timeout:

dragon@mon01:~$ time curl http://<IP>:9198/metrics

[...]

real    2m32.321s
user    0m0.020s
sys     0m0.024s
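
For context: Prometheus aborts a scrape once scrape_timeout elapses, and scrape_timeout must not exceed scrape_interval, so a response that takes ~2m30s can never be scraped successfully with a 45s timeout. A minimal sketch of what the relevant scrape job presumably looks like (job name and target address are assumptions based on the curl example above):

scrape_configs:
  - job_name: openstack_exporter        # assumed job name
    scrape_interval: 60s                # how often Prometheus scrapes the target
    scrape_timeout: 45s                 # the scrape is cancelled after this, long before the ~2m30s response arrives
    static_configs:
      - targets: ["<IP>:9198"]          # exporter endpoint from the curl example above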

These are the counts of the non-info log entries from the openstack_exporter container, excluding the entries for services that are not deployed (baremetal, container-infra, database, orchestration):

dragon@mon01:~$ docker logs prometheus_openstack_exporter 2>&1 | grep -v "level=info" | grep -E "(2024-03-06T10|2024/03/06)" | cut -d " " -f3- | grep -v "No suitable endpoint could be found in the service catalog" | sort | uniq -c
    617 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
     19 msg="Failed to collect metric for exporter: nova, error: failed to collect metric: running_vms, error: CPUInfo has unexpected type: <nil>" source="exporter.go:123"
     19 msg="Failed to collect metric for exporter: nova, error: failed to collect metric: security_groups, error: Resource not found: [GET https://api-int.aov.cloud:8774/v2.1/os-security-groups], error message: {\"itemNotFound\": {\"code\": 404, \"message\": \"The resource could not be found.\"}}" source="exporter.go:123"
     19 msg="metric: volume_status has been deprecated on cinder exporter in version 1.4 and it will be removed in next release" source="exporter.go:153"

At least the first one does not look right; I am not sure about the other three.

berendt commented 5 months ago

A pretty old version is included at the moment. It will be better with the next OSISM release.

Nils98Ar commented 5 months ago

@berendt It is strange anyway that the issues suddenly started 7 days ago... but waiting for the next OSISM release would be okay for me.

Nils98Ar commented 5 months ago

Maybe the OpenStack Health Mon that we have been running for the last 12-13 days could lead to an increased metrics volume?

berendt commented 5 months ago

Maybe the OpenStack Health Mon that we have been running for the last 12-13 days could lead to an increased metrics volume?

Yes, definitely. The OpenStack Exporter simply hits the API and thus the DB. OHM generates a lot of resources. Depending on the interval at which you scrape and how your control plane is sized, this can generate a considerable load.

Nils98Ar commented 5 months ago

The defaults were:

prometheus_scrape_interval: "60s"
prometheus_openstack_exporter_interval: "{{ prometheus_scrape_interval }}"
prometheus_openstack_exporter_timeout: "45s"

I've now configured the following and we will see if it helps:

prometheus_openstack_exporter_interval: "195s"
prometheus_openstack_exporter_timeout: "180s"
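
If I read the templating correctly (an assumption, I have not checked the rendered prometheus.yml), these variables should end up in the openstack_exporter scrape job roughly as:

scrape_configs:
  - job_name: openstack_exporter
    scrape_interval: 195s   # from prometheus_openstack_exporter_interval
    scrape_timeout: 180s    # from prometheus_openstack_exporter_timeout, must not exceed the interval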

berendt commented 4 months ago

I'm afraid we can only document this for the time being, as it hasn't improved with the latest release of the exporter.