openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0

Slow changing metrics are not published to prometheus output #370

Closed (mathieudevos closed 6 months ago)

mathieudevos commented 7 months ago

Hey,

I'm using Prometheus to scrape both "fast" changing data, such as counters (openconfig:/interfaces/interface/state/counters), and data that most likely won't change for days, such as openconfig:/interfaces/interface/state/oper-status.

In my output setup I write to a file and to Prometheus; the file is mostly there to confirm that my information is valid. The outputs look as follows:

outputs:
  prom-output:
    type: prometheus
    listen: gnmic:10103
  file-output:
    type: file
    filename: /app/output
    multiline: true

Prometheus scrapes only the fast-changing data; the slow-changing data, such as the oper-status of the interfaces, never shows up in Prometheus. If I remove the fast-changing data, the metrics endpoint simply stays empty, even though the information is present in the file output. An example of the file output (serials and other identifying items obscured):

{
  "source": "www.switch-target.com",
  "subscription-name": "switchA",
  "timestamp": 1704887357029284763,
  "time": "2024-01-10T11:49:17.029284763Z",
  "target": "switchAserial",
  "updates": [
    {
      "Path": "interfaces/interface[name=Ethernet26/3]/state/oper-status",
      "values": {
        "interfaces/interface/state/oper-status": "DOWN"
      }
    }
  ]
}

However, the file hasn't changed for many minutes, probably because the targets haven't changed their port status, which is normal.

Since my whole stack (Grafana, Prometheus, gnmic) comes up at the same time, Prometheus never gets to scrape this data at all. So my question is: is there a way to force specific metrics to be published every now and then?

I've set heartbeat-interval, and updates-only is set to false. I don't mind duplicate information being gathered in Prometheus; if it becomes too much, I can just fetch the data less often.

I hope this clarifies the issue I'm having. If I'm doing something wrong with the config, please let me know.

hellt commented 7 months ago

Hi @mathieudevos

a couple of quick comments:

  1. Prometheus doesn't store measurements in types other than float. This usually means that you have to use a strings processor first to convert your DOWN string to 0 or 1. See an example
  2. As you noted, Prometheus also isn't really designed to work with infrequently changing data. The effect is that Prometheus reports metrics as "stale" when no data is presented to be scraped within a given (configurable) interval. This makes it challenging to use Prometheus with ON_CHANGE subscriptions for data that changes rarely.
  3. There is a trick for (2): I think it is done by setting the expiration to -1 and enabling the cache (https://gnmic.openconfig.net/user_guide/outputs/prometheus_output/#expiration), but Karim might know better.

mathieudevos commented 7 months ago

Hey,

Cheers for the quick reply! For the time being I've switched to InfluxDB to avoid the first issue. The other ones were indeed fixed by setting the cache expiration to -1 (I believe -1s didn't work, but -1 does) and setting a cache flush timer.
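
As a rough illustration of why a negative expiration helps, the sketch below models a metric cache in Python. It is a toy model under stated assumptions, not gNMIc's implementation: the class `MetricCache`, its methods, and the rule "negative expiration disables expiry" are written here only to mirror the behaviour described in this thread.

```python
import time


class MetricCache:
    """Keeps the last value per metric name.

    Entries older than `expiration` seconds are dropped at scrape time.
    A negative `expiration` disables expiry (assumption modelled on the thread).
    """

    def __init__(self, expiration: float):
        self.expiration = expiration
        self._entries = {}  # name -> (value, time stored)

    def write(self, name, value, now=None):
        self._entries[name] = (value, time.time() if now is None else now)

    def scrape(self, now=None):
        now = time.time() if now is None else now
        if self.expiration >= 0:
            # Drop everything that has not been refreshed recently enough.
            self._entries = {
                name: (value, stored)
                for name, (value, stored) in self._entries.items()
                if now - stored <= self.expiration
            }
        return {name: value for name, (value, _) in self._entries.items()}


# An ON_CHANGE value written once at t=0 and never updated again.
expiring = MetricCache(expiration=60)
expiring.write("oper_status", 0.0, now=0)

keep_forever = MetricCache(expiration=-1)
keep_forever.write("oper_status", 0.0, now=0)
```

With the 60-second cache, a scrape at `now=120` returns nothing, which is the "metric never shows up" symptom; the cache with expiration -1 still returns the last known value no matter how late the scrape happens.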

I think the final issue we're now having should be fixable by overwriting the timestamps: we want the timestamp of when the information was last read, not of when it last changed. I'll attempt to override the timestamp handling.
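
The idea of stamping an event with the read time rather than the change time can be sketched in Python as follows. This is only an illustration of the intent, assuming events shaped like the file output earlier in the thread (a nanosecond `timestamp` plus a derived `time` string); the helper `with_read_timestamp` is hypothetical and says nothing about how gNMIc itself handles timestamps.

```python
import time
from datetime import datetime, timezone


def with_read_timestamp(event: dict, now_ns=None) -> dict:
    """Return a copy of `event` stamped with the time it was read."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    secs, nanos = divmod(now_ns, 10**9)
    stamped = dict(event)  # shallow copy; the original event is left untouched
    stamped["timestamp"] = now_ns
    # Rebuild the human-readable field so both timestamps stay consistent.
    stamped["time"] = (
        datetime.fromtimestamp(secs, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
        + f".{nanos:09d}Z"
    )
    return stamped


# Hypothetical event whose value last changed long before it is read.
old_event = {
    "timestamp": 1700000000000000000,
    "target": "switchAserial",
    "updates": [],
}
restamped = with_read_timestamp(old_event, now_ns=1704887357029284763)
```

The trade-off is that the stored series then records when the value was observed and loses the time of the actual state change, which is what the request above asks for.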

Thank you for the quick suggestions, they've been most helpful with obtaining the data I'm looking for.