Prometheus metrics issue for string values

aned commented 1 month ago

I'm using this config

targets:
  blah.com:
    blah.com:6030

subscriptions:
  sub1:
    paths:
      - "/interfaces/interface[name=*]/openconfig-interfaces:config"
    stream-mode: sample
    sample-interval: 10s
    heartbeat-interval: 60s
    updates-only: false

outputs:
  prometheus:
    type: prometheus
    listen: :9273
    append_subscription_name: false 
    strings_as_labels: true

but I'm getting only 3 prometheus metrics out

interfaces_interface_config_enabled
interfaces_interface_config_load_interval
interfaces_interface_config_mtu

Found this https://github.com/openconfig/gnmic/issues/343#issuecomment-1883649487 and I'm trying to understand how to get string values into Prometheus as labels. For example, this is the raw get output from that path:

./gnmi -addr blah.com:6030  get "/interfaces/interface[name=*]/openconfig-interfaces:config"
/interfaces/interface[name=Ethernet1/1]/config:
{
  "openconfig-interfaces:description": "Some_Interface_Desction_String_Here",
  "arista-intf-augments:load-interval": 30,
  "openconfig-interfaces:mtu": 0,
  "openconfig-interfaces:name": "Ethernet1/1",
  "openconfig-interfaces:type": "iana-if-type:ethernetCsmacd"
}
/interfaces/interface[name=Ethernet2/1]/config:
{
  "openconfig-interfaces:description": "[AVAILABLE]",
  "openconfig-interfaces:enabled": false,
  "openconfig-interfaces:mtu": 0,
  "openconfig-interfaces:name": "Ethernet2/1",
  "openconfig-interfaces:type": "iana-if-type:ethernetCsmacd"
}

Is there a workaround to get string values as labels into prometheus, something like this (for the example above):

interfaces_interface_config_description{interface_name="Ethernet1", source="blah.com", ifDescription="Some_Interface_Desction_String_Here" } 1

interfaces_interface_config_type{interface_name="Ethernet1", source="blah.com", ifType="iana-if-type:ethernetCsmacd" } 1

Basically the string values need to go in as labels and the actual prometheus value should be set to 1. In many cases, the metrics without those strings as labels are useless, for example without the interface description I can't do any proper query in prometheus, matching only specific interface descriptions, etc.

karimra commented 1 month ago

I think you have a typo in the config, it's append-subscription-name and strings-as-labels not append_subscription_name and strings_as_labels. The output you shared is not from gNMIc, that's another tool.
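For reference, the output section from the first post would look roughly like this with the hyphenated keys. This is a sketch that only changes the two key names mentioned above; everything else is copied from the original config:

```yaml
outputs:
  prometheus:
    type: prometheus
    listen: :9273
    append-subscription-name: false  # hyphens, not underscores
    strings-as-labels: true          # hyphens, not underscores
```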

aned commented 1 month ago

Oh yeah, my bad, LLMs hallucinating big time :(. With strings-as-labels: true it works as expected. However, I'm still seeing subscription_name="sub1" in the metrics even though append-subscription-name is set to false.

aned commented 1 month ago

Are there any good config examples documented somewhere for networking boxes?

karimra commented 1 month ago

append-subscription-name appends the subscription name to the metric name, not as a label.

The subscription-name label is added by default. If you want to drop it, you can use an event-delete processor:

processors:
  delete-processor: # <- processor name
    event-delete: # <- processor type
      tag-names:
        - "subscription-name" # <- delete tags called `subscription-name`

outputs:
  prometheus:
    type: prometheus
    listen: :9273
    strings-as-labels: true
    event-processors:
      - delete-processor # <- processor name

aned commented 1 month ago

Understood, thank you @karimra!

aned commented 1 month ago

Sorry for bugging you again @karimra, would you have any pointers as to why I can't add the interface description as a label here? This is the raw output:

{
  "openconfig-interfaces:admin-status": "DOWN",
  "openconfig-interfaces:counters": {
    "carrier-transitions": "1",
    "in-broadcast-pkts": "0",
    "in-discards": "0",
    "in-errors": "0",
    "in-fcs-errors": "0",
    "in-multicast-pkts": "0",
    "in-octets": "0",
    "in-pkts": "0",
    "in-unicast-pkts": "0",
    "out-broadcast-pkts": "0",
    "out-discards": "0",
    "out-errors": "0",
    "out-multicast-pkts": "0",
    "out-octets": "0",
    "out-pkts": "0",
    "out-unicast-pkts": "0"
  },
  "openconfig-interfaces:description": "[AVAILABLE]",
  "openconfig-interfaces:enabled": false,
  "openconfig-platform-port:hardware-port": "Ethernet36-Port",
  "openconfig-interfaces:ifindex": 36001,
  "arista-intf-augments:inactive": false,
  "openconfig-interfaces:last-change": "1710978986275730133",
  "openconfig-interfaces:management": false,
  "openconfig-interfaces:mtu": 0,
  "openconfig-interfaces:name": "Ethernet36/1",
  "openconfig-interfaces:oper-status": "NOT_PRESENT",
  "openconfig-platform-transceiver:transceiver": "Ethernet36",
  "openconfig-interfaces:type": "iana-if-type:ethernetCsmacd"
}

I'm trying to add interface description to all related counters via this config

outputs:
  prometheus:
    type: prometheus
    listen: :9273
    append-subscription-name: false
    strings-as-labels: true # Converts string values to labels
    timeout: 30s
    event-processors:
      - add-int-description

processors:
  add-int-description:
    event-value-tag:
      value-name: "/interfaces/interface/state/description"
      tag-name: "ifAlias"
      consume: false
      debug: false

but none of the related counter metrics get a label ifAlias.

aned commented 1 month ago

Figured out that setting the cache expiration to -1 "fixes" it, but I'm not sure it's the correct approach. I see a lot of stale warnings in the debug mode.

2024/07/24 06:12:07.070613 /home/runner/work/gnmic/gnmic/pkg/cache/oc_cache.go:155: [cache:oc] failed to update gNMI cache: update is stale

Lets say this is my testing config (with some more metrics added eventually), would it scale for 100s or 1000s of targets running gnmic from a single beefy node? How can I get the up metric per target, similar to what prometheus generates? Since this gnmic would be a single endpoint generating metrics for 100s of boxes, how would I know if any of the targets are down?

targets:
  blah.com:
    address: blah.com:6030

subscriptions:
  sub1:
    paths:
       - "qos/interfaces/interface/output/queues/queue/state"
       - "/interfaces/interface[name=*]/state"
    stream-mode: sample
    sample-interval: 10s
    heartbeat-interval: 60s
    updates-only: false

outputs:
  prometheus:
    type: prometheus
    listen: :9273
    append-subscription-name: false
    strings-as-labels: true # Converts string values to labels
    timeout: 30s
    cache:
      type: oc
      expiration: -1
    event-processors:
      - rename-labels
      - add-int-description
      - delete-labels
      - rename-metrics
      - drop-metrics

processors:
  add-int-description:
    event-value-tag:
      value-name: "/interfaces/interface/state/description"
      tag-name: "ifAlias"
      consume: false  # if true, remove value from original event when copying
      debug: false

  delete-labels:
    event-delete:
      tag-names:
        - "subscription-name"
        - "target"

  drop-metrics:
    event-drop:
      value-names:
        - ".*/state/(type|transceiver|physical.*|mtu|management|inactive|ifindex|hardware.*|enabled)"

  rename-metrics:
    event-strings:
      value-names:
        - ".*"
      transforms:
        - replace:
            apply-on: "name"
            old: "interfaces/interface/.*/description"
            new: "ifAlias"

  rename-labels:
    event-strings:
      tag-names:
        - ".*"
      transforms:
        - replace:
            apply-on: "name"
            old: "source"
            new: "alias"
        - replace:
            apply-on: "name"
            old: "interface_name"
            new: "ifName"
        - replace:
            apply-on: "name"
            old: ".*interface-id"
            new: "ifName"

karimra commented 1 month ago

I see a lot of stale warnings in the debug mode.

Your router is sending notifications with old timestamps or the router and the machine where gNMIc is running are not in sync.

Lets say this is my testing config (with some more metrics added eventually), would it scale for 100s or 1000s of targets running gnmic from a single beefy node?

Vague question: how many more metrics, how many paths, with which encoding and which sample interval? How "beefy" is a "beefy" node? Only testing within your env will tell. It's probably easier and more manageable to run a cluster of gNMIc instances if you are going to subscribe to 1000s of routers, check here
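As a rough illustration of what a clustered setup involves, each gNMIc instance would carry a clustering section along these lines. This is a sketch only: the key names follow my reading of the gNMIc clustering documentation and should be verified against your version, and the cluster name, instance name and Consul address are hypothetical placeholders.

```yaml
# Sketch only, not a tested config.
api-server:
  address: :7890                      # cluster members talk to each other over the API
clustering:
  cluster-name: lab-cluster           # hypothetical; identical on every instance
  instance-name: gnmic-1              # hypothetical; unique per instance
  locker:
    type: consul                      # assumes a Consul agent is available for locking
    address: consul.example.net:8500  # hypothetical address
```

The idea is that the instances share one target list and divide the targets among themselves, so no single node has to hold every subscription.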

How can I get the up metric per target, similar to what prometheus generates?

You can't.

Since this gnmic would be a single endpoint generating metrics for 100s of boxes, how would I know if any of the targets are down?

gNMIc generates some internal metrics that can be scraped by Prometheus, you will be able to tell how many gNMI clients are connected, how many subscriptions there are per client, how many messages were received per client/per subscription.

Enable them under api-server

api-server:
  address: :7890
  enable-metrics: true

The endpoint is /metrics
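On the Prometheus side, that endpoint can be scraped as a separate job next to the telemetry output. The fragment below is a minimal sketch: the job names and the host name gnmic-host are hypothetical, and the ports are the ones used in the configs in this thread.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: gnmic-internal    # hypothetical job name: gNMIc's own metrics
    metrics_path: /metrics
    static_configs:
      - targets:
          - gnmic-host:7890     # api-server address from the config above
  - job_name: gnmic-telemetry   # hypothetical job name: metrics from the prometheus output
    static_configs:
      - targets:
          - gnmic-host:9273     # listen address of the prometheus output
```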

And again, this is not a raw output:

{
  "openconfig-interfaces:admin-status": "DOWN",
  "openconfig-interfaces:counters": {
    "carrier-transitions": "1",
    "in-broadcast-pkts": "0",
    "in-discards": "0",
    "in-errors": "0",
    "in-fcs-errors": "0",
    "in-multicast-pkts": "0",
    "in-octets": "0",
    "in-pkts": "0",
    "in-unicast-pkts": "0",
    "out-broadcast-pkts": "0",
    "out-discards": "0",
    "out-errors": "0",
    "out-multicast-pkts": "0",
    "out-octets": "0",
    "out-pkts": "0",
    "out-unicast-pkts": "0"
  },
  "openconfig-interfaces:description": "[AVAILABLE]",
  "openconfig-interfaces:enabled": false,
  "openconfig-platform-port:hardware-port": "Ethernet36-Port",
  "openconfig-interfaces:ifindex": 36001,
  "arista-intf-augments:inactive": false,
  "openconfig-interfaces:last-change": "1710978986275730133",
  "openconfig-interfaces:management": false,
  "openconfig-interfaces:mtu": 0,
  "openconfig-interfaces:name": "Ethernet36/1",
  "openconfig-interfaces:oper-status": "NOT_PRESENT",
  "openconfig-platform-transceiver:transceiver": "Ethernet36",
  "openconfig-interfaces:type": "iana-if-type:ethernetCsmacd"
}

I'm not sure where you are getting it from, but it does not tell me what gNMI messages the router is sending. It looks like you are running a Get RPC to print the output; a Subscribe RPC may send the data in separate notifications. That's why the processor couldn't add the interface description as a label: it received the stats and the description in separate notifications. Enabling caching is probably overkill for that. Check this processor instead: https://gnmic.openconfig.net/user_guide/event_processors/event_starlark/#set-an-interface-description-as-a-tag
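Adapted to the paths in this thread, the approach from that page would look something like the sketch below. It has not been tested against a device and rests on several assumptions: the value name is the one used in the event-value-tag config earlier, the events carry the tags source and interface_name (so the processor must run before any label-renaming processor), and ifAlias is simply the tag name chosen earlier in this thread.

```yaml
processors:
  add-int-description:
    event-starlark:
      source: |
        # Descriptions seen so far, keyed by "<source>_<interface_name>".
        cache = {}

        def apply(*events):
          # First pass: remember any description found in this batch.
          for e in events:
            desc = e.values.get("/interfaces/interface/state/description")
            if desc and e.tags.get("source") and e.tags.get("interface_name"):
              cache[e.tags["source"] + "_" + e.tags["interface_name"]] = desc
          # Second pass: tag every event whose interface has a known description.
          for e in events:
            if e.tags.get("source") and e.tags.get("interface_name"):
              key = e.tags["source"] + "_" + e.tags["interface_name"]
              if cache.get(key):
                e.tags["ifAlias"] = cache[key]
          return events
```

Because the description is kept in the script's own dictionary, counters that arrive in a later notification can still be tagged without enabling the output cache.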

peejaychilds commented 1 month ago

delete-labels:
  event-delete:
    tag-names:
      - "subscription-name"
      - "target"

If you remove the 'target' tag, you won't be able to tell which target the metrics are for?

aned commented 1 month ago

I see a lot of stale warnings in the debug mode.

Your router is sending notifications with old timestamps or the router and the machine where gNMIc is running are not in sync.

Without cache expiration: -1 i don't get any stale warnings.

Lets say this is my testing config (with some more metrics added eventually), would it scale for 100s or 1000s of targets running gnmic from a single beefy node?

Vague question: how many more metrics, how many paths, with which encoding and which sample interval? How "beefy" is a "beefy" node? Only testing within your env will tell. It's probably easier and more manageable to run a cluster of gNMIc instances if you are going to subscribe to 1000s of routers, check here

Probably ~2k total metrics per device, ~10 paths, 30s scraping interval (how do I check which encoding is being used and how does it play a role?). The node is an AMD box with 256G of RAM and 12 CPUs. Thanks for the HA reference, I will check it out; I'm still in the testing stage.

How can I get the up metric per target, similar to what prometheus generates?

You can't.

Since this gnmic would be a single endpoint generating metrics for 100s of boxes, how would I know if any of the targets are down?

gNMIc generates some internal metrics that can be scraped by Prometheus, you will be able to tell how many gNMI clients are connected, how many subscriptions there are per client, how many messages were received per client/per subscription.

Enable them under api-server

api-server:
  address: :7890
  enable-metrics: true

The endpoint is /metrics

Thank you!

And again, this is not a raw output:

{
  "openconfig-interfaces:admin-status": "DOWN",
  "openconfig-interfaces:counters": {
    "carrier-transitions": "1",
    "in-broadcast-pkts": "0",
    "in-discards": "0",
    "in-errors": "0",
    "in-fcs-errors": "0",
    "in-multicast-pkts": "0",
    "in-octets": "0",
    "in-pkts": "0",
    "in-unicast-pkts": "0",
    "out-broadcast-pkts": "0",
    "out-discards": "0",
    "out-errors": "0",
    "out-multicast-pkts": "0",
    "out-octets": "0",
    "out-pkts": "0",
    "out-unicast-pkts": "0"
  },
  "openconfig-interfaces:description": "[AVAILABLE]",
  "openconfig-interfaces:enabled": false,
  "openconfig-platform-port:hardware-port": "Ethernet36-Port",
  "openconfig-interfaces:ifindex": 36001,
  "arista-intf-augments:inactive": false,
  "openconfig-interfaces:last-change": "1710978986275730133",
  "openconfig-interfaces:management": false,
  "openconfig-interfaces:mtu": 0,
  "openconfig-interfaces:name": "Ethernet36/1",
  "openconfig-interfaces:oper-status": "NOT_PRESENT",
  "openconfig-platform-transceiver:transceiver": "Ethernet36",
  "openconfig-interfaces:type": "iana-if-type:ethernetCsmacd"
}

I'm not sure where you are getting it from, but it does not tell me what gNMI messages the router is sending. It looks like you are running a Get RPC to print the output; a Subscribe RPC may send the data in separate notifications. That's why the processor couldn't add the interface description as a label: it received the stats and the description in separate notifications. Enabling caching is probably overkill for that. Check this processor instead: https://gnmic.openconfig.net/user_guide/event_processors/event_starlark/#set-an-interface-description-as-a-tag

Yes, that was the output of the gnmi get CLI; I thought it would be the same as what subscribe sends. I will definitely check out event_starlark, since caching seems to eat up way more resources.

aned commented 1 month ago

delete-labels:
  event-delete:
    tag-names:
      - "subscription-name"
      - "target"

If you remove the 'target' you won't be able to tell which target the metrics are for ?

Some metrics had the same value in the source and target labels, so target seemed like a duplicate. I will watch whether dropping it causes any issues.

hellt commented 1 month ago

@aned in your pastings you had

./gnmi -addr blah.com:6030  get "/interfaces/interface[name=*]/openconfig-interfaces:config"

Note that this tool, ./gnmi, is not gnmic. Output from this tool doesn't help to correlate what you get from it with gnmic's output. I suggest you use gnmic only.

karimra commented 1 month ago

Without cache expiration: -1 i don't get any stale warnings.

Do you mean "With"? -1 disables cache expiration, so it disables timestamp checking; it doesn't mean the stale issue is gone. By setting expiration: -1 you will get old values at every scrape (e.g. deleted interfaces).
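If the cache is kept at all, a finite expiration avoids serving values for interfaces that no longer exist. The fragment below is a sketch; the 60s value is an assumption chosen to sit well above the 10s sample interval used in this thread, not a recommendation.

```yaml
outputs:
  prometheus:
    type: prometheus
    listen: :9273
    strings-as-labels: true
    cache:
      type: oc
      expiration: 60s  # assumed value; entries not refreshed within this window are dropped
```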

Probably ~2k total metrics per device, ~10 paths, 30s scraping interval.

What I mean here is that everyone's case is different, it's hard to give a scaling reference. There is no way to know how many individual single metrics are behind a path, there could be multiple containers and lists behind a path.

how do I check which encoding is being used and how does it play a role?

ref: https://github.com/openconfig/reference/blob/master/rpc/gnmi/gnmi-specification.md#23-structured-data-types

If you didn't specify an encoding, it uses JSON by default. The encoding determines the type of the returned values. JSON and JSON_IETF allow the router to send values together as a JSON object (potentially lower message rate, higher message size), while other encodings send each value as a single update; multiple updates can be bundled in a single SubscribeResponse (potentially higher message rate, lower message size). You should check what works best for your devices and the metrics you are collecting.
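To compare encodings, the encoding can be set explicitly on the subscription. This is a sketch based on the subscription from earlier in the thread; json_ietf is just one value to try, and which encodings are accepted depends on the device.

```yaml
subscriptions:
  sub1:
    paths:
      - "/interfaces/interface[name=*]/state"
    stream-mode: sample
    sample-interval: 10s
    encoding: json_ietf  # assumption: swap in json or proto and compare message rate and size
```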

Start small, add a few devices and extrapolate from there.

aned commented 1 month ago

Understood, thank you for your help guys!

openconfig / gnmic

Prometheus metrics issue for string values #492