openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0

Performance optimisation for prometheus output #498

Closed: aned closed this issue 3 weeks ago

aned commented 1 month ago

I'm testing this config file: https://gist.github.com/aned/8b68e77791dc3bb9eeda903ce54e1643. After adding ~30 targets, I'm seeing some pretty heavy load on the server. Are there any obvious improvements I can make in the config?

In this section

  lldp:
    paths:
       - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
       - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
    stream-mode: sample
    sample-interval: 30s
    heartbeat-interval: 30s
    updates-only: false

I'm caching lldp; it doesn't change much, so there's no need for 30s updates. Would it break things if sample-interval is set to something like 1h, or does it need to be done via cache expiration?

karimra commented 1 month ago

I see you went all out with the processors :)

The first obvious change I would make is moving drop-metrics up in the list of processors under the output. If the event messages are going to be dropped, they don't need to travel down the processor pipeline, unless you are using them to enrich other values (I didn't see that in the starlark processors).
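For illustration, the processor order under a prometheus output is simply the order of its event-processors list, so drop-metrics would go first. A minimal sketch; the output name and listen address are illustrative, the processor names are the ones from your config:

outputs:
  prom-output:
    type: prometheus
    listen: :9273
    event-processors:
      # drop unwanted events first, before any regex-based processing runs
      - drop-metrics
      - rename-metrics
      - rename-labels-interface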

As for the lldp subscription, if it's not going to change much, use an on-change subscription (if the router supports it):

  lldp:
    paths:
       - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
       - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
    stream-mode: on-change

How much of a heavy load are we talking about? I see you enabled the api-server metrics and you have a Prometheus server.

api-server:
  address: :7890
  enable-metrics: true

Do you have a target definition for gnmic:7890/metrics? That way we will be able to see how much (and what) is being used.
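If not, a minimal scrape job could look something like this (the job name is illustrative, assuming Prometheus can reach the gnmic API server at gnmic:7890):

scrape_configs:
  - job_name: gnmic
    static_configs:
      - targets:
          - gnmic:7890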

A few more optimisations:

1) This processor matches ALL your events, and runs the old regex on all of their values.

  rename-metrics:
    event-strings:
      value-names:
        - ".*"
      transforms:
        - replace:
            apply-on: "name"
            old: "interfaces/interface/.*/description"
            new: "ifAlias"

If you know exactly what you are going to replace, set it in the matching section, not only in the transform:

  rename-metrics:
    event-strings:
      value-names:
        - "interfaces/interface/.*/description"
      transforms:
        - replace:
            apply-on: "name"
            old: "interfaces/interface/.*/description"
            new: "ifAlias"

2)

Same for this processor.

  rename-metrics-arista-ngbb:
    event-strings:
      value-names:
        - ".*"
      transforms:
        - replace:
            apply-on: "name"
            old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
            new: "PacketLossAristaXBR"
        - replace:
            apply-on: "name"
            old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
            new: "LatencyAristaXBR"
        - replace:
            apply-on: "name"
            old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/jitter"
            new: "JitterAristaXBR"
        - replace:
            apply-on: "name"
            old: ".*meminfo/memTotal"
            new: "MemTotalAristaXBR"
        - replace:
            apply-on: "name"
            old: ".*meminfo/memAvailable"
            new: "MemAvailableAristaXBR"
        - replace:
            apply-on: "name"
            old: "/queues/queue"
            new: "_queue"
        - trim-prefix:
            apply-on: "name"
            prefix: "/interfaces"
        - trim-prefix:
            apply-on: "name"
            prefix: "/qos/interfaces"

All these transforms are independent of each other. The transforms in a single event-strings processor are applied to all the event messages in sequence, so I would create separate processors for each one.
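For example, the first two renames above could each become their own processor, matching only the values they rewrite (the processor names are illustrative):

  rename-packetloss-arista:
    event-strings:
      value-names:
        - ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
      transforms:
        - replace:
            apply-on: "name"
            old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
            new: "PacketLossAristaXBR"

  rename-latency-arista:
    event-strings:
      value-names:
        - ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
      transforms:
        - replace:
            apply-on: "name"
            old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
            new: "LatencyAristaXBR"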

3)

In this processor, the old tags are well known

rename-labels-interface:
  event-strings:
    tag-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"

I would place them in the tag-names field, or even create a processor for each one (a sketch of the per-processor option follows the next block).

rename-labels-interface:
    event-strings:
      tag-names:
        - "source"
        - "interface_name"
        - ".*interface-id"
      transforms:
        - replace:
            apply-on: "name"
            old: "source"
            new: "alias"
        - replace:
            apply-on: "name"
            old: "interface_name"
            new: "ifName"
        - replace:
            apply-on: "name"
            old: ".*interface-id"
            new: "ifName"

There are a couple more processors like this; I think you get the idea. You can save a lot by skipping a few regex evaluations (across 30 routers).

aned commented 1 month ago

Got it, thanks for the input! I went from 17 to 32 targets and updated as suggested above. The load seems reasonable; I'll keep tweaking, but it's got potential!

[image: dashboard screenshot]

aned commented 1 month ago

How does num-workers: 5 affect things in the outputs configuration?

karimra commented 1 month ago

How does num-workers: 5 affect things in the outputs configuration?

It defines the number of parallel routines reading gNMI notifications from the target's buffer and converting them into Prometheus metrics. It's meant to help deal with a high rate of notifications. Looking at the dashboards you shared, I think you might benefit from more workers; it would reduce the total number of goroutines you have running.
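
For reference, this is where num-workers sits in a prometheus output definition; the output name and listen address are illustrative:

outputs:
  prom-output:
    type: prometheus
    listen: :9273
    # number of parallel routines converting notifications into metrics
    num-workers: 10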

aned commented 1 month ago

Understood, I bumped it to 10 and I'm seeing some marginal improvements. What's the "recommended" number of workers, or how do I find the optimal number?

karimra commented 1 month ago

There is no recommended number, really; it depends on the pattern (rate and size) of the updates you are getting. I would aim at lowering the number of running goroutines and keeping it stable over multiple sample intervals. The optimal number also depends on whether you are optimizing for memory or CPU. If you want to reduce memory usage, add more workers so that notifications are not sitting in memory waiting to be processed. If you want to reduce CPU usage, reduce the number of workers, but you will use more memory.

aned commented 1 month ago

Got it! In terms of monitoring targets (ones gnmic can't subscribe to due to auth issues, potential ACL issues, etc.), from what I can see in the API /metrics endpoint

api-server:
  address: :7890
  enable-metrics: true

I could only use something like

sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{ job=~"$job_name"}[2m])) ==0

but it looks like this metrics disappears for a specific source if gnmic can't connect to it anymore. How do you folks monitor it?

This could also be used:

rate(grpc_client_handled_total{job=~"$job_name"}[2m]) > x 

but it doesn't tell me which target is erroring.

karimra commented 1 month ago

Currently, this is your best bet:

sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{ job=~"$job_name"}[2m])) ==0

Can't that metric default to zero if it's not returned?

aned commented 1 month ago

No, once the box becomes gNMI-unreachable, all those metrics disappear; they don't become 0. It would only work if the gNMI connection stays up.

karimra commented 4 weeks ago

There is some sort of temporary workaround here: https://github.com/openconfig/gnmic/issues/419#issuecomment-2288642468

aned commented 3 weeks ago

Raised a feature request: #513.