openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0
170 stars 55 forks source link

Starlark LLDP script breaks prometheus output #495

Closed aned closed 1 month ago

aned commented 1 month ago

I'm probably doing something wrong in this script, it's not a bug with the implementation. This is the config I'm trying to use, currently it just breaks prometheus's /metrics

subscriptions:
  sub1:
    paths:
       - "/lldp/state/system-name"
       - "/lldp/interfaces/interface/state/name"
       - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
       - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
    stream-mode: sample
    sample-interval: 30s
    heartbeat-interval: 30s
    updates-only: false

outputs:
  prometheus:
    type: prometheus
    listen: :9273
    append-subscription-name: false
    strings-as-labels: true 
    timeout: 30s
    event-processors:
      - starlark-add-lldp-info

processors:
  starlark-add-lldp-info:
    event-starlark:
      source: |
          cache = {}

          def apply(*events):
              evs = []
              for e in events:
                  local_system_name_path = "/lldp/state/system-name"
                  local_port_id_path = "/lldp/interfaces/interface/state/name"
                  remote_system_name_path = "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
                  remote_port_id_path = "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"

                  if e.values.get(local_system_name_path) and e.values.get(local_port_id_path) and e.values.get(remote_system_name_path) and e.values.get(remote_port_id_path):
                      local_device_lldp_name = e.values[local_system_name_path]
                      local_port_id = e.values[local_port_id_path]
                      remote_device_lldp_name = e.values[remote_system_name_path]
                      remote_port_id = e.values[remote_port_id_path]

                      # Create a sorted lldpName label and use cache
                      local_key = f"{local_device_lldp_name}:{local_port_id}"
                      remote_key = f"{remote_device_lldp_name}:{remote_port_id}"

                      if local_key not in cache and remote_key not in cache:
                          if local_device_lldp_name < remote_device_lldp_name:
                              lldp_name = f"{local_device_lldp_name}:{local_port_id}:{remote_device_lldp_name}:{remote_port_id}"
                          else:
                              lldp_name = f"{remote_device_lldp_name}:{remote_port_id}:{local_device_lldp_name}:{local_port_id}"
                          cache[local_key] = lldp_name
                          cache[remote_key] = lldp_name
                      else:
                          lldp_name = cache.get(local_key) or cache.get(remote_key)

                      new_event = e.clone()
                      new_event.name = "lldp_local_mapping"
                      new_event.values = {"lldp_local_mapping": 1}
                      new_event.tags["lldpName"] = lldp_name
                      new_event.tags["lldpRemSysName"] = remote_device_lldp_name
                      new_event.tags["lldpLocalSysName"] = local_device_lldp_name
                      new_event.tags["lldpLocalPortId"] = local_port_id

                      evs.append(new_event)
                  else:
                      evs.append(e)
              return evs

The idea in a nutshell is to create a new metric in a format

lldp_local_mapping with the following labels: 

- lldpName = {local_device_lldp_name}:{local_port_id}:{remote_device_lldp_name}:{remote_port_id}"
- lldpRemSysName
- lldpLocalSysName
- lldpLocalPortId

lldpName needs to be sorted by name on creation, so it's the same on both devices: local and remote. This lldpName label will be used to correlate device/port on both sides for alerts, etc.

Any pointers to what's wrong with the script above?

peejaychilds commented 1 month ago
              if e.values.get(local_system_name_path) and e.values.get(local_port_id_path) and e.values.get(remote_system_name_path) and e.values.get(remote_port_id_path):

Are you expecting the event has all these values in the same event?

On a Juniper QFX5130 with this set of subscription paths each item seems to come as its own event?

When you says it 'breaks' /metrics can you provide more information?

You can add some tracing in your starlark which can output on the console for assistance in debugging?

ie

print(e)

I'm not sure I have tried e.clone() to create a new event, but I have tried something like new_event = Event("enhanced_interface_counters", e.timestamp, e.tags)

peejaychilds commented 1 month ago

This worked for me ... you might need to rework it but the idea being you cache the stuff you need until you have all of it (because the neighbor port-id and neighbor system-name are in different events) before you create your new event

logging = True
cache = {}

def apply(*events):
  # events we return
  ret_events = []

  for e in events:
    if e.name == 'lldp':
      if logging:
        print(e)
      # remove port part of target
      hostname = e.tags['source'].replace( ':9339', '' )
      if e.tags.get('neighbor_id'):
        cache_key = hostname + '_' + e.tags['interface_name'] + '_' + e.tags['neighbor_id']
        if cache_key in cache:
          if logging:
            print( 'cache hit for ' + cache_key )
        else:
          if logging:
            print( '** new cache slot for ' + cache_key )
          cache[cache_key] = {}

        if e.values.get( '/lldp/interfaces/interface/neighbors/neighbor/state/port-id' ):
          cache[cache_key]['remote-port'] = e.values.get( '/lldp/interfaces/interface/neighbors/neighbor/state/port-id' )
        if e.values.get( '/lldp/interfaces/interface/neighbors/neighbor/state/system-name' ):
          cache[cache_key]['remote-name'] = e.values.get( '/lldp/interfaces/interface/neighbors/neighbor/state/system-name' )
        cache[cache_key]['local-name'] = hostname
        cache[cache_key]['local-port'] = e.tags['interface_name']

        if 'remote-name' in cache[cache_key] and 'remote-port' in cache[cache_key]:
          new_event = Event("lldp_inventory", e.timestamp )
          for key,value in cache[cache_key].items():
            new_event.tags[key] = value
          new_event.values = { 'lldp_item' : 1 }
          ret_events.append( new_event )
          if logging:
            print( new_event )
    else:
      ret_events.append( e )
  return ret_events

gives me

# HELP lldp_item gNMIc generated metric
# TYPE lldp_item untyped
lldp_item{local_name="bl01.laba.lab",local_port="et-0/0/0",remote_name="sp01.01.laba",remote_port="et-0/0/28"} 1
lldp_item{local_name="bl01.laba.lab",local_port="et-0/0/1",remote_name="sp02.01.laba",remote_port="et-0/0/28"} 1
lldp_item{local_name="bl01.laba.lab",local_port="et-0/0/28:0",remote_name="dr03.laba",remote_port="et-2/2/3"} 1
lldp_item{local_name="bl01.laba.lab",local_port="et-0/0/28:2",remote_name="dr04.laba",remote_port="et-2/2/3"} 1
lldp_item{local_name="bl01.laba.lab",local_port="re0:mgmt-0",remote_name="rsw01.r1.05.laba",remote_port="ge-0/0/14"} 1
aned commented 1 month ago

Thanks @peejaychilds for looking into it. Got your point about regarding caching the stuff I need until I have all of it for processing. Trying to troubleshoot (you script doesn't break things but doesn't return lldp_item), where does it send the logs? I'm running it in cli ./gnmic --config gnmic-config.yaml subscribe --tls-cert ./1.pem --tls-key ./1-key.pem --skip-verify -d but it's a bit too much gibberish and I don't see any logs from the script itself.

peejaychilds commented 1 month ago

I believe it is the log: true option that makes output to stdout

My gnmic.yaml has

skip-verify: true
encoding: json
log: true

loader:
  type: file
  # path to the file
  path: /opt/gnmi/targets/targets.yaml
  enable-metrics: true

subscriptions:
  lldp:
    paths:
      - /lldp/interfaces/interface/neighbors/neighbor/state/system-name
      - /lldp/interfaces/interface/neighbors/neighbor/state/port-id
    stream-mode: sample
    sample-interval: 60s

outputs:
  prometheus:
    type: prometheus
    listen: :8080
    append-subscription-name: false
    strings-as-labels: true
    timeout: 30s
    event-processors:
      - proc-test

processors:
  proc-test:
    event-starlark:
      script: /opt/gnmi/test.star

I run gnmic --config /opt/gnmi/gnmic.yaml subscribe and see things like

...
2024/07/30 00:09:13.940595 [event-starlark] print(): {"name":"lldp","timestamp":1722298153765731172,"tags":{"interface_name":"re0:mgmt-0","neighbor_id":"50:c7:09:a7:c3:f1-ge-0/0/17","source":"sp01.01.laba.lab:9339","subscription-name":"lldp"},"values":{"/lldp/interfaces/interface/neighbors/neighbor/state/port-id":"ge-0/0/17"}}
2024/07/30 00:09:13.940620 [event-starlark] print(): cache hit for sp01.01.laba.lab_re0:mgmt-0_50:c7:09:a7:c3:f1-ge-0/0/17
2024/07/30 00:09:13.940664 [event-starlark] print(): {"name":"lldp_inventory","timestamp":1722298153765731172,"tags":{"local-name":"sp01.01.laba.lab","local-port":"re0:mgmt-0","remote-name":"rsw01.r1.05.laba","remote-port":"ge-0/0/17"},"values":{"lldp_item":1}}
...
aned commented 1 month ago

Thanks, now I have something to work with!

aned commented 1 month ago

This seems to do the trick:

  starlark-add-lldp-info-2:
    event-starlark:
      source: |
        logging = True
        cache = {}

        def apply(*events):
          # events we return
          ret_events = []

          for e in events:
            if e.name == 'lldp':
              if logging:
                print(e)
              # remove port part of target
              hostname = e.tags['source'].replace(':6030', '')
              if e.tags.get('neighbor_id'):
                cache_key = hostname + '_' + e.tags['interface_name'] + '_' + e.tags['neighbor_id']
                if cache_key in cache:
                  if logging:
                    print('cache hit for ' + cache_key)
                else:
                  if logging:
                    print('** new cache slot for ' + cache_key)
                  cache[cache_key] = {}

                if e.values.get('/lldp/interfaces/interface/neighbors/neighbor/state/port-id'):
                  cache[cache_key]['remote-port'] = e.values.get('/lldp/interfaces/interface/neighbors/neighbor/state/port-id')
                if e.values.get('/lldp/interfaces/interface/neighbors/neighbor/state/system-name'):
                  cache[cache_key]['remote-name'] = e.values.get('/lldp/interfaces/interface/neighbors/neighbor/state/system-name')
                cache[cache_key]['local-name'] = hostname
                cache[cache_key]['local-port'] = e.tags['interface_name']

                if 'remote-name' in cache[cache_key] and 'remote-port' in cache[cache_key]:
                  local_name = cache[cache_key]['local-name']
                  local_port = cache[cache_key]['local-port']
                  remote_name = cache[cache_key]['remote-name']
                  remote_port = cache[cache_key]['remote-port']

                  # Sort names to ensure consistent lldpName
                  if local_name < remote_name:
                    lldp_name = local_name + ":" + local_port + ":" + remote_name + ":" + remote_port
                  else:
                    lldp_name = remote_name + ":" + remote_port + ":" + local_name + ":" + local_port

                  # Create a new event manually
                  new_event = Event("lldp_inventory", e.timestamp)
                  new_event.values = {'lldp_local_mapping': 1}
                  new_event.tags = {}
                  new_event.tags.update(cache[cache_key])
                  new_event.tags['lldpName'] = lldp_name
                  ret_events.append(new_event)

                  if logging:
                    print(new_event)
            else:
              ret_events.append(e)
          return ret_events

On a side note, is there an efficient way in gnmic (starlark or some other existing processor) to change metric value based on label value? For example, with the default interface status, not much one can do

interface_state_oper_status{alias="blah.com",ifAlias="SomeName",ifName="Ethernet1/1",oper_status="NOT_PRESENT"} 1
interface_state_oper_status{alias="blah.com",ifAlias="SomeOtherName",ifName="Ethernet2/1",oper_status="UP"} 1

I need to convert it to

interface_state_oper_status{alias="blah.com",ifAlias="SomeName",ifName="Ethernet1/1",oper_status="NOT_PRESENT"} 0
interface_state_oper_status{alias="blah.com",ifAlias="SomeOtherName",ifName="Ethernet2/1",oper_status="UP"} 1

(based on oper_status label value)

karimra commented 1 month ago

On a side note, is there an efficient way in gnmic (starlark or some other existing processor) to change metric value based on label value? For example, with the default interface status, not much one can do interface_state_oper_status{alias="blah.com",ifAlias="SomeName",ifName="Ethernet1/1",oper_status="NOT_PRESENT"} 1 interface_state_oper_status{alias="blah.com",ifAlias="SomeOtherName",ifName="Ethernet2/1",oper_status="UP"} 1 I need to convert it to interface_state_oper_status{alias="blah.com",ifAlias="SomeName",ifName="Ethernet1/1",oper_status="NOT_PRESENT"} 0 interface_state_oper_status{alias="blah.com",ifAlias="SomeOtherName",ifName="Ethernet2/1",oper_status="UP"} 1 (based on oper_status label value)

You should just map the oper_status value to a numerical value, instead of making it a label.

processors:
  # processor name
  map-oper-status:
    # processor type
    event-strings:
      value-names:
        - "/interface/state/oper-status"
      transforms:
        # strings function name
        - replace:
            apply-on: "value"
            old: "NOT_PRESENT"
            new: "0"
        - replace:
            apply-on: "value"
            old: "UP"
            new: "1"
aned commented 1 month ago

Pretty neat, thanks!