openconfig / gnmi-gateway

A modular, distributed, and highly available service for modern network telemetry via OpenConfig and gNMI
Apache License 2.0
137 stars 32 forks source link

Get "cache update error: update is stale," error after receiving subscription notification from a target. #24

Closed chrispsommers closed 3 years ago

chrispsommers commented 3 years ago

Bug description Get "cache update error: update is stale," error after receiving subscription notification from a target.

Step to reproduce

  1. targets.json:
    {
    "request": {
    "default": {
      "subscribe": {
        "prefix": {
        },
        "subscription": [
          {
            "path": {
              "elem": [
                {
                  "name": "interfaces"
                }
              ]
            }
          }
        ]
      }
    }
    },
    "target": {
    "athena-gnmi-server": {
      "addresses": [
        "localhost:8080"
      ],
      "credentials": {
        "username": "admin",
        "password": "admin"
      },
      "request": "default",
      "meta": {
        "NoTLSVerify": "yes"
      }
    }
    }
    }
  2. Execute as follows (using host network for simplicity):
    $ docker run     -it --rm     -p 59100:59100     -v $(pwd)/examples/gnmi-prometheus/targets.json:/opt/gnmi-gateway/targets.json

Expected behavior Receive notifications without error messages, expose to northbound.

gnmi-gateway logs:

WARNING: Published ports are discarded when using host network mode
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Starting GNMI Gateway."}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Clustering is NOT enabled. No locking or cluster coordination will happen."}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Starting connection manager."}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Starting gNMI server on 0.0.0.0:9339."}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Starting Prometheus exporter."}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Connection manager received a target control message: 1 inserts 0 removes"}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Initializing target athena-gnmi-server ([localhost:8080]) map[NoTLSVerify:yes]."}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Target athena-gnmi-server: Connecting"}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Target athena-gnmi-server: Subscribing"}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Target athena-gnmi-server: Connected"}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Starting Prometheus HTTP server."}
{"level":"info","time":"2021-01-22T02:26:40Z","message":"Target athena-gnmi-server: Disconnected"}
E0122 02:26:40.826565       1 reconnect.go:114] client.Subscribe (target "athena-gnmi-server") failed: target 'athena-gnmi-server' cache update error: update is stale, update is stale, update is stale, update is stale, update is stale, update is stale, update is stale, update is stale, update is stale, update is stale, update is stale, update is stale, update is stale: { timestamp:1611282400 prefix:/athena-gnmi-server update:[ { path:/interfaces/interface[name=tx]/state/counters/in-octets val:&{2048000000} } { path:/interfaces/interface[name=tx]/state/counters/out-octets val:&{1518000000} } { path:/interfaces/interface[name=tx]/state/counters/out-pkts val:&{4000000} } { path:/interfaces/interface[name=tx]/config/description val:&{localhost:5555;1} } { path:/interfaces/interface[name=tx]/ethernet/state/counters/in-crc-errors val:&{0} } { path:/interfaces/interface[name=tx]/config/type val:&{ethernetCsmacd} } { path:/interfaces/interface[name=tx]/state/out-rate val:&{[0 0 0 0]} } { path:/interfaces/interface[name=tx]/state/counters/in-pkts val:&{4000000} } { path:/interfaces/interface[name=tx]/ethernet/config/port-speed val:&{SPEED_1GB} } { path:/interfaces/interface[name=tx]/state/in-rate val:&{[0 0 0 0]} } { path:/interfaces/interface[name=tx]/config/name val:&{tx} } { path:/interfaces/interface[name=tx]/name val:&{tx} } { path:/interfaces/interface[name=tx]/state/oper-status val:&{UP} } ] }; reconnecting in 552.330144ms
{"level":"info","time":"2021-01-22T02:26:41Z","message":"Target athena-gnmi-server: Connected"}
{"level":"info","time":"2021-01-22T02:26:41Z","message":"Target athena-gnmi-server: Disconnected"}

Target logs (partial) showing notifications being sent to server:

I0121 18:38:42.289725    8200 data_client.go:306] Notifications: [timestamp:1611283122 update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"out-rate"}} val:{bytes_val:"\x00\x00\x00\x00"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"counters"} elem:{name:"in-pkts"}} val:{uint_val:4000002}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"counters"} elem:{name:"out-octets"}} val:{uint_val:1518000000}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"counters"} elem:{name:"out-pkts"}} val:{uint_val:4000002}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"config"} elem:{name:"description"}} val:{string_val:"localhost:5555;1"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"ethernet"} elem:{name:"state"} elem:{name:"counters"} elem:{name:"in-crc-errors"}} val:{uint_val:0}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"ethernet"} elem:{name:"config"} elem:{name:"port-speed"}} val:{string_val:"SPEED_1GB"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"name"}} val:{string_val:"tx"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"config"} elem:{name:"type"}} val:{string_val:"ethernetCsmacd"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"counters"} elem:{name:"in-octets"}} val:{uint_val:2048000414}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"in-rate"}} val:{bytes_val:"\x00\x00\x00\x00"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"config"} elem:{name:"name"}} val:{string_val:"tx"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"oper-status"}} val:{string_val:"UP"}}]. Err <nil>

Environment details

Additional context I'm able to subscribe to this client using gnmi_cli, see below. Note this version is patched to support some of the options below, not sure it matters).

Patch inside gnmi_cli source dir:

  wget https://raw.githubusercontent.com/Azure/sonic-telemetry/master/patches/gnmi_cli.all.patch; patch -p4 < gnmi_cli.all.patch

Subscribe result:

$ gnmi_cli -a localhost:8080 -logtostderr -insecure -t ATHENA  -query "/device/interfaces"  -qt s -streaming_type SAMPLE  -streaming_sample_interval 2 -v 5 -display_type s
I0121 18:43:35.681664    8544 register.go:113] Attempting client types: [gnmi]
I0121 18:43:35.741912    8544 register.go:126] client "gnmi" create with type *client.Client
I0121 18:43:35.752914    8544 client.go:186] update:{timestamp:1611283415 update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"counters"} elem:{name:"in-octets"}} val:{uint_val:2048000414}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"ethernet"} elem:{name:"state"} elem:{name:"counters"} elem:{name:"in-crc-errors"}} val:{uint_val:0}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"ethernet"} elem:{name:"config"} elem:{name:"port-speed"}} val:{string_val:"SPEED_1GB"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"config"} elem:{name:"name"}} val:{string_val:"tx"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"out-rate"}} val:{bytes_val:"\x00\x00\x00\x00"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"counters"} elem:{name:"in-pkts"}} val:{uint_val:4000002}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"counters"} elem:{name:"out-octets"}} val:{uint_val:1518000000}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"counters"} elem:{name:"out-pkts"}} val:{uint_val:4000002}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"config"} elem:{name:"description"}} val:{string_val:"localhost:5555;1"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"in-rate"}} val:{bytes_val:"\x00\x00\x00\x00"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"name"}} val:{string_val:"tx"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"state"} elem:{name:"oper-status"}} val:{string_val:"UP"}} update:{path:{elem:{name:"interfaces"} elem:{name:"interface" key:{key:"name" value:"tx"}} elem:{name:"config"} elem:{name:"type"}} val:{string_val:"ethernetCsmacd"}}}
interfaces/interface/tx/state/counters/in-octets, 2048000414
interfaces/interface/tx/ethernet/state/counters/in-crc-errors, 0
interfaces/interface/tx/ethernet/config/port-speed, SPEED_1GB
interfaces/interface/tx/config/name, tx
interfaces/interface/tx/state/out-rate, [0 0 0 0]
interfaces/interface/tx/state/counters/in-pkts, 4000002
interfaces/interface/tx/state/counters/out-octets, 1518000000
interfaces/interface/tx/state/counters/out-pkts, 4000002
interfaces/interface/tx/config/description, localhost:5555;1
interfaces/interface/tx/state/in-rate, [0 0 0 0]
interfaces/interface/tx/name, tx
interfaces/interface/tx/state/oper-status, UP
interfaces/interface/tx/config/type, ethernetCsmacd
colinmcintosh commented 3 years ago

Hi @chrispsommers, thanks for the issue report! It looks like for your example case there are multiple updates in a single gNMI notification and gnmi-gateway wasn't correctly handling errors for multiple updates. I made a fix for the error handling process but after reading through your issue a second time I'm not sure that my fix actually covers your concern regarding the expected behavior:

Receive notifications without error messages, expose to northbound.

My change will allow notifications to be received and the connection to be stable but won't do anything else besides continuing to drop stale updates (i.e. won't forward stale updates to clients or Exporters). This is actually a function of the cache from the gnmi library and not something the gateway has direct control over without making multiple copies of the notifications.

Can you clarify what details you expect to be exposed when a stale update is received? Or if the fix I'm going to merge in covers your concerns feel free to close out the issue too.

colinmcintosh commented 3 years ago

I'm going to close this issue. Feel free to re-open this or comment if the merged commits link above don't resolve your issue.