openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0

Deploying gNMIc clustered on Kubernetes - losing targets #526

Open pboers1988 opened 1 week ago

pboers1988 commented 1 week ago

When deploying gNMIc clustered on Kubernetes I'm running into the following problem: when K8s reschedules a pod to a different node, or for whatever reason triggers a restart of a pod (e.g. resource constraints), the clustering process loses targets and does not recover. To recover I have to delete the leader lock (in Consul or the K8s lease), which causes the cluster to rebalance and re-acquire targets. I'm deploying the gNMIc process as a StatefulSet and collecting roughly 22k metrics per second from around 350 nodes, so this deployment is resource intensive.

Are there any other users who have an idea how I could deploy gnmic in a better way?

My pipeline is as follows:

Routers -> gNMIc collector -> Kafka -> gNMIc relay -> Influxdb

The collector configuration:

api-server:
  address: :7890
  cache:
    address: redis-master.production.svc.cluster.local:6379
    type: redis
  debug: false
  enable-metrics: true
debug: false
encoding: proto
format: event
gzip: false
password: ${GNMIC_PASSWORD}
skip-verify: true
username: ${GNMIC_USERNAME}
outputs:
  k8s-cluster:
    address: ${KAFKA_BOOSTRAP_SERVER}
    cache:
      address: redis-master.production.svc.cluster.local:6379
      expiration: 60s
      type: redis
    event-processors:
    - group-by-interface-and-source
    group-id: ${KAFKA_GROUP}
    sasl:
      mechanism: ${KAFKA_AUTH_MECH}
      password: ${KAFKA_PASSWORD}
      user: ${KAFKA_USERNAME}
    tls:
      skip-verify: true
    topic: ${KAFKA_TOPIC}
    type: kafka
processors:
  group-by-interface-and-source:
    event-group-by:
      tags:
      - interface_name
      - source
subscriptions:
  components:
    mode: stream
    paths:
    - /components
    stream-mode: target-defined
  network_instances_bgp:
    mode: stream
    paths:
    - /network-instances/network-instance/protocols/protocol/bgp
    - /network-instances/network-instance/interfaces
    - /network-instances/network-instance/state
    stream-mode: target-defined
  port_stats:
    mode: stream
    paths:
    - /interfaces
    stream-mode: target-defined
  system:
    mode: stream
    paths:
    - /system
    stream-mode: target-defined
targets:
  target1t:
    address: target1
    subscriptions:
    - port_stats
    - network_instances_bgp
    - components
    - system

<snip>

api: ":7890"
clustering:
  cluster-name: gnmic-collector
  locker:
    type: consul
    address: gnmic-consul-svc:8500
gnmi-server:
  address: ":57400"
  debug: false
  cache:
    type: redis
    address: redis-master.production.svc.cluster.local:6379
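
For completeness, the relay in the pipeline above is just another gNMIc instance with a Kafka input feeding an InfluxDB output. A rough sketch of that leg is below; it is not the actual relay config, and the field names (topics, group-id, url, bucket, token) are taken from my reading of the gnmic inputs/outputs docs, so double-check them against the docs for your gnmic version:

inputs:
  kafka-in:
    type: kafka
    # same brokers/topic the collector publishes to
    address: ${KAFKA_BOOSTRAP_SERVER}
    topics:
      - ${KAFKA_TOPIC}
    group-id: gnmic-relay
    format: event
    outputs:
      - influx-out
outputs:
  influx-out:
    type: influxdb
    # url, bucket and ${INFLUX_TOKEN} are placeholders for illustration
    url: http://influxdb.production.svc.cluster.local:8086
    bucket: telemetry
    token: ${INFLUX_TOKEN}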

peejaychilds commented 1 week ago

Sorry, I probably won't be of much help, but I'm always interested in how others are deploying. We have a k8s deployment for a proof-of-concept currently.

I was thinking of Consul or a cluster -- how do you delete the leader lock?

Our pipeline is Routers -> gNMIc collector -> telegraf -> Influxdb, with kapacitor doing CQs for rollup (into 5 min samples for strategic data collection, for items where we want a longer, non-tactical view).

The telegraf is a sidecar with a health check that dies if the buffer is more than x,000 records - so we get a bit of buffering available if we momentarily lose the connection to influx etc.
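
Roughly how that kind of sidecar health check can be wired up, assuming the buffer check is exposed via Telegraf's health output and probed by a Kubernetes livenessProbe (port, image tag and probe timings below are placeholders; the buffer-size threshold itself lives in the Telegraf config, not in the pod spec):

# pod spec excerpt - placeholder values only
containers:
  - name: telegraf
    image: telegraf:1.30
    livenessProbe:
      httpGet:
        path: /          # health endpoint returns non-200 when the buffer check fails
        port: 8080
      periodSeconds: 15
      failureThreshold: 3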

I use a bash script to statically assign devices to StatefulSet pods via hard-coded YAML in the pods (rough sketch below). Not particularly flexible, but OK for a proof-of-concept, and we don't add/remove devices super often.
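
A stripped-down sketch of the idea -- the real script groups by device type/profile rather than plain round-robin, and the file names and subscription names here are made up:

#!/usr/bin/env bash
# sketch only: round-robin devices.txt across statefulset pod ordinals,
# producing one targets fragment per pod (a top-level targets: key is
# prepended to each file elsewhere); all names are illustrative
POD_COUNT=4
i=0
while read -r device; do
  pod=$((i % POD_COUNT))
  cat >> "targets-pod-${pod}.yaml" <<EOF
  ${device}:
    address: ${device}
    subscriptions:
      - port_stats
EOF
  i=$((i + 1))
done < devices.txt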

If a pod restarts, well, it restarts, and we will lose telemetry until it comes back.

I have a bunch of different device types and profiles, and the script spreads them over the pods in a deterministic way... so I know for a zone/region that devices of type X will be on zone Y's pod #3, etc. So monitoring pod resources and telemetry means we get pretty consistent graphs.

We run an 'A' telemetry stack and a 'B' telemetry stack which are independent but both poll all the devices, so if we blow an Influx or lose a storage DC we have the other for tactical purposes - and we stage upgrades in production one side at a time etc...

743 devices currently, 8k points/second -- we filter any stat not in a specific allow list - pre-prod gets everything, prod drops most of the metrics from things like /interfaces unless specifically allow-listed. We are trying to 're-work' from a point where we had JTI telemetry doing 60k/sec for 70 devices, so that we don't collect metrics for things we don't need/use.
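
For anyone wanting to do that filtering in gNMIc itself rather than downstream, an event processor attached to the output is one option. A rough sketch only -- the processor name, output name and regexes are made up, and the event-allow field names should be checked against the gnmic processors docs for your version:

processors:
  allow-listed-values:
    # keep only events whose value names match the allow list;
    # everything else is dropped before it reaches the output
    event-allow:
      value-names:
        - ".*in-octets$"
        - ".*out-octets$"
        - ".*oper-status$"
outputs:
  some-output:
    event-processors:
      - allow-listed-values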

pboers1988 commented 1 week ago

We deploy the gNMIc cluster using this chart: https://github.com/workfloworchestrator/gnmic-cluster-chart following the instructions here: https://gnmic.openconfig.net/user_guide/HA/

As for deleting the leader lock, it's relatively simple: when using K8s you need to delete the lease:

k delete lease -n streaming gnmic-collector-leader

We are running on AKS and were running into Kube-API rate-limit issues when the cluster leader was managing the leases. We decided to use the other option, Consul, to store the state of the quorum. Consul has a web interface that you can browse to; you can manage what is stored in Consul there and edit/delete stuff.
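
If you prefer the CLI over the web UI, the same clean-up can be done with consul kv; the key prefix below is a guess based on our cluster-name, so list the keys first to find the real leader key:

# list what is stored under the cluster prefix (prefix is a guess,
# check the listing or the web UI for the real path)
consul kv get -recurse gnmic-collector
# deleting the leader key forces a new election and target redistribution
consul kv delete gnmic-collector/leader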

[Screenshot: Consul web interface]

The clustering mode of gNMIc relies on the cluster leader to dispatch targets to the separate workers.