pboers1988 opened this issue 1 week ago
Sorry, I probably won't be of much help, but I'm always interested in how others are deploying. We currently have a k8s deployment for a proof of concept.
Was thinking of Consul or a k8s cluster lease -- how do you delete the leader lock?
Our pipeline is Routers -> gNMIc collector -> Telegraf -> InfluxDB, with Kapacitor doing continuous queries (CQs) for rollup into 5-minute samples -- strategic data collection for items where we want a longer, non-tactical view.
The Telegraf is a sidecar with a health check that dies if the buffer is more than x,000 records, so we get a bit of buffering available if we momentarily lose the connection to Influx etc.
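That kind of health check can be wired up as a Kubernetes liveness probe against Telegraf's health output plugin (`outputs.health`, fed by `inputs.internal`, can start returning 503 once the `buffer_size` field crosses a threshold). A minimal sketch -- the port, image tag, and timings here are assumptions, not our actual values:

```yaml
# Sketch of a Telegraf sidecar with a buffer-based liveness probe.
# Port 8888 and the timings below are placeholders.
containers:
  - name: telegraf
    image: telegraf:1.28
    livenessProbe:
      httpGet:
        path: /      # outputs.health returns 200 while its checks pass, 503 otherwise
        port: 8888
      periodSeconds: 30
      failureThreshold: 3   # ~90s of "buffer too big" before the kubelet restarts the pod
```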
I use a bash script to statically assign devices to StatefulSet pods via hard-coded YAML in the pods. Not particularly flexible, but OK for a proof of concept, and we don't add/remove devices very often.
If a pod restarts, well, it restarts, and we lose telemetry until it comes back.
I have a bunch of different device types and profiles, and the script spreads them over the pods in a deterministic way (see the sketch below), so I know that for a zone/region, devices of type X will be on zone Y's pod #3, etc. So monitoring pod resources and telemetry means we get pretty consistent graphs.
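The end result is one targets file per pod. A hypothetical example of what such a script might emit -- the device names and port are made up for illustration, but gNMIc does read its targets from a `targets:` map keyed by address:

```yaml
# zone-y-pod-3.yaml -- hypothetical output: all type-X devices in
# zone Y pinned to StatefulSet pod 3. Per-target options left empty,
# so global defaults apply.
targets:
  rtr-zoney-typex-01:57400: {}
  rtr-zoney-typex-02:57400: {}
  rtr-zoney-typex-03:57400: {}
```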
We run an 'A' telemetry stack and a 'B' telemetry stack, which are independent but both poll all the devices, so if we blow an Influx or lose a storage DC we have the other for tactical purposes -- and we stage upgrades in production one side at a time, etc.
743 devices currently, 8k points/second -- we filter out any stat not on a specific allow list; pre-prod gets everything, prod drops most of the metrics from things like /interfaces unless specifically allow-listed. We are trying to rework from a point where we had JTI telemetry doing 60k/sec for 70 devices, so we don't collect metrics we don't need/use.
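One way to express that kind of allow list in gNMIc itself is an event-allow processor attached to the output. A sketch with made-up paths and an illustrative InfluxDB output (our real list differs, and the same filtering can also be done on the Telegraf side):

```yaml
processors:
  # Hypothetical allow list: keep only events whose value name
  # matches one of these regexes; everything else is dropped.
  keep-core-counters:
    event-allow:
      value-names:
        - "/interfaces/interface/state/counters/in-octets$"
        - "/interfaces/interface/state/counters/out-octets$"

outputs:
  influx-output:
    type: influxdb
    url: http://influxdb:8086   # placeholder address
    event-processors:
      - keep-core-counters
```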
We deploy the gNMIc cluster using this chart: https://github.com/workfloworchestrator/gnmic-cluster-chart following the instructions here: https://gnmic.openconfig.net/user_guide/HA/
As for deleting the leader lock, it's relatively simple: when using K8s you need to delete the lease:
k delete lease -n streaming gnmic-collector-leader
We are running on AKS and were running into Kube-API rate-limit issues when the cluster leader was managing the leases. We decided to use the other option, Consul, to store the state of the quorum. Consul has a web interface you can browse to; you can manage what is stored in Consul there and edit/delete entries.
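For reference, switching the locker is a small change in the gNMIc clustering config. A sketch along the lines of the HA docs linked above -- the cluster name and Consul address are placeholders:

```yaml
clustering:
  cluster-name: gnmic-collector
  targets-watch-timer: 30s
  locker:
    type: consul   # instead of type: k8s, which stores leases via the Kube API
    address: consul.streaming.svc.cluster.local:8500
```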
gNMIc's clustering mode relies on the cluster leader to dispatch targets to the individual workers.
When deploying gNMIc clustered on Kubernetes, I'm running into the following problem: when K8s reschedules a pod to a different node, or for whatever reason triggers a restart of a pod (e.g. resource constraints), the clustering process loses targets and does not recover. To recover, I have to delete the leader lock (in Consul, or the k8s lease), which causes the cluster to rebalance and re-acquire targets. I'm deploying the gNMIc process as a StatefulSet and collecting roughly 22k metrics per second from around 350 nodes, so this deployment is resource intensive.
Are there any other users who have an idea of how I could deploy gNMIc in a better way?
My pipeline is as follows:
Routers -> gNMIc collector -> Kafka -> gNMIc relay -> Influxdb
The collector configuration:
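A minimal sketch of the general shape of such a config -- every target, credential, and address below is a placeholder, not the production value:

```yaml
# Sketch only: all names and addresses here are placeholders.
targets:
  router-1:57400:
    username: admin
    password: admin
subscriptions:
  counters:
    paths:
      - /interfaces/interface/state/counters
    mode: stream
    stream-mode: sample
    sample-interval: 10s
outputs:
  kafka-output:
    type: kafka
    address: kafka.streaming.svc.cluster.local:9092
    topic: gnmic-telemetry
clustering:
  cluster-name: gnmic-collector
  locker:
    type: consul
    address: consul.streaming.svc.cluster.local:8500
```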