openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0

K8s Clustering Leader Receives Context Deadline Error on Target #393

Status: Closed (closed by EntainAustralia 5 months ago)

EntainAustralia commented 5 months ago

One of our two targets hits a context deadline error repeatedly, about every 0.1 seconds. Can this rate be reduced somewhere, or is there something else we can change to fix it?

Config

    log: true

    loader:
      type: file
      path: /app/targets-config.yaml
      interval: 30s
      enable-metrics: false

    clustering:
      cluster-name: cluster1
      targets-watch-timer: 30s
      locker:
        type: k8s
        namespace: gnmic
        lease-duration: 30s
        renew-period: 
        retry-timer: 30s
        debug: true

    subscriptions:
      interfaces:
        prefix:
        target:
        set-target:
        paths:
          - interfaces/interface/state/oper-status
        models: []
        mode: STREAM
        stream-mode: SAMPLE
        encoding:  PROTO
        qos:
        sample-interval: 10s
        heartbeat-interval:
        suppress-redundant:
        updates-only:
        outputs:
    outputs:
      prom:
        type: prometheus
        listen: "0.0.0.0:9804" 
        path: /metrics 
        expiration: 60s 
        metric-prefix: "gnmic" 
        append-subscription-name: false 
        override-timestamps: false
        export-timestamps: false 
        strings-as-labels: true
        cache: 
        timeout: 10s
        debug: true 
        add-target: 
        target-template:
        event-processors:
        num-workers: 1

Logs

2024/03/11 02:18:42.401237 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:42.502200 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:42.602368 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:42.703191 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:42.803523 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:42.904003 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.004392 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.105292 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.205649 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.306297 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.406453 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.507565 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.607855 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.708056 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.808335 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:43.909354 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.009764 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.109974 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.210221 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.310872 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.411229 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.512322 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.612520 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.713079 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.813434 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:44.913785 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:45.014109 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:45.115067 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded
2024/03/11 02:18:45.215465 [gnmic] failed getting value of {{ TARGET NAME }}: client rate limiter Wait returned an error: context deadline exceeded

karimra commented 5 months ago

This might be a bug: gNMIc is trying to list all leases with a context that has already reached its deadline. The 0.1 seconds is an internal retry timer, so changing it won't solve this problem. Can you share which version of gNMIc you are running?
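
The failure mode described above can be illustrated with a minimal Python sketch. This is not gNMIc's actual code: `list_leases`, `retry_with_shared_deadline`, and `DeadlineExceeded` are hypothetical names, and the only assumption is that every retry reuses one deadline that has already passed. Under that assumption each attempt fails immediately, so the error is logged once per retry interval regardless of how the timer is tuned.

```python
import time


class DeadlineExceeded(Exception):
    """Hypothetical stand-in for a 'context deadline exceeded' error."""


def list_leases(deadline):
    # Hypothetical deadline-bound call: it refuses to run once the deadline has passed.
    if time.monotonic() >= deadline:
        raise DeadlineExceeded("context deadline exceeded")
    return ["lease-a"]


def retry_with_shared_deadline(deadline, retry_interval, attempts):
    # Every attempt reuses the same deadline, so an expired deadline never recovers.
    errors = 0
    for _ in range(attempts):
        try:
            return list_leases(deadline), errors
        except DeadlineExceeded:
            errors += 1
            time.sleep(retry_interval)
    return None, errors


# A deadline that expired one second ago: all five attempts fail, 0.1 s apart.
expired = time.monotonic() - 1
result, errors = retry_with_shared_deadline(expired, retry_interval=0.1, attempts=5)
print(result, errors)
```

In this sketch, shortening or lengthening `retry_interval` only changes how often the error appears; the call keeps failing until a fresh deadline is used.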

Monkman08 commented 5 months ago

Seeing this on 0.36.2. This could also be a misconfiguration, given the limited documentation around the k8s locker. We configured the k8s locker based on a comment in https://github.com/karimra/gnmic/issues/560#issuecomment-1102234193

EntainAustralia commented 5 months ago

We are on version 0.36.2

version : 0.36.2
 commit : a7844a6d
   date : 2024-03-05T20:10:26Z
 gitURL : https://github.com/openconfig/gnmic
   docs : https://gnmic.openconfig.net

EntainAustralia commented 5 months ago

The solution was to make sure the following environment variables were configured correctly. We had originally removed them because they were not documented or explained anywhere.

            - name: GNMIC_CLUSTERING_INSTANCE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: GNMIC_CLUSTERING_SERVICE_ADDRESS
              value: "$(GNMIC_CLUSTERING_INSTANCE_NAME).gnmic-svc.gnmic.svc.cluster.local"

karimra commented 5 months ago

The ENV variables follow a naming pattern explained here: https://gnmic.openconfig.net/user_guide/configuration_env/#configuration-file-to-environment-variables-mapping

GNMIC_CLUSTERING_INSTANCE_NAME=XXX and GNMIC_CLUSTERING_SERVICE_ADDRESS=YYY are equivalent to:

    clustering:
      instance-name: XXX
      service-address: YYY
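
As a rough illustration of that mapping, here is a small Python sketch. The helper `env_var_name` is hypothetical, and the rule it implements (a `GNMIC` prefix, keys uppercased, hyphens replaced with underscores, levels joined with underscores) is inferred only from the two examples above. The linked documentation page remains the authoritative description.

```python
def env_var_name(*keys, prefix="GNMIC"):
    # Assumed rule, inferred from the two examples in this thread:
    # uppercase each config key, swap '-' for '_', and join the levels with '_'.
    parts = [prefix] + [key.replace("-", "_").upper() for key in keys]
    return "_".join(parts)


print(env_var_name("clustering", "instance-name"))    # GNMIC_CLUSTERING_INSTANCE_NAME
print(env_var_name("clustering", "service-address"))  # GNMIC_CLUSTERING_SERVICE_ADDRESS
```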