kubeshop / botkube

An app that helps you monitor your Kubernetes cluster, debug critical deployments & gives recommendations for standard practices
https://botkube.io
MIT License

Reflector ListAndWatch trace error #858

Open bygui86 opened 1 year ago

bygui86 commented 1 year ago

Description

In the Botkube logs I see the following trace error multiple times:

I1122 08:58:32.111260       1 trace.go:205] Trace[1926275561]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169 (22-Nov-2022 08:58:21.197) (total time: 10913ms):
Trace[1926275561]: ---"Objects listed" error:<nil> 10912ms (08:58:32.110)
Trace[1926275561]: [10.913396443s] [10.913396443s] END
I1122 08:58:33.410831       1 trace.go:205] Trace[2102234783]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169 (22-Nov-2022 08:58:21.197) (total time: 12212ms):
Trace[2102234783]: ---"Objects listed" error:<nil> 12212ms (08:58:33.410)
Trace[2102234783]: [12.212870039s] [12.212870039s] END

What are these traces related to? How can I fix them?

It seems there is no impact on Botkube functionality, but it keeps triggering our alerting...

Expected behavior

No trace error logs in the Botkube output.

Steps to reproduce

  1. Deploy Botkube
  2. Watch the logs

Additional information

pkosiec commented 1 year ago

Hey @bygui86, this log comes from the Kubernetes client-go package we use. As you can see here, such a trace is reported when listing the resources takes too long (more than 10 seconds). Interestingly, in your case it looks like it never gets past the "Objects listed" step, even though ListAndWatch should list all items and then establish a single connection to watch the resources (after a successful list - here).

I couldn't reproduce this behavior on a few long-running Botkube 0.15 installations (k3d, colima and Oracle Cloud clusters). I'd suspect some issue with your cluster (the K8s API server specifically), or that the Botkube pod is unable to set up the watch connection. Do you have proper RBAC configured for the Botkube pod, like in the default configuration?

And yes, we can suppress such logs (like here: https://github.com/werf/werf/pull/1754/files), but I'm not sure if this is a good idea 🤔 Such traces can sometimes help with debugging. One thing we could do, though, is emit these traces at the info log level only.

But in this case, I think there's an error that results in constantly re-listing all resources. Do Botkube notifications work for you as they should?

bygui86 commented 1 year ago

@pkosiec thanks for your quick answer.

I couldn't reproduce this behavior on a few long-running Botkube 0.15 installations (k3d, colima and Oracle Cloud clusters).

I run Botkube in 2 GKE clusters, one public and one private.

I'd suspect some issues with your cluster (K8s API server specifically), or some inability to set up watch connection from the Botkube pod. Do you have proper RBAC configured for Botkube pod like in the default configuration?

I cannot give Botkube the default set of permissions; it's too broad, especially regarding Secrets. I configured the following RBAC and tested it:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: botkube
rules:
  # K8s
  - apiGroups:
      - ""
    resources:
      - events
      - configmaps
      - endpoints
      - limitranges
      - namespaces
      - nodes
      - persistentvolumeclaims
      - persistentvolumes
      - pods
      - resourcequotas
      - serviceaccounts
      - services
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - apps
    resources:
      - deployments
      - statefulsets
      - daemonsets
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - batch
    resources:
      - cronjobs
      - jobs
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
      - podsecuritypolicies
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - apiextensions.k8s.io
    resources:
      - customresourcedefinitions
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - rbac.authorization.k8s.io
    resources:
      - clusterrolebindings
      - clusterroles
      - rolebindings
      - roles
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - admissionregistration.k8s.io
    resources:
      - mutatingwebhookconfigurations
      - validatingwebhookconfigurations
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - networking.k8s.io
    resources:
      - networkpolicies
      - ingresses
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - scheduling.k8s.io
    resources:
      - priorityclasses
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - storage.k8s.io
    resources:
      - storageclasses
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - snapshot.storage.k8s.io
    resources:
      - volumesnapshotclasses
      - volumesnapshotcontents
      - volumesnapshots
    verbs:
      - get
      - watch
      - list
  # CRDs
  - apiGroups:
      - cloud.google.com
      - internal.autoscaling.gke.io
      - networking.gke.io
      - hub.gke.io
      - nodemanagement.gke.io
      - crd.projectcalico.org
      - rbacmanager.reactiveops.io
      - cassandraoperator.instaclustr.com
      - kafka.strimzi.io
      - monitoring.coreos.com
      - logging.banzaicloud.io
      - logging-extensions.banzaicloud.io
      - traefik.containo.us
    resources:
      - "*"
    verbs:
      - get
      - watch
      - list

And yes, we can suppress such logs (like here: https://github.com/werf/werf/pull/1754/files), but I'm not sure if this is a good idea 🤔 Such traces can sometimes help with debugging. One thing we could do, though, is emit these traces at the info log level only.

I see your point, but this should be a "debug" level log, not info. I suggest not suppressing it, but lowering its level so we can enable it when deeper debugging is required.

But in this case, I think there's an error that results in constantly re-listing all resources. Do Botkube notifications work for you as they should?

Notifications seem to work fine, but we have a huge configuration, so I cannot tell whether we are getting the same number of notifications as before. I noticed that such logs appear 4-5 times after Botkube startup and then never again for hours.

pkosiec commented 1 year ago

I noticed that such logs appear 4-5 times after Botkube startup and then never again for hours.

Oh, okay, so that means it's not a big issue. Alright, I think we can reconfigure klog based on the log configuration 👍 My only hesitation is that the traces you see are actually emitted at klog's INFO severity, so I'm not sure they should be moved to debug 🤔 @mszostok WDYT?

bygui86 commented 1 year ago

@pkosiec for now it's not a big issue, but I don't know about the long term. Anyway, I configured Botkube to log in JSON format, so it would be really nice to have these klog traces emitted as JSON as well.