grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Log when the agent is not able to discover services/pods #531

Open QuentinBisson opened 2 years ago

QuentinBisson commented 2 years ago

When trying out the agent v0.26.1 on a cluster with invalid RBAC configured (i.e. an invalid service account), the agent logs no error at all, which makes it impossible to debug what is failing:

ts=2022-08-17T08:56:34.585432153Z caller=server.go:191 level=info msg="server listening on addresses" http=[::]:8080 grpc=127.0.0.1:12346 http_tls_enabled=false grpc_tls_enabled=false
ts=2022-08-17T08:56:34.585885928Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2022-08-17T08:56:34.586011104Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2022-08-17T08:56:34Z level=info caller=traces/traces.go:143 msg="Traces Logger Initialized" component=traces
ts=2022-08-17T08:56:34.589747493Z caller=integrations.go:138 level=warn msg="integrations-next is enabled. integrations-next is subject to change"
ts=2022-08-17T08:56:34.598095095Z caller=reporter.go:107 level=info msg="running usage stats reporter"
ts=2022-08-17T08:56:34.601227094Z caller=wal.go:197 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 msg="replaying WAL, this may take a while" dir=/var/lib/grafana-agent/data/f991db25d94df6b6a2d34e419c0bdac7/wal
ts=2022-08-17T08:56:34.602215367Z caller=wal.go:244 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 msg="WAL segment loaded" segment=0 maxSegment=0
ts=2022-08-17T08:56:34.602621943Z caller=kubernetes.go:313 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component="discovery manager" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-08-17T08:56:34.603232667Z caller=kubernetes.go:313 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component="discovery manager" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-08-17T08:56:34.603757106Z caller=kubernetes.go:313 level=info agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component="discovery manager" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-08-17T08:56:34.605157196Z caller=dedupe.go:112 agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component=remote level=info remote_name=f991db-4b5dd5 url=http://thanos-receive-ingestor-default.thanos.svc:19291/api/v1/receive msg="Starting WAL watcher" queue=f991db-4b5dd5
ts=2022-08-17T08:56:34.605333032Z caller=dedupe.go:112 agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component=remote level=info remote_name=f991db-4b5dd5 url=http://thanos-receive-ingestor-default.thanos.svc:19291/api/v1/receive msg="Starting scraped metadata watcher"
ts=2022-08-17T08:56:34.605842841Z caller=dedupe.go:112 agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component=remote level=info remote_name=f991db-4b5dd5 url=http://thanos-receive-ingestor-default.thanos.svc:19291/api/v1/receive msg="Replaying WAL" queue=f991db-4b5dd5
ts=2022-08-17T08:56:39.427639186Z caller=entrypoint.go:249 level=info msg="reload of config file requested"
ts=2022-08-17T08:56:45.698794738Z caller=dedupe.go:112 agent=prometheus instance=f991db25d94df6b6a2d34e419c0bdac7 component=remote level=info remote_name=f991db-4b5dd5 url=http://thanos-receive-ingestor-default.thanos.svc:19291/api/v1/receive msg="Done replaying WAL" duration=11.092992576s

Meanwhile, with the same service monitors, Prometheus in agent mode (managed by the Prometheus operator) logs these errors:

ts=2022-08-17T09:21:06.200Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.24.0/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"thanos\""
ts=2022-08-17T09:21:15.976Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/client-go@v0.24.0/tools/cache/reflector.go:167: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"thanos\""
ts=2022-08-17T09:21:15.976Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.24.0/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"thanos\""
ts=2022-08-17T09:21:41.453Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/client-go@v0.24.0/tools/cache/reflector.go:167: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"thanos\""
ts=2022-08-17T09:21:41.454Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.24.0/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"thanos\""
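
For what it's worth, the forbidden responses above can be checked directly, without going through service discovery, by asking the API server whether the pod's service account may list the resources discovery needs. The sketch below is purely illustrative and not part of the agent: it uses client-go's SelfSubjectAccessReview, with the namespace and resource taken from the log lines; the rest (package layout, logging) is assumed.

```go
// Illustrative sketch: verify whether the in-cluster service account is
// allowed to list services in the "thanos" namespace, mirroring the
// "services is forbidden" errors shown above. Not Agent/Alloy code.
package main

import (
	"context"
	"fmt"
	"log"

	authv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Ask the API server: "can I list services in namespace thanos?"
	review := &authv1.SelfSubjectAccessReview{
		Spec: authv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authv1.ResourceAttributes{
				Namespace: "thanos",
				Verb:      "list",
				Resource:  "services",
			},
		},
	}
	resp, err := client.AuthorizationV1().SelfSubjectAccessReviews().
		Create(context.Background(), review, metav1.CreateOptions{})
	if err != nil {
		log.Fatalf("creating SelfSubjectAccessReview: %v", err)
	}
	fmt.Printf("can list services in thanos: %v (reason: %s)\n",
		resp.Status.Allowed, resp.Status.Reason)
}
```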
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

hervenicol commented 1 year ago

Do we really want this one to be closed? It has a real usage impact IMO, and proper error logging would really help when using grafana-agent.

rfratto commented 1 year ago

👋 Closing an issue as stale doesn't mean we don't want to do it or that we don't think it's important, just that it's not currently prioritized.

We're also currently exploring whether it makes sense for us to get rid of the stalebot and change how we manage the issue queue. In the meantime, I'll reopen this and tag it keepalive for now.

rfratto commented 1 year ago

After a live discussion in our community call, it seems like the biggest issue here is that Kubernetes SD is hiding errors when it can't connect to Kubernetes. This is something we'll need to fix upstream so that the discovery errors get exposed as log lines.

rfratto commented 1 year ago

I've run into this again and tracked down why nothing gets logged. It turns out the Kubernetes client hides these errors by default; you have to explicitly install a hook via SetWatchErrorHandler to surface them. This still needs to be fixed upstream in Prometheus, though.
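
For illustration, here is a minimal sketch of the hook described above: installing a WatchErrorHandler on a client-go shared informer so that list/watch failures (like the forbidden errors earlier in this issue) surface through the application's own logger. This is not the Agent/Alloy or Prometheus code; the informer choice, logger, and wiring are assumptions.

```go
// Minimal sketch: surface Kubernetes list/watch failures as log lines by
// installing a WatchErrorHandler on a client-go shared informer.
package main

import (
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Matches the "Using pod service account via in-cluster config" lines above.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 5*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	// Without a handler, reflector errors go through client-go's default
	// handling (klog), which an application may never wire to its own logger.
	// The hook must be installed before the informer starts.
	if err := podInformer.SetWatchErrorHandler(func(_ *cache.Reflector, err error) {
		log.Printf("level=error component=\"discovery manager\" msg=\"pod watch failed\" err=%v", err)
	}); err != nil {
		log.Fatalf("installing watch error handler: %v", err)
	}

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced)

	// Block so that subsequent watch failures keep being reported.
	select {}
}
```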

jcreixell commented 1 year ago

I tested this and can confirm that it makes errors show up in logs. Agreed that this needs to be fixed upstream.

rfratto commented 5 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only receive bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)