grafana / helm-charts


Loki DNS Resolution Errors for loki-backend-headless.grafana.svc.cluster.local #3280

Open bpsizemore opened 3 months ago

bpsizemore commented 3 months ago

I'm trying to set up Loki in Simple Scalable mode, per the documentation's recommendation.

My values file looks like:

        loki:
          schemaConfig:
            configs:
              - from: 2024-04-01
                store: tsdb
                object_store: s3
                schema: v13
                index:
                  prefix: loki_index_
                  period: 24h
          ingester:
            chunk_encoding: snappy
          tracing:
            enabled: true
          querier:
            # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
            max_concurrent: 4

        gateway:
          enabled: true
          <redacted>         

        deploymentMode: SimpleScalable

        backend:
          replicas: 1
        read:
          replicas: 1
        write:
          replicas: 1

        # Enable minio for storage
        minio:
          enabled: true

        # Zero out replica counts of other deployment modes
        singleBinary:
          replicas: 0

        ingester:
          replicas: 0
        querier:
          replicas: 0
        queryFrontend:
          replicas: 0
        queryScheduler:
          replicas: 0
        distributor:
          replicas: 0
        compactor:
          replicas: 0
        indexGateway:
          replicas: 0
        bloomCompactor:
          replicas: 0
        bloomGateway:
          replicas: 0

The deployment succeeds and everything appears to come up clean, but I'm having issues with the loki-read pods.

The logs indicate that it cannot find an IP for the loki-backend-headless service:

level=error ts=2024-08-21T02:16:39.423427136Z caller=ring_watcher.go:56 component=querier component=querier-scheduler-worker msg="error getting addresses from ring" err="empty ring"
level=error ts=2024-08-21T02:16:41.41970996Z caller=resolver.go:87 index-store=tsdb-2024-04-01 msg="failed to lookup IP addresses" host=loki-backend-headless.grafana.svc.cluster.local err="lookup loki-backend-headless.grafana.svc.cluster.local on 172.20.0.10:53: no such host"
level=warn ts=2024-08-21T02:16:41.419738195Z caller=resolver.go:134 index-store=tsdb-2024-04-01 msg="IP address lookup yielded no results. No host found or no addresses found" host=loki-backend-headless.grafana.svc.cluster.local
level=error ts=2024-08-21T02:16:42.423381731Z caller=ring_watcher.go:56 component=querier component=querier-scheduler-worker msg="error getting addresses from ring" err="empty ring"
level=info ts=2024-08-21T02:16:45.42345762Z caller=worker.go:231 component=querier msg="adding connection" addr=10.11.1.106:9095
level=error ts=2024-08-21T02:16:46.420802204Z caller=resolver.go:87 index-store=tsdb-2024-04-01 msg="failed to lookup IP addresses" host=loki-backend-headless.grafana.svc.cluster.local err="lookup loki-backend-headless.grafana.svc.cluster.local on 172.20.0.10:53: no such host"
level=warn ts=2024-08-21T02:16:46.420835566Z caller=resolver.go:134 index-store=tsdb-2024-04-01 msg="IP address lookup yielded no results. No host found or no addresses found" host=loki-backend-headless.grafana.svc.cluster.local
level=info ts=2024-08-21T02:16:46.422918838Z caller=frontend_scheduler_worker.go:106 msg="adding connection to scheduler" addr=10.11.1.106:9095
level=error ts=2024-08-21T02:16:51.420235794Z caller=resolver.go:87 index-store=tsdb-2024-04-01 msg="failed to lookup IP addresses" host=loki-backend-headless.grafana.svc.cluster.local err="lookup loki-backend-headless.grafana.svc.cluster.local on 172.20.0.10:53: no such host"
level=warn ts=2024-08-21T02:16:51.420261567Z caller=resolver.go:134 index-store=tsdb-2024-04-01 msg="IP address lookup yielded no results. No host found or no addresses found" host=loki-backend-headless.grafana.svc.cluster.local
level=error ts=2024-08-21T02:16:56.419255193Z caller=resolver.go:87 index-store=tsdb-2024-04-01 msg="failed to lookup IP addresses" host=loki-backend-headless.grafana.svc.cluster.local err="lookup loki-backend-headless.grafana.svc.cluster.local on 172.20.0.10:53: no such host"
level=warn ts=2024-08-21T02:16:56.419286921Z caller=resolver.go:134 index-store=tsdb-2024-04-01 msg="IP address lookup yielded no results. No host found or no addresses found" host=loki-backend-headless.grafana.svc.cluster.local
level=error ts=2024-08-21T02:17:01.419300883Z caller=resolver.go:87 index-store=tsdb-2024-04-01 msg="failed to lookup IP addresses" host=loki-backend-headless.grafana.svc.cluster.local err="lookup loki-backend-headless.grafana.svc.cluster.local on 172.20.0.10:53: no such host"
level=warn ts=2024-08-21T02:17:01.419328957Z caller=resolver.go:134 index-store=tsdb-2024-04-01 msg="IP address lookup yielded no results. No host found or no addresses found" host=loki-backend-headless.grafana.svc.cluster.local
level=error ts=2024-08-21T02:17:06.42047791Z caller=resolver.go:87 index-store=tsdb-2024-04-01 msg="failed to lookup IP addresses" host=loki-backend-headless.grafana.svc.cluster.local err="lookup loki-backend-headless.grafana.svc.cluster.local on 172.20.0.10:53: no such host"
level=warn ts=2024-08-21T02:17:06.420504075Z caller=resolver.go:134 index-store=tsdb-2024-04-01 msg="IP address lookup yielded no results. No host found or no addresses found" host=loki-backend-headless.grafana.svc.cluster.local
level=error ts=2024-08-21T02:17:11.41913895Z caller=resolver.go:87 index-store=tsdb-2024-04-01 msg="failed to lookup IP addresses" host=loki-backend-headless.grafana.svc.cluster.local err="lookup loki-backend-headless.grafana.svc.cluster.local on 172.20.0.10:53: no such host"
level=warn ts=2024-08-21T02:17:11.419166877Z caller=resolver.go:134 index-store=tsdb-2024-04-01 msg="IP address lookup yielded no results. No host found or no addresses found" host=loki-backend-headless.grafana.svc.cluster.local
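
Since the lookups fail from Loki's resolver while the name otherwise seems valid, a few checks can help rule out search-path and endpoint-readiness effects (a sketch; pod, service, and namespace names are taken from this report, adjust them to your install):

```shell
# Inspect the pod's DNS configuration (search domains, ndots)
kubectl exec -n grafana loki-read-86f6767445-l6bdf -- cat /etc/resolv.conf

# Query the fully-qualified name with a trailing dot to bypass
# search-path expansion entirely
kubectl exec -n grafana loki-read-86f6767445-l6bdf -- \
  nslookup loki-backend-headless.grafana.svc.cluster.local.

# Headless services only publish DNS records for endpoints that are Ready;
# confirm the backend pod is actually listed
kubectl get endpoints -n grafana loki-backend-headless
```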

However, the service exists and has endpoints:

k describe svc loki-backend-headless
Name:              loki-backend-headless
Namespace:         grafana
Labels:            app.kubernetes.io/component=backend
                   app.kubernetes.io/instance=grafana-loki
                   app.kubernetes.io/name=loki
                   argocd.argoproj.io/instance=grafana-loki
                   prometheus.io/service-monitor=false
                   variant=headless
Annotations:       <none>
Selector:          app.kubernetes.io/component=backend,app.kubernetes.io/instance=grafana-loki,app.kubernetes.io/name=loki
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Port:              http-metrics  3100/TCP
TargetPort:        http-metrics/TCP
Endpoints:         10.11.1.106:3100
Port:              grpc  9095/TCP
TargetPort:        grpc/TCP
Endpoints:         10.11.1.106:9095
Session Affinity:  None
Events:            <none>

Furthermore, DNS resolution appears to be working as expected from the loki-read pod:

k exec loki-read-86f6767445-l6bdf -- /bin/sh -c 'nslookup loki-backend-headless.grafana.svc.cluster.local'
Server:     172.20.0.10
Address:    172.20.0.10:53

Name:   loki-backend-headless.grafana.svc.cluster.local
Address: 10.11.1.106

This also matches the IP of my backend pod:

k describe pod loki-backend-0 | grep 'IP'
IP:               10.11.1.106
IPs:
  IP:           10.11.1.106
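
One way to reconcile the earlier failed lookups with a currently successful nslookup is a backend pod that was restarting or not yet Ready at those timestamps, since a headless service drops the record while the pod is unready. A hedged way to check (names are from this report):

```shell
# Restart count for the backend pod
kubectl get pod -n grafana loki-backend-0 \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'

# Recent events (readiness probe failures, restarts) around the log timestamps
kubectl get events -n grafana \
  --field-selector involvedObject.name=loki-backend-0
```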

The logs on the backend pod look normal, with no errors. I'm not sure what other steps I should take to debug or identify the issue. I'm able to connect from Grafana to the Loki instance through my gateway, with credentials and an arbitrary X-Scope-OrgID header, but no data or labels ever load. I also tried querying the API directly for labels to see if there is any particular issue. The request just hangs and generates this log on the reader:

level=info ts=2024-08-21T02:39:30.290268674Z caller=roundtrip.go:430 org_id=test traceID=33d5314efa98d4c6 msg="executing query" type=labels label= length=1h0m0s query=
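
To take the gateway and auth out of the picture, the same labels query can be issued against the read pod directly (a sketch; it assumes the image ships busybox wget, and reuses the pod name and org ID from this report):

```shell
# Hit the read target's HTTP API on its own port, bypassing the gateway
kubectl exec -n grafana loki-read-86f6767445-l6bdf -- \
  wget -qO- --header 'X-Scope-OrgID: test' \
  http://localhost:3100/loki/api/v1/labels
```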

Cenness commented 2 months ago

I'm running in microservices mode and had the same error logs. I changed my config from

server_address: dns+loki-index-gateway-headless.logging.svc.cluster.local:9095

to

server_address: dns:///loki-index-gateway.logging.svc.cluster.local:9095

and that error disappeared. But queries now take longer, sometimes 2x as long as before.

Source:
https://github.com/grafana/loki/blob/48e4a346b87bfab060ebcfcec008655341981ece/docs/sources/operations/storage/tsdb.md?plain=1#L52

Edit: I changed it to

server_address: loki-index-gateway.logging.svc.cluster.local:9095

Query request duration returned to expected values.
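
For reference, this setting lives in the index-gateway client section of Loki's storage config; the `dns:///` form is gRPC's standard DNS resolver syntax, while the `dns+` prefix is the DNS service-discovery style Loki uses for fields like memberlist `join_members`, which is likely why it misbehaves here. A sketch of where the final value goes (service and namespace names follow the example above; verify the key path against your Loki version):

```yaml
storage_config:
  tsdb_shipper:
    index_gateway_client:
      server_address: loki-index-gateway.logging.svc.cluster.local:9095
```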