grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

[Loki] [Helm Chart] Loki read pod not able to talk with coredns when network policy is enabled #11650

Open BennyGoo opened 10 months ago

BennyGoo commented 10 months ago

Describe the bug

The Loki read pod log shows:

level=info ts=2024-01-11T10:17:56.155840621Z caller=loki.go:505 msg="Loki started"
level=warn ts=2024-01-11T10:17:56.165650955Z caller=dns_resolver.go:225 msg="failed DNS A record lookup" err="lookup query-scheduler-discovery.moni-loki-agpl.svc.cluster.local. on 100.108.0.10:53: read udp 100.104.2.84:44871->100.108.0.10:53: i/o timeout"
level=info ts=2024-01-11T10:17:57.554422974Z caller=frontend.go:316 msg="not ready: number of schedulers this worker is connected to is 0"
level=warn ts=2024-01-11T10:18:06.157005143Z caller=dns_resolver.go:225 msg="failed DNS A record lookup" err="lookup query-scheduler-discovery.moni-loki-agpl.svc.cluster.local. on 100.108.0.10:53: read udp 100.104.2.84:33803->100.108.0.10:53: i/o timeout"
level=info ts=2024-01-11T10:18:07.555110388Z caller=frontend.go:316 msg="not ready: number of schedulers this worker is connected to is 0"

It looks like the Loki read pod is hitting a network issue. If I disable the network policy with networkPolicy.enabled=false, the issue doesn't occur.
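
For anyone debugging this: the warnings above are UDP lookups to CoreDNS at 100.108.0.10:53 timing out, i.e. DNS egress from the read pod is being dropped. As a rough sketch of the egress rule that has to exist somewhere for those lookups to succeed with a plain Kubernetes NetworkPolicy (the policy name, namespace, and pod labels below are assumptions, not values taken from the chart):

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-loki-dns          # hypothetical name
      namespace: moni-loki-agpl     # namespace taken from the logs above
    spec:
      podSelector:
        matchLabels:
          app.kubernetes.io/name: loki   # assumed label on the Loki pods
      policyTypes:
        - Egress
      egress:
        - to:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: kube-system
              podSelector:
                matchLabels:
                  k8s-app: kube-dns
          ports:
            - protocol: UDP
              port: 53
            - protocol: TCP
              port: 53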

P.S.: another minor issue: after installing with Helm, the list of installed components always says grafana-agent-operator is installed, even though I actually disabled the Grafana Agent in the values (see below).

To Reproduce

Steps to reproduce the behavior: install the chart with the following Helm values:

backend:
  autoscaling:
    enabled: true
  persistence:
    size: 1Gi
    volumeClaimsEnabled: true
  replicas: 3
gateway:
  enabled: false
global:
  image:
    registry: public.int.repositories.******.******
ingress:
  enabled: false
loki:
  memcached:
    chunk_cache:
      enabled: true
    results_cache:
      enabled: true
  storage:
    bucketNames:
      admin: loki-devops-admin
      chunks: loki-devops-chunks
      ruler: loki-devops-ruler
    s3:
      accessKeyId: **************
      endpoint: https://objectstore-3.**********
      region: *****
      secretAccessKey:************
    type: s3
memberlist:
  service:
    publishNotReadyAddresses: true
monitoring:
  lokiCanary:
    enabled: true
  selfMonitoring:
    grafanaAgent:
      installOperator: false
networkPolicy:
  enabled: true
  externalStorage:
    cidrs: []
    ports: []
read:
  autoscaling:
    enabled: true
    maxReplicas: 6
    minReplicas: 2
  replicas: 3
write:
  autoscaling:
    enabled: true
    maxReplicas: 6
    minReplicas: 2
  persistence:
    size: 1Gi
    volumeClaimsEnabled: true
  replicas: 3
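
For reference, an install along these lines reproduces the setup; the release name and values file name below are assumptions, and the namespace is taken from the DNS logs above:

    helm upgrade --install loki grafana/loki \
      --namespace moni-loki-agpl \
      --create-namespace \
      -f values.yaml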

Expected behavior

With networkPolicy.enabled=true, the read pod should be able to resolve query-scheduler-discovery.moni-loki-agpl.svc.cluster.local via CoreDNS and become ready.

camaeel commented 8 months ago

It seems that once I added this to the CiliumNetworkPolicy loki-egress-dns:

    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns

it started working.
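
For context, a minimal sketch of a full policy around that rule, assuming Cilium's cilium.io/v2 API; the policy name, namespace, and endpoint selector labels are illustrative, not taken from the chart:

    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: loki-egress-dns
      namespace: moni-loki-agpl       # namespace taken from the logs above
    spec:
      endpointSelector:
        matchLabels:
          app.kubernetes.io/name: loki   # assumed label on the Loki pods
      egress:
        - toEndpoints:
            - matchLabels:
                io.kubernetes.pod.namespace: kube-system
                k8s-app: kube-dns
          toPorts:
            - ports:
                - port: "53"
                  protocol: ANY        # allow DNS over both UDP and TCP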