rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Prometheus operator installation makes DNS service unreachable #2173

Closed sylvainOL closed 2 years ago

sylvainOL commented 2 years ago

Environmental Info:

RKE2 Version:

rke2 version v1.22.3+rke2r1 (b426ae1eda82b9133a9c22957531c873b27e93f1)
go version go1.16.7b7

Node(s) CPU architecture, OS, and Version:

Same for all nodes (also tested with CentOS 8 and Ubuntu 20.04):

Linux control01-onap-gating-1 5.10.0-9-cloud-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux

Cluster Configuration: 7 servers (on VMs)

Describe the bug:

When rke2 is deployed without any hardening, installing prometheus-operator causes DNS resolution (internal and external) to stop working. With hardening (see the bottom of this report for what is meant by setting/unsetting hardening), everything works fine! Using kubespray and prometheus-operator also works fine!

Steps To Reproduce:

We're using Ansible playbooks to install rke2 and Kubernetes services on top of it:

rke2: https://gitlab.com/Orange-OpenSource/lfn/infra/rke2_automatic_installation_collection
services: https://gitlab.com/Orange-OpenSource/lfn/infra/kubernetes_collection

These are (supposed to be ;)) reproducible builds.

For rke2, here are example configuration files:

server

token: XXXX
node-name: control01-onap-gating-1
cni: cilium # also tested with canal
kube-controller-manager-arg: # for scraping data via prometheus
  - "address=0.0.0.0"
  - "bind-address=0.0.0.0"
kube-scheduler-arg:  # for scraping data via prometheus
  - "address=0.0.0.0"
  - "bind-address=0.0.0.0"
etcd-expose-metrics: true  # for scraping data via prometheus
tls-san:
  - 10.4.11.224 # "public" IP
# Disallow workload on control node when we have at least one agent node
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"

agent

---
server: https://10.253.0.212:9345
token: XXXX
node-name: compute03-onap-gating-1

For prometheus, here's the override file used:

prometheus

---
#FIXME: wait until this issue is fixed
# https://github.com/rancher/rke2/issues/1100 and delete these 3 lines
defaultRules:
  rules:
    etcd: false

alertmanager:
  enabled: true
  ingress:
    enabled: true
    hosts:
      - alertmanager.api.simpledemo.onap.org
    paths:
      - "/"
    pathType: "Prefix"
  config:
    global:
      resolve_timeout: 5m

    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: email
    receivers:
      - name: "null"
      - name: email
        email_configs:
          - to: firstname1.name1@mail.com
            smarthost: :25
            from: prometheus@mail.com
            auth_username: user
            auth_password: password
            require_tls: False
            send_resolved: True

  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi

grafana:
  ingress:
    enabled: true
    hosts:
      - grafana.api.simpledemo.onap.org
    paths:
      - "/"
    pathType: "Prefix"

kubeApiServer:
  tlsConfig:
    insecureSkipVerify: true

#FIXME: wait until this issue is fixed
# https://github.com/rancher/rke2/issues/1100 and delete these 4 lines
kubeEtcd:
  enabled: false
  serviceMonitor:
    enabled: false

#FIXME: wait until this issue is fixed
# https://github.com/rancher/rke2/issues/1100 and uncomment these 4 lines
#kubeEtcd:
#  service:
#    port: 2381
#    targetPort: 2381

prometheus:
  ingress:
    enabled: true
    hosts:
      - prometheus.api.simpledemo.onap.org
    paths:
      - "/"
    pathType: "Prefix"
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 25Gi

    additionalScrapeConfigs:
      - job_name: 'kubernetes-service-endpoints'

        kubernetes_sd_configs:
          - role: endpoints

        relabel_configs:
          - source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            action: keep
            regex: true
          - source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels:
              - __meta_kubernetes_namespace
            action: replace
            target_label: kubernetes_namespace
          - source_labels:
              - __meta_kubernetes_service_name
            action: replace
            target_label: kubernetes_name
          - source_labels:
              - __meta_kubernetes_pod_node_name
            action: replace
            target_label: kubernetes_node

      - job_name: 'kubernetes-service-endpoints-slow'

        scrape_interval: 5m
        scrape_timeout: 30s

        kubernetes_sd_configs:
          - role: endpoints

        relabel_configs:
          - source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
            action: keep
            regex: true
          - source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels:
              - __meta_kubernetes_namespace
            action: replace
            target_label: kubernetes_namespace
          - source_labels:
              - __meta_kubernetes_service_name
            action: replace
            target_label: kubernetes_name
          - source_labels:
              - __meta_kubernetes_pod_node_name
            action: replace
            target_label: kubernetes_node

      - job_name: 'prometheus-pushgateway'
        honor_labels: true

        kubernetes_sd_configs:
          - role: service

        relabel_configs:
          - source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_probe
            action: keep
            regex: pushgateway

      - job_name: 'kubernetes-services'

        metrics_path: /probe
        params:
          module: [http_2xx]

        kubernetes_sd_configs:
          - role: service

        relabel_configs:
          - source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_probe
            action: keep
            regex: true
          - source_labels:
              - __address__
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox
          - source_labels:
              - __param_target
            target_label: instance
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels:
              - __meta_kubernetes_namespace
            target_label: kubernetes_namespace
          - source_labels:
              - __meta_kubernetes_service_name
            target_label: kubernetes_name

      - job_name: 'kubernetes-pods'

        kubernetes_sd_configs:
          - role: pod

        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            action: keep
            regex: true
          - source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scheme
            action: replace
            regex: (https?)
            target_label: __scheme__
          - source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels:
              - __meta_kubernetes_namespace
            action: replace
            target_label: kubernetes_namespace
          - source_labels:
              - __meta_kubernetes_pod_name
            action: replace
            target_label: kubernetes_pod_name
          - source_labels:
              - __meta_kubernetes_pod_phase
            regex: Pending|Succeeded|Failed
            action: drop

      - job_name: 'kubernetes-pods-slow'

        scrape_interval: 5m
        scrape_timeout: 30s

        kubernetes_sd_configs:
          - role: pod

        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
            action: keep
            regex: true
          - source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scheme
            action: replace
            regex: (https?)
            target_label: __scheme__
          - source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels:
              - __meta_kubernetes_namespace
            action: replace
            target_label: kubernetes_namespace
          - source_labels:
              - __meta_kubernetes_pod_name
            action: replace
            target_label: kubernetes_pod_name
          - source_labels:
              - __meta_kubernetes_pod_phase
            regex: Pending|Succeeded|Failed
            action: drop
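
The `__address__` rewrite used by several of the jobs above (regex `([^:]+)(?::\d+)?;(\d+)`, replacement `$1:$2`) joins the discovered address with the port taken from the `prometheus.io/port` annotation. A minimal sketch of the same substitution with `sed` (note: sed's ERE has no non-capturing groups, so the optional existing port becomes capture group 2 and the annotation port moves to `\3`):

```shell
# Prometheus concatenates the source labels with ';' before matching:
#   __address__ ; __meta_..._annotation_prometheus_io_port
# The rule drops any existing port and appends the annotated one.

relabel() {
  printf '%s\n' "$1" | sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
}

relabel '10.42.0.5;9100'       # address without a port -> 10.42.0.5:9100
relabel '10.42.0.5:8080;9100'  # existing port replaced -> 10.42.0.5:9100
```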

Scenarios tested:

All servers are in the same (OpenStack) network.

We tried with the following OS:

We tried with the following CNIs:

We tried the following versions:

We tried with and without nodelocaldns.

Expected behavior:

A working Kubernetes cluster with Prometheus monitoring enabled.

Actual behavior:

DNS resolution fails (DNS requests don't arrive at the coredns pods) once the prometheus operator is installed (DNS resolution is fine when it is not).

Additional context / logs:

By "without hardening", I mean without setting this:

profile: cis-1.5
protect-kernel-defaults: true
selinux: true
write-kubeconfig-mode: "0600"

Everything is reproducible, and I'm able to spin up an environment to help with debugging.

sylvainOL commented 2 years ago

I know this is a weird issue, and I'm reaching out for help troubleshooting more than anything else.

I looked at the iptables rules that were created, but nothing seemed unusual to me.
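
Since the report says queries never reach the coredns pods, a minimal troubleshooting sketch (the dnsutils image follows the upstream Kubernetes DNS-debugging guide; the exact coredns label used below is an assumption about the rke2 deployment):

```shell
# Run a throwaway pod with DNS tools (image from the k8s DNS debugging docs)
kubectl run dnsutils --rm -it \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  -- nslookup kubernetes.default

# If that times out, check that the kube-dns Service has endpoints at all
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get endpoints kube-dns

# Watch coredns logs while reproducing; complete silence while lookups fail
# suggests queries are dropped before reaching the pods
kubectl -n kube-system logs -l k8s-app=kube-dns -f
```

If the logs stay silent, a next step could be `kubectl get networkpolicy -A`, since monitoring charts can ship NetworkPolicy objects that affect pod-to-pod traffic.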

stale[bot] commented 2 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.