okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

When using ingressVIP, machineconfigs 00-worker and 00-master get a faulty NetworkManager script, that leads to DNS issues #298

Closed philipp1992 closed 3 years ago

philipp1992 commented 4 years ago

We are using the OKD 4.5 GA release with the vSphere cloud provider.

In our installer.yaml we have set ingressVIP and apiVIP.

With these installation parameters, the installer generates 00-worker and 00-master machineconfigs that include the following file:

/etc/NetworkManager/dispatcher.d/30-resolv-prepender

Content of this file:

#!/bin/bash
IFACE=$1
STATUS=$2
# If $DHCP6_FQDN_FQDN is not empty and is not localhost.localdomain
[[ -n "$DHCP6_FQDN_FQDN" && "$DHCP6_FQDN_FQDN" != "localhost.localdomain" && "$DHCP6_FQDN_FQDN" =~ "." ]] && hostnamectl set-hostname --static --transient $DHCP6_FQDN_FQDN
case "$STATUS" in
    up|down|dhcp4-change|dhcp6-change)
    logger -s "NM resolv-prepender triggered by ${1} ${2}."
    NAMESERVER_IP=$(/usr/bin/podman run --rm \
        --authfile /var/lib/kubelet/config.json \
        --net=host \
        quay.io/openshift/okd-content@sha256:3c7ec93feeb79561438ef453ed25d1e00d95f67f6429eb3d475d29e8a44318c1 \
        node-ip \
        show \
        "10.0.38.200" \
        "10.0.38.201")
    DOMAIN="**<our-cluster-id>.<our-base-domain>**"
    if [[ -n "$NAMESERVER_IP" ]]; then
        logger -s "NM resolv-prepender: Prepending 'nameserver $NAMESERVER_IP' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)"
        sed -e "/^search/d" \
            -e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script\nsearch $DOMAIN\nnameserver $NAMESERVER_IP" \
            /var/run/NetworkManager/resolv.conf > /etc/resolv.tmp
    else
        logger -s "Couldn't find a non-virtual IP, just updating resolv.conf"
        cp /var/run/NetworkManager/resolv.conf /etc/resolv.tmp
    fi
    mv -f /etc/resolv.tmp /etc/resolv.conf
    ;;
    *)
    ;;
esac
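To make the sed step above concrete, here is a hedged, standalone sketch of the transformation it applies to NetworkManager's resolv.conf. The sample input and values are illustrative, not taken from a real cluster, and GNU sed is assumed (the one-line `c` command with `\n` in the text is a GNU extension):

```shell
#!/bin/bash
# Illustrative stand-ins, not the cluster's real values.
DOMAIN="our-cluster-id.our-base-domain"
NAMESERVER_IP="10.0.38.200"

# A sample of what NetworkManager might write to
# /var/run/NetworkManager/resolv.conf.
input='# Generated by NetworkManager
search corp.example
nameserver 192.168.1.1'

# Same sed invocation as the prepender: drop any existing search lines,
# then replace the "Generated by" header with a new header, the cluster
# search domain, and the prepended nameserver.
result=$(printf '%s\n' "$input" | sed -e "/^search/d" \
  -e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script\nsearch $DOMAIN\nnameserver $NAMESERVER_IP")
printf '%s\n' "$result"
```

The upstream nameserver survives, but the cluster's own `search $DOMAIN` line is injected ahead of it, which is exactly the line the rest of this issue is about.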

This file will lead to NetworkManager creating the following /etc/resolv.conf on our FCOS machines:

search x x x x our-cluster-id.our-base-domain

This resolv.conf leads to pods resolving almost any DNS name to our ingressVIP, because the resolver appends the wildcard DNS domain to every hostname it looks up.

The $DOMAIN variable in the script should not include the cluster-id.base-domain, but only the DNS suffix, or nothing at all.
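A minimal sketch of why the extra search domain is harmful. This is a hypothetical simulation of the resolver's search-list expansion, not the actual glibc code, and the names are illustrative:

```shell
#!/bin/bash
# Hypothetical simulation of resolver search-list behaviour; the real
# expansion is performed by glibc according to resolv.conf, not by this
# script.
search_list="our-cluster-id.our-base-domain"
name="registry.example.com"   # any external name a pod looks up

# When a lookup for the literal name fails (or the name has fewer dots
# than ndots), the resolver retries with each search suffix appended.
# With a wildcard DNS record under the cluster domain pointing at the
# ingressVIP, such a suffixed candidate can resolve to the ingressVIP.
candidates=""
for suffix in $search_list; do
  candidates="${candidates}${name}.${suffix} "
done
echo "Candidates tried: $candidates"
```

So a lookup that should fail cleanly (or resolve externally) can instead hit the wildcard record and come back as the ingressVIP.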

philipp1992 commented 4 years ago

This does not happen if we don't set the ingressVIP and use an external load balancer instead.

vrutkovs commented 4 years ago

This prepender adds the CoreDNS entry; without it, .cluster.local lookups won't work on the host.

vrutkovs commented 4 years ago

It seems the CoreDNS config has been updated in the latest stable - https://github.com/openshift/machine-config-operator/commit/4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb - let's see if that has a fix.

EagleIJoe commented 4 years ago

This happened on openshift-install 4.5.0-0.okd-2020-08-12-020541 using oVirt IPI when I did not set the correct value for machineNetwork in the install-config ... I got this in the NetworkManager log:

NetworkManager[694]: [1597309259.6449] dispatcher: (4) /etc/NetworkManager/dispatcher.d/30-resolv-prepender failed (failed): Script '/etc/NetworkManager/dispatcher.d/30-resolv-prepender' exited with error status 125.
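Exit status 125 is what podman itself returns when it fails to run the container at all (here, plausibly, because the wrong machineNetwork left no working route to the registry), so the error comes from the `podman run` in the dispatcher script rather than from the node-ip tool. A hedged sketch of how that status surfaces, with a hypothetical `podman_stub` standing in for the failing `podman run`:

```shell
#!/bin/bash
# `podman_stub` is a hypothetical stand-in for the failing `podman run`
# call, so the behaviour can be shown without a container runtime.
podman_stub() { return 125; }

# Mirrors the dispatcher's command substitution: the exit status of the
# assignment is the exit status of the substituted command, so the
# script ends up failing with podman's status 125.
if NAMESERVER_IP=$(podman_stub); then
  echo "got nameserver: $NAMESERVER_IP"
else
  rc=$?
  echo "node-ip lookup failed with status $rc"
fi
```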

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/okd/issues/298#issuecomment-761909126):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.