sourcegraph / deploy

Sourcegraph Machine Images
Apache License 2.0

Fix: internal DNS failure #109

Closed jdpleiness closed 1 month ago

jdpleiness commented 1 month ago

Closes https://linear.app/sourcegraph/issue/REL-406/gcp-image-docs-do-not-produce-a-working-gcp-deployment
Closes https://linear.app/sourcegraph/issue/REL-388/dns-resolution-issue-for-private-endpoint-in-k3s-environment

This allows both the GCP and AWS images to resolve internal DNS endpoints in a VPC that are available via the VPC metadata endpoint 169.254.169.254.

This is needed because modern Linux systems run an internal resolver via systemd, and CoreDNS, the DNS service in our k3s cluster, cannot reach it. CoreDNS also does not work its way sequentially through the list of resolvers it gets from the host. Instead it does a "sticky round robin": given your-internal-dns-server and 8.8.8.8, CoreDNS will try one of them at random. If it doesn't get an error, it may keep using that one for a certain amount of time and never go back to try the other. CoreDNS also does not treat a domain not being found as an error, so from what I have seen in testing it may end up ignoring your private DNS server altogether.
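To make the failure mode concrete, here is a toy model of the behavior described above. It is not CoreDNS code: the upstream tables, the hostnames, and the `forward` and `count_resolved` helpers are all made up for illustration, and it only models the "pick one upstream, accept any non-error answer" part, not the stickiness.

```python
import random

# Hypothetical upstreams: each one maps the names it knows to an address.
INTERNAL = {"db.internal.example": "10.0.0.5"}   # stands in for the VPC resolver
PUBLIC = {"example.com": "192.0.2.10"}           # stands in for 8.8.8.8


def query(upstream, name):
    """Return an address, or "NXDOMAIN" when the upstream does not know the name."""
    return upstream.get(name, "NXDOMAIN")


def forward(name, upstreams, policy, rng):
    """Toy forwarder: choose one upstream by policy and return its answer.

    NXDOMAIN counts as a valid answer here, so no second upstream is tried.
    """
    if policy == "sequential":
        chosen = upstreams[0]           # always start with the first upstream
    elif policy == "random":
        chosen = rng.choice(upstreams)  # any upstream may be picked
    else:
        raise ValueError(f"unknown policy: {policy}")
    return query(chosen, name)


def count_resolved(name, upstreams, policy, attempts=1000, seed=0):
    """Count how many of `attempts` lookups return a real address."""
    rng = random.Random(seed)
    return sum(
        forward(name, upstreams, policy, rng) != "NXDOMAIN"
        for _ in range(attempts)
    )


if __name__ == "__main__":
    upstreams = [INTERNAL, PUBLIC]
    for policy in ("sequential", "random"):
        hits = count_resolved("db.internal.example", upstreams, policy)
        print(f"{policy}: {hits}/1000 lookups resolved")
```

In this model the sequential policy resolves the internal name every time, while the random policy fails whenever the public upstream is picked, because its "not found" answer is accepted as final.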

This should be overridable via an override entry such as the one shown below. However, I could not get it to actually take precedence after many attempts.

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  forward.override: |
    forward . 169.254.169.254 /etc/resolv.conf {
      policy sequential
    }

This PR instead modifies the CoreDNS manifest directly on every reboot to ensure the settings are applied correctly.
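The script that does this is not reproduced in this description. As a rough sketch of the kind of rewrite involved, assuming the Corefile in the manifest contains a plain single-line `forward . /etc/resolv.conf` directive, something like the following would do it. The `patch_corefile` helper and the sample Corefile are hypothetical and are not the code in this PR.

```python
import re

METADATA_RESOLVER = "169.254.169.254"

# Matches a single-line forward directive that has no option block yet,
# for example "    forward . /etc/resolv.conf".
FORWARD_LINE = re.compile(
    r"^(?P<indent>[ \t]*)forward[ \t]+\.[ \t]+(?P<targets>[^\n{]*?)[ \t]*$",
    re.MULTILINE,
)


def patch_corefile(corefile: str) -> str:
    """Put the metadata resolver first and force the sequential policy.

    Directives that already carry a "{ ... }" block are left untouched,
    so running this on every boot is safe.
    """

    def rewrite(match: re.Match) -> str:
        indent = match.group("indent")
        # Keep the existing upstreams, minus any copy of the metadata resolver.
        targets = [t for t in match.group("targets").split() if t != METADATA_RESOLVER]
        upstreams = " ".join([METADATA_RESOLVER] + targets)
        return "\n".join(
            [
                f"{indent}forward . {upstreams} {{",
                f"{indent}    policy sequential",
                f"{indent}}}",
            ]
        )

    return FORWARD_LINE.sub(rewrite, corefile)


if __name__ == "__main__":
    # Made-up Corefile for illustration only.
    sample = ".:53 {\n    errors\n    forward . /etc/resolv.conf\n    cache 30\n}\n"
    print(patch_corefile(sample))
```

Skipping directives that already have a block keeps the rewrite idempotent, which matters if it runs on every reboot.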

Testing

Tested manually (screenshot: CleanShot 2024-09-13 at 15 25 42).

Chickensoupwithrice commented 1 month ago

This is a fucking insane bug.

Really nice catch! :ship: it baby
